Commit Graph

11340 Commits

Author SHA1 Message Date
Jeff Roberson
791d9a6e5d - Use DPCPU for SCHED_STATS. This is somewhat awkward because the
offset of the stat is not known until link time so we must emit a
   function to call SYSCTL_ADD_PROC rather than using SYSCTL_PROC
   directly.
 - Eliminate the atomic from SCHED_STAT_INC now that it's using per-cpu
   variables.  Sched stats are always incremented while we're holding
   a spinlock so no further protection is required.

Reviewed by:	sam
2009-06-25 01:33:51 +00:00
Jeff Roberson
e6e06a8b47 - Add a sysctl_dpcpu_long to support long typed pcpu stats.
- Remove the #ifndef SMP case as the SMP code works on UP as well.

Reviewed by:	sam
2009-06-25 01:31:59 +00:00
Jamie Gritton
499650a033 Wrap a PR_VNET inside "#ifdef VIMAGE" since that the only place it applies.
bz wants the blame for this.

Noticed by:	rwatson
Approved by:	bz (mentor)
2009-06-24 22:06:56 +00:00
John Baldwin
3899e0dfe7 Regen. 2009-06-24 21:54:08 +00:00
Jamie Gritton
6bb795633c In case of prisons with their own network stack, permit
additional privileges as well as not restricting the type of
sockets a user can open.

Note: the VIMAGE/vnet fetaure of of jails is still considered
      experimental and cannot guarantee that privileged users
      can be kept imprisoned if enabled.

Reviewed by:	rwatson
Approved by:	bz (mentor)
2009-06-24 21:39:50 +00:00
John Baldwin
b648d4806b Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
  short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
  short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
  (this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
  removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
  int.  This removes the need for the shm_bsegsz member in struct
  shmid_kernel and should allow for complete support of SYSV SHM regions
  >= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
  short.
- The shm_internal member of struct shmid_ds is now gone.  The internal
  VM object pointer for SHM regions has been moved into struct
  shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
  now marked COMPAT7 and new versions of those system calls which support
  the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc.  The
  FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
  symbol versions has been added to libc.  Version tags are added to
  system calls by adding an appropriate __sym_compat() entry to
  src/lib/libc/incldue/compat.h. [1]

PR:		kern/16195 kern/113218 bin/129855
Reviewed by:	arch@, rwatson
Discussed with:	kan, kib [1]
2009-06-24 21:10:52 +00:00
John Baldwin
45f48220bb Deprecate the msgsys(), semsys(), and shmsys() system calls by moving
them under COMPAT_FREEBSD[4567].  Starting with FreeBSD 5.0 the SYSV IPC
API was implemented via direct system calls (e.g. msgctl(), msgget(), etc.)
rather than indirecting through the var-args *sys() system calls.  The
shmsys() system call was already effectively deprecated for all but
COMPAT_FREEBSD4 already as its implementation for the !COMPAT_FREEBSD4 case
was to simply invoke nosys().
2009-06-24 20:01:13 +00:00
Ulf Lilleengen
2dd43ecaec - Similar to the previous commit, but for CURRENT: Fix a bug where a FIFO vnode
use count was increased twice, but only decreased once.
2009-06-24 18:44:38 +00:00
Ulf Lilleengen
7083246d7c - Fix a bug where a FIFO vnode use count was increased twice, but only
decreased once.

MFC after:	1 week
2009-06-24 18:38:51 +00:00
Jamie Gritton
bf8c1f990d Fix a race in vi_if_move, where a vnet is used after the prison that
referred to it has been released.

Approved by:	bz (mentor)
2009-06-24 15:29:36 +00:00
John Baldwin
3c366f1f14 Add a new COMPAT7 flag for FreeBSD 7.x compatibility system calls. 2009-06-24 13:36:37 +00:00
John Baldwin
71361470b1 - Move syscall function argument structure types to be just above the
relevenat system call function.
- Whitespace fixes.
2009-06-24 13:35:38 +00:00
Robert Watson
36fecbf302 Add stack_print_short() and stack_print_short_ddb() interfaces to
stack(9), which generate a more compact rendition of a stack trace
via the kernel's printf.

MFC after:	1 week
2009-06-24 12:06:15 +00:00
Jeff Roberson
50c202c592 Implement a facility for dynamic per-cpu variables.
- Modules and kernel code alike may use DPCPU_DEFINE(),
   DPCPU_GET(), DPCPU_SET(), etc. akin to the statically defined
   PCPU_*.  Requires only one extra instruction more than PCPU_* and is
   virtually the same as __thread for builtin and much faster for shared
   objects.  DPCPU variables can be initialized when defined.
 - Modules are supported by relocating the module's per-cpu linker set
   over space reserved in the kernel.  Modules may fail to load if there
   is insufficient space available.
 - Track space available for modules with a one-off extent allocator.
   Free may block for memory to allocate space for an extent.

Reviewed by:    jhb, rwatson, kan, sam, grehan, marius, marcel, stas
2009-06-23 22:42:39 +00:00
Jeff Roberson
c76ee82799 - Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE
supports arbitrary numbers of cpus rather than being limited by
   cpumask_t to the number of bits in a long.
2009-06-23 22:12:37 +00:00
Ed Schouten
9801591468 Improve my last commit: use a separate condvar to serialize.
The advantage of using a separate condvar is that we can just use
cv_signal(9) instead of cv_broadcast(9). It makes no sense to wake up
multiple threads. It also makes the TTY code easier to understand.
t_dcdwait sounds totally unrelated.
2009-06-23 21:43:02 +00:00
Ed Schouten
2d41cf3a24 Use dcdwait to block threads to serialize writes.
I suspect the usage of bgwait causes a lot of spurious wakeups when
threads are blocked in the background, because they will be woken up
each time a write() call is performed.

Also wakeup dcdwait when the TTY is abandoned.
2009-06-23 21:33:26 +00:00
Konstantin Belousov
3364c323e6 Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with:	pho
Reviewed by:	alc
Approved by:	re (kensmith)
2009-06-23 20:45:22 +00:00
Jamie Gritton
b97457e2e6 Add a limit for child jails via the "children.cur" and "children.max"
parameters.  This replaces the simple "allow.jails" permission.

Approved by:	bz (mentor)
2009-06-23 20:35:51 +00:00
Bjoern A. Zeeb
5736e6fb9d After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.
2009-06-23 17:03:45 +00:00
Jamie Gritton
399645e1b9 Remove unnecessary/redundant includes.
Approved by:	bz (mentor)
2009-06-23 14:39:21 +00:00
Peter Holm
401679debe vn_open_cred() needs a non NULL ucred pointer
Reviewed by:	kib
2009-06-23 11:29:54 +00:00
Andre Oppermann
ef760e6ad2 Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
  the length of data to be moved into the uio compared to
  soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
  cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams.  Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets.  Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4:	r164817 //depot/user/andre/soreceive_stream/
2009-06-22 23:08:05 +00:00
Andre Oppermann
bc05b2f6fa Add m_mbuftouio() helper function to copy(out) an arbitrary
long mbuf chain into an arbitrary large uio in a single step.
It is a functional mirror image of m_uiotombuf().

This function is supposed to be used instead of hand rolled code
with the same purpose and to concentrate it into one place for
potential further optimization or hardware assistance.
2009-06-22 22:20:38 +00:00
Andre Oppermann
4ad1c464fe In sbappendstream_locked() demote all incoming packet mbufs (and
chains) to pure data mbufs using m_demote().  This removes the
packet header and all m_tag information as they are not meaningful
anymore on a stream socket where mbufs are linked through m->m_next.
Strictly speaking a packet header can be only ever valid on the first
mbuf in an m_next chain.

sbcompress() was doing this already when the mbuf chain layout lent
itself to it (e.g. header splitting or merge-append), just not
consistently.

This frees resources at socket buffer append time instead of at
sbdrop_internal() time after data has been read from the socket.

For MAC the per packet information has done its duty and during
socket buffer appending the policy of the socket itself takes over.
With the append the packet boundaries disappear naturally and with
it any context that was based on it.  None of the residual information
from mbuf headers in the socket buffer on stream sockets was looked at.
2009-06-22 21:46:40 +00:00
John Baldwin
e47a833f38 Regen. 2009-06-22 20:24:03 +00:00
John Baldwin
773fa740c9 Include definitions for the audit identifiers for compat system calls in
sysproto.h.  This makes it possible to use SYSCALL_MODULE() for compat
system calls that live in kernel modules.
2009-06-22 20:14:10 +00:00
John Baldwin
0b0fe06a5e Fix a typo in a comment. 2009-06-22 20:12:40 +00:00
Andre Oppermann
0cb6db1d21 Update m_demote:
- remove HT_HEADER test (MT_HEADER == MT_DATA for some time now)
- be more pedantic about m_nextpkt in other than first mbuf
- update m_flags to be retained
2009-06-22 19:35:39 +00:00
Konstantin Belousov
c808c9632d Add explicit struct ucred * argument for VOP_VPTOCNP, to be used by
vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.

This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.

Reviewed by:	rwatson
2009-06-21 19:21:01 +00:00
Konstantin Belousov
e0c161b89c Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson
2009-06-21 13:41:32 +00:00
Roman Divacky
f7490cfe01 In non-debugging mode make this define (void)0 instead of nothing. This
helps to catch bugs like the below with clang.

	if (cond);		<--- note the trailing ;
	   something();

Approved by:	ed (mentor)
Discussed on:	current@
2009-06-21 07:54:47 +00:00
Brooks Davis
7f92e57820 Change crsetgroups_locked() (called by crsetgroups()) to sort the
supplemental groups using insertion sort.  Use this property in
groupmember() to let us use a binary search instead of the previous
linear search.
2009-06-20 20:29:21 +00:00
Ed Schouten
f8f6146082 Improve nested jail awareness of devfs by handling credentials.
Now that we start to use credentials on character devices more often
(because of MPSAFE TTY), move the prison-checks that are in place in the
TTY code into devfs.

Instead of strictly comparing the prisons, use the more common
prison_check() function to compare credentials. This means that
pseudo-terminals are only visible in devfs by processes within the same
jail and parent jails.

Even though regular users in parent jails can now interact with
pseudo-terminals from child jails, this seems to be the right approach.
These processes are also capable of interacting with the jailed
processes anyway, through signals for example.

Reviewed by:	kib, rwatson (older version)
2009-06-20 14:50:32 +00:00
Kip Macy
5b204a113c define helper routines for deferred mbuf initialization 2009-06-19 21:14:39 +00:00
Brooks Davis
838d985825 Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
John Baldwin
afd9f91c63 Fix a deadlock in the getpeername() method for UNIX domain sockets.
Instead of locking the local unp followed by the remote unp, use the same
locking model as accept() and read lock the global link lock followed by
the remote unp while fetching the remote sockaddr.

Reported by:	Mel Flynn  mel.flynn of mailing.thruhere.net
Reviewed by:	rwatson
MFC after:	1 week
2009-06-18 20:56:22 +00:00
Alan Cox
c45c00344f Utilize the new function kmem_alloc_contig() to implement the UMA back-end
allocator for the jumbo frames zones.  This change has two benefits: (1) a
custom back-end deallocator is no longer required.  UMA's standard
deallocator suffices. (2) It eliminates a potentially confusing artifact
of using contigmalloc(): The malloc(9) statistics contain bogus information
about the usage of jumbo frames.  Specifically, the malloc(9) statistics
report all jumbo frames in use whereas the UMA zone statistics report the
"truth" about the number in use vs. the number free.
2009-06-18 17:59:04 +00:00
John Baldwin
4b7b144f63 Regen. 2009-06-17 19:53:47 +00:00
John Baldwin
21def99b51 - Add the ability to mix multiple flags seperated by pipe ('|') characters
in the type field of system call tables.  Specifically, one can now use
  the 'NO*' types as flags in addition to the 'COMPAT*' types.  For example,
  to tag 'COMPAT*' system calls as living in a KLD via NOSTD.  The COMPAT*
  type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
  makesyscalls.sh that return true if a requested flag is found in the
  type field ($3).  The flag() function checks all of the flags in the
  field, but type() only checks the first flag.  type() is meant to be
  used in the top-level "switch" statement and flag() should be used
  otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
  the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
  say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
2009-06-17 19:50:38 +00:00
John Baldwin
0ec0b41cf6 Remove the now-unused NOIMPL flag. It serves no useful purpose given the
existing UNIMPL and NOSTD types.
2009-06-17 18:46:14 +00:00
John Baldwin
f425839118 - NOSTD results in lkmressys being used instead of lkmssys.
- Mark nfsclnt as UNIMPL.  It should have been NOSTD instead of NOIMPL back
  when it lived in nfsclient.ko, but it was removed from that a long time
  ago.
2009-06-17 18:44:15 +00:00
Bjoern A. Zeeb
ebd8672cc3 Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.
2009-06-17 15:01:01 +00:00
Konstantin Belousov
f02c9d2858 Decrement state->ls_threads when vnode appeared to be doomed.
Reported and tested by:	pho
2009-06-17 12:43:04 +00:00
Attilio Rao
651175c9db Introduce support for adaptive spinning in lockmgr.
Actually, as it did receive few tuning, the support is disabled by
default, but it can opt-in with the option ADAPTIVE_LOCKMGRS.
Due to the nature of lockmgrs, adaptive spinning needs to be
selectively enabled for any interested lockmgr.
The support is bi-directional, or, in other ways, it will work in both
cases if the lock is held in read or write way.  In particular, the
read path is passible of further tunning using the sysctls
debug.lockmgr.retries and debug.lockmgr.loops .  Ideally, such sysctls
should be axed or compiled out before release.

Addictionally note that adaptive spinning doesn't cope well with
LK_SLEEPFAIL.  The reason is that many (and probabilly all) consumers
of LK_SLEEPFAIL are mainly interested in knowing if the interlock was
dropped or not in order to reacquire it and re-test initial conditions.
This directly interacts with adaptive spinning because lockmgr needs
to drop the interlock while spinning in order to avoid a deadlock
(further details in the comments inside the patch).

Final note: finding someone willing to help on tuning this with
relevant workloads would be either very important and appreciated.

Tested by:	jeff, pho
Requested by:	many
2009-06-17 01:55:42 +00:00
Konstantin Belousov
01ed174831 Do not use casts (int *)0 and (struct thread *)0 for the arguments of
vn_rdwr, use NULL.

Reviewed by:	jhb
MFC after:	1 week
2009-06-16 15:13:45 +00:00
Ed Schouten
eaaaf1906b Perform some more cleanups to in-kernel session handling.
The code that was in place in exit1() was mainly based on code from the
old TTY layer. The main reason behind this, was because at one moment I
ran a system that had two TTY layers in place at the same time. It is
now sufficient to do the following:

- Remove references from the session structure to the TTY vnode and the
  session leader.

- If we have a controlling TTY and the session used by the TTY is equal
  to our session, send the SIGHUP.

- If we have a vnode to the controlling TTY which has not been revoked,
  revoke it.

While there, change sys/kern/tty.c to use s_ttyp in the comparison
instead of s_ttyvp. It should not make any difference, because s_ttyvp
can only become null when the session leader already left, but it's
nicer to compare against the proper value.
2009-06-15 20:45:51 +00:00
John Baldwin
6653a58307 Regen. 2009-06-15 20:40:23 +00:00
John Baldwin
c4f16b69e1 Add a new 'void closefrom(int lowfd)' system call. When called, it closes
any open file descriptors >= 'lowfd'.  It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD.  One difference from other *BSD is that this closefrom() does not
fail with any errors.  In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR.  DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.

Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads.  As such, it is not
multithread safe.

Submitted by:	rwatson (initial version)
Reviewed by:	rwatson
MFC after:	2 weeks
2009-06-15 20:38:55 +00:00
Ed Schouten
9c373a81a3 Make tcsetsid(3) work on revoked TTYs.
Right now the only way to make tcsetsid(3)/TIOCSCTTY work, is by
ensuring the session leader is dead. This means that an application that
catches SIGHUPs and performs a sleep prevents us from assigning a new
session leader.

Change the code to make it work on revoked TTYs as well. This allows us
to change init(8) to make the shutdown script run in a more clean
environment.
2009-06-15 19:17:52 +00:00
Jamie Gritton
9ed47d01eb Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 19:01:53 +00:00
Jamie Gritton
679e13901c Manage vnets via the jail system. If a jail is given the boolean
parameter "vnet" when it is created, a new vnet instance will be created
along with the jail.  Networks interfaces can be moved between prisons
with an ioctl similar to the one that moves them between vimages.
For now vnets will co-exist under both jails and vimages, but soon
struct vimage will be going away.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 18:59:29 +00:00
Jamie Gritton
c1f192193d Rename the host-related prison fields to be the same as the host.*
parameters they represent, and the variables they replaced, instead of
abbreviated versions of them.

Approved by:	bz (mentor)
2009-06-13 15:39:12 +00:00
Bjoern A. Zeeb
aed2a06174 Remove the static from int hardlink_check_uid.
There is an external use in the opensolaris code.

I am not sure how this ever worked but I have seen two reports of:
	link_elf: symbol hardlink_check_uid undefined
lately.

Reported by:	Scott Ullrich (sullrich gmail.com), pfsense
Reported by:	Mister Olli (mister.olli googlemail.com)
2009-06-13 13:09:20 +00:00
Jamie Gritton
7455b100af Add counterparts to getcredhostname:
getcreddomainname, getcredhostuuid, getcredhostid

Suggested by:	rmacklem
Approved by:	bz
2009-06-13 00:12:02 +00:00
Ed Schouten
13ace80b16 Revert my previous change, because it reintroduces an old regression.
Because our rc scripts also open the /etc/ttyv* nodes, it revokes the
console, preventing startup messages from being displayed.

I really have to think about this. Maybe we should just give the console
its own TTY and let it build on top of other TTYs. I'm still not sure
what to do with input handling there.
2009-06-12 21:21:17 +00:00
Ed Schouten
4650ad4cc0 Prevent yet another staircase effect bug in the console device.
Even though I thought I fixed the staircase issue (and I was no longer
able to reproduce it), I got some reports of the issue still being
there. It turns out the staircase effect still occurred when
/dev/console was kept open while killing the getty on the same TTY
(ttyv0).

For some reason I can't figure out how the old TTY code dealt with that,
so I assume the issue has always been there. I only exposed it more by
merging consolectl with ttyv0, which means that the issue was present,
even on systems without a serial console.

I'm now marking the console device as being closed when closing the
regular TTY device node. This means that when the getty shuts down,
init(8) will open /dev/console, which means the termios attributes will
always be reset in this case.
2009-06-12 20:29:55 +00:00
Paul Saab
a03fb243dc Stop asserting on exclusive locks in fsync since it can now support
shared vnode locking on ZFS.

Reviewed by:	jhb
2009-06-11 17:06:45 +00:00
Andriy Gapon
e76f11f441 strict kobj signatures: linker_if fixes
in symtab_get method symtab parameter is made constant as this reflects
actual intention and usage of the method

Reviewed by:	imp, current@
Approved by:	jhb (mentor)
2009-06-11 17:05:45 +00:00
Konstantin Belousov
d8b0556c6d Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by:	jhb [1]
Noted by:	jhb [2]
Reviewed by:	jhb
Tested by:	rnoland
2009-06-10 20:59:32 +00:00
Konstantin Belousov
5dd6aaba88 Do not leak the state->ls_lock after VI_DOOMED check introduced
in the r192683.

Reported by:	pho
Submitted by:	jhb
2009-06-10 16:17:38 +00:00
Bjoern A. Zeeb
c03528b663 SCTP needs either IPv4 or IPv6 as lower layer[1].
So properly hide the already #ifdef SCTP code with
#if defined(INET) || defined(INET6) as well to get us
closer to a non-INET/INET6 kernel.

Discussed with:	tuexen [1]
2009-06-10 14:36:59 +00:00
Konstantin Belousov
d6da640860 Fix r193923 by noting that type of a_fp is struct file *, not int.
It was assumed that r193923 was trivial change that cannot be done
wrong.

MFC after:	2 weeks
2009-06-10 14:24:31 +00:00
Konstantin Belousov
e4d9bdc105 s/a_fdidx/a_fp/ for VOP_OPEN comments that inline struct vop_open_args
definition.

Discussed with:	bde
MFC after:	2 weeks
2009-06-10 14:09:05 +00:00
Colin Percival
9a1bde1808 Prevent integer overflow in direct pipe write code from circumventing
virtual-to-physical page lookups. [09:09]

Add missing permissions check for SIOCSIFINFO_IN6 ioctl. [09:10]

Fix buffer overflow in "autokey" negotiation in ntpd(8). [09:11]

Approved by:	so (cperciva)
Approved by:	re (not really, but SVN wants this...)
Security:	FreeBSD-SA-09:09.pipe
Security:	FreeBSD-SA-09:10.ipv6
Security:	FreeBSD-SA-09:11.ntpd
2009-06-10 10:31:11 +00:00
Warner Losh
72aee35323 We can actually remove devclass_find_driver. 2009-06-10 01:02:38 +00:00
Warner Losh
98266dcf99 As discussed on arch@, restire
devclass_{add,delete,find,quiesce}_driver.  They aren't needed or used
and complicate locking newbus.
2009-06-09 23:24:04 +00:00
Jamie Gritton
e92e0574f9 Fix some overflow errors: a signed allocation and an insufficiant
array size.

Reported by:	pho
Tested by:	pho
Approved by:	bz (mentor)
2009-06-09 22:09:29 +00:00
Edward Tomasz Napierala
c0d345beb1 Add part of NFSv4 ACL kernel support code that is required for the upcoming
libc changes to work.  Not connected to the kernel build yet; for now,
it will be compiled into libc.

Reviewed by:	rwatson
2009-06-09 19:51:22 +00:00
Alan Cox
cbe9534615 Eliminate an instance of VM_PROT_READ_IS_EXEC that I overlooked in r190705. 2009-06-09 17:18:41 +00:00
John Baldwin
4ef60d2686 Add support for multiple passes of the device tree during the boot-time
probe.  The current device order is unchanged.  This commit just adds the
infrastructure and ABI changes so that it is easier to merge later changes
into 8.x.
- Driver attachments now have an associated pass level.  Attachments are
  not allowed to probe or attach to drivers until the system-wide pass level
  is >= the attachment's pass level.  By default driver attachments use the
  "last" pass level (BUS_PASS_DEFAULT).  Driver's that wish to probe during
  an earlier pass use EARLY_DRIVER_MODULE() instead of DRIVER_MODULE() which
  accepts the pass level as an additional parameter.
- A new method BUS_NEW_PASS has been added to the bus interface.  This
  method is invoked when the system-wide pass level is changed to kick off
  a rescan of the device tree so that drivers that have just been made
  "eligible" can probe and attach.
- The bus_generic_new_pass() function provides a default implementation of
  BUS_NEW_PASS().  It first allows drivers that were just made eligible for
  this pass to identify new child devices.  Then it propogates the rescan to
  child devices that already have an attached driver by invoking their
  BUS_NEW_PASS() method.  It also reprobes devices without a driver.
- BUS_PROBE_NOMATCH() is only invoked for devices that do not have
  an attached driver after being scanned during the final pass.
- The bus_set_pass() function is used during boot to raise the pass level.
  Currently it is only called once during root_bus_configure() to raise
  the pass level to BUS_PASS_DEFAULT.  This has the effect of probing all
  devices in a single pass identical to previous behavior.

Reviewed by:	imp
Approved by:	re (kib)
2009-06-09 14:26:23 +00:00
Paul Saab
27bfb741a0 Simply shared vnode locking and extend it to also include fsync.
Also, in vop_write, no longer assert for exclusive locks on the
vnode.

Reviewed by:	jhb, kmacy, jeffr
2009-06-08 21:23:54 +00:00
Bjoern A. Zeeb
8d8bc0182e After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
2009-06-08 19:57:35 +00:00
Marko Zec
bc29160df3 Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state.  The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers.  Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels.  Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by:	bz, julian
Approved by:	rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
Robert Watson
324fb6be0f Move zombie-reaping code out of kern_wait() and into its own function,
proc_reap().
Reviewed by:	jhb
MFC after:	3 days
Sponsored by:	Google, Inc.
2009-06-08 15:26:09 +00:00
Konstantin Belousov
ce3fa4d5cf Do not dereference vp->v_rdev without holding any of dev_mtx or vnode
lock. Use code similar to devfs_fp_check(), but inlined to feet other
checks performed by ttyhook_register().

Reviewed by:	ed
2009-06-08 13:34:45 +00:00
Alan Cox
0a2e596a93 Eliminate unnecessary obfuscation when testing a page's valid bits. 2009-06-07 19:38:26 +00:00
Alan Cox
0a276edef9 Eliminate an unused variable from allocbuf().
Eliminate the unnecessary setting of page valid bits from a non-VMIO buffer
in vm_hold_load_pages().
2009-06-07 18:19:04 +00:00
Alan Cox
d1a6e42ddd If vm_pager_get_pages() returns VM_PAGER_OK, then there is no need to check
the page's valid bits.  The page is guaranteed to be fully valid.  (For the
record, this is documented in vm/vm_pager.h's comments.)
2009-06-06 20:13:14 +00:00
Dag-Erling Smørgrav
322ef7cc60 Eliminate trailing_slash, which was made redundant in r193028.
Remove a couple of 4-year-old "temporary" KASSERTs.
Improve comments.

MFC after:	1 week
2009-06-06 00:49:49 +00:00
Joe Marcus Clarke
8a4444049e Unlock the cache lock before returning when we run out of buffer space
trying to fill in the full path name.

Reported by:	David Naylor <naylor.b.david@gmail.com>
Approved by:	kib
2009-06-05 16:44:42 +00:00
Ed Schouten
8888f696d0 Remove clists from the kernel.
Clists were originally used by the TTY layer as a text buffer interface.
The advantage of clists were that it would allocate a small set of
additional buffers that could be shared between TTYs when needed. In
the modern days we can just allocate some more KBs of memory to keep the
TTYs satisfied. The global cfreelist also requires synchronisation,
which may not be useful when trying to improve scalability.

The MPSAFE TTY layer uses its own text buffers (ttyinq and ttyoutq). We
had a small amount of drivers in the tree that still uses clists, like
the old USB stack and some keyboard drivers. With the old USB stack gone
and the keyboard drivers changed to use a circular buffer, we can safely
remove clists from the kernel.
2009-06-05 15:31:38 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
af4656042e Add mac_framework.h include missed when MAC code was (presumably) copied
from another file.
2009-06-05 14:23:24 +00:00
Brian Somers
f121baaa45 If we're passed garbage in malloc_init(), panic() rather than expecting
a KASSERT to handle it.  People are likely to turn off INVARIANTS RSN
and loading an old module can cause garbage-in here.

I saw the issue with an older nvidia driver (x11/nvidia-driver) loading
into a new kernel - a crash wasn't seen 'till sysctl_kern_malloc_stats().
I was lucky that mtp->ks_shortdesc was NULL and not something horrible.

While I'm here, KASSERT that malloc_uninit() isn't passed something that's
not in kmemstatistics.

MFC after:	3 weeks
2009-06-05 09:16:52 +00:00
Paul Saab
983b43c5de When checking for shared writes, use the struct mount returned from
vn_start_write.

Reviewed by:	jhb
2009-06-04 16:50:03 +00:00
Paul Saab
a6d545d8ed Support shared vnode locks for write operations when the offset is
provided on filesystems that support it.  This really improves mysql
+ innodb performance on ZFS.

Reviewed by:	jhb, kmacy, jeffr
2009-06-04 16:18:07 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
Robert Watson
e499bd28dd Remove unneeded include.
MFC after:	3 days
2009-06-02 15:59:46 +00:00
Attilio Rao
f083018223 Handle lock recursion differenty by always checking against LO_RECURSABLE
instead the lock own flag itself.

Tested by:	pho
2009-06-02 13:03:35 +00:00
Alan Cox
3c33df624c Correct a boundary case error in the management of a page's dirty bits by
shm_dotruncate() and vnode_pager_setsize().  Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared.  Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.
2009-06-02 08:02:27 +00:00
Jeff Roberson
f4471727f3 - Use an acquire barrier to increment f_count in fget_unlocked and
remove the volatile cast.  Describe the reason in detail in a comment.

Discussed with:	bde, jhb
2009-06-02 06:55:32 +00:00
John Baldwin
64345f0b57 Add an extension to the character device interface that allows character
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
  added to cdevsw.  This function is called for each mmap() request.
  If it returns ENODEV, then the mmap() request will fall back to using
  the device's device pager object and d_mmap().  Otherwise, the method
  can return a VM object to satisfy this entire mmap() request via
  *object.  It can also modify the starting offset into this object via
  *foff.  This allows device drivers to use the file offset as a cookie
  to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
  mapping V_CHR vnodes.  This avoids duplicating all the cdev mmap
  handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02.  Older device drivers
  using D_VERSION_01 are still supported.

MFC after:	1 month
2009-06-01 21:32:52 +00:00
John Baldwin
74fb0ba732 Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
  locked.  It is not permissible to call soisconnected() with this lock
  held; however, so socket upcalls now return an integer value.  The two
  possible values are SU_OK and SU_ISCONNECTED.  If an upcall returns
  SU_ISCONNECTED, then the soisconnected() will be invoked on the
  socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls.  The
  API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
  the receive socket buffer automatically.  Note that a SO_SND upcall
  should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
  instead of calling soisconnected() directly.  They also no longer need
  to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
  state machine, but other accept filters no longer have any explicit
  knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
  while invoking soreceive() as a temporary band-aid.  The plan for
  the future is to add a new flag to allow soreceive() to be called with
  the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
  buffer locked.  Previously sowakeup() would drop the socket buffer
  lock only to call aio_swake() which immediately re-acquired the socket
  buffer lock for the duration of the function call.

Discussed with:	rwatson, rmacklem
2009-06-01 21:17:03 +00:00
John Baldwin
b36cfff75d Add a simple API to manage scatter/gather lists of phyiscal addresses.
Each list describes a logical memory object that is backed by one or more
physical address ranges.  To minimize locking, the sglist objects
themselves are immutable once they are shared.

These objects may be used in the future to facilitate I/O requests using
physically-addressed buffers.  For the immediate future I plan to use them
to implement a new type of VM object and pager.

Reviewed by:	jeff, scottl
MFC after:	1 month
2009-06-01 20:35:39 +00:00
Robert Watson
820c518125 Add a flags field to struct ucred, and export that via kinfo_proc,
consuming one of its spare fields.  The cr_flags field is currently
unused, but will be used for features, including capability mode and
pay-as-you-go audit.

Discussed with:	jhb, sson
2009-06-01 20:26:51 +00:00
Robert Watson
33dd50646e Regenerate generated syscall files following changes to struct sysent in
r193234.
2009-06-01 16:14:38 +00:00
Robert Watson
46c10f272c Add 'sy_flags', a currently unused per-syscall entry flags field that will
see future use in 9-CURRENT and 8-STABLE for features such as the
capability-mode enable flag and pay-as-you-audit.

Discussed with:	jhb, sson
2009-06-01 16:13:06 +00:00
Robert Watson
d4b5cae49b Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processings its own
  workstream, or set of per-protocol queues.  Threads may be bound
  to specific CPUs, or allowed to migrate, based on a global policy.

  In the future it would be desirable to support topology-centric
  policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
  currently be one of:

  NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
    an implicit or explicit source (such as an interface or socket).

  NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
    as well as allowing protocols to provide a flow generation function
    for mbufs without flow identifers (m2flow).  Falls back on
    NETISR_POLICY_SOURCE if now flow ID is available.

  NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
    each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
  being used, as well as a mapping function from workstream to CPU ID,
  which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
  get and clear drop counters, which query data or apply changes across
  all workstreams.

- Add a more extensible netisr registration interface, in which
  protocols declare 'struct netisr_handler' structures for each
  registered NETISR_ type.  These include name, handler function,
  optional mbuf to flow ID function, optional mbuf to CPU ID function,
  queue limit, and ordering policy.  Padding is present to allow these
  to be expanded in the future.  If no queue limit is declared, then
  a default is used.

- Queue limits are now per-workstream, and raised from the previous
  IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
  with the exception of netnatm, default queue limits.  Most protocols
  register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
  NETISR_POLICY_FLOW, and will therefore take advantage of driver-
  generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
  the netisr, rather than having polling pretend to be two protocols.
  Provide two explicit hooks in the netisr worker for start and end
  events for runs: netisr_poll() and netisr_pollmore(), as well as a
  function, netisr_sched_poll(), to allow the polling code to schedule
  netisr execution.  DEVICE_POLLING still embeds single-netisr
  assumptions in its implementation, so for now if it is compiled into
  the kernel, a single and un-bound netisr thread is enforced
  regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking.  It will be enabled once further rmlock
optimization has taken place.  However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by:	bz
2009-06-01 10:41:38 +00:00
Alan Cox
6864a18c41 Eliminate a comment describing code that was deleted over eight years ago.
Move another comment to its proper place.  Fix a typo in a third comment.
2009-06-01 06:12:08 +00:00
Craig Rodrigues
0c349f0856 sys/boot/common.c
=================
Extend the loader to parse the root file system mount options in /etc/fstab,
and set a new loader variable vfs.root.mountfrom.options with these options.
The root mount options must be a comma-delimited string, as specified in
/etc/fstab.
Only set the vfs.root.mountfrom.options variable if it has not been
set in the environment.

sys/kern/vfs_mount.c
====================
When mounting the root file system, pass the mount options
specified in vfs.root.mountfrom.options, but filter out "rw" and "noro",
since the initial mount of the root file system must be done as "ro".
While we are here, try to add a few hints to the mountroot prompt
to give users and idea what might of gone wrong during mounting
of the root file system.

Reviewed by:	jhb (an earlier patch)
2009-06-01 01:02:30 +00:00
Alan Cox
1f17689408 nfs_write() can use the recently introduced vfs_bio_set_valid() instead of
vfs_bio_set_validclean(), thereby avoiding the page queues lock.

Garbage collect vfs_bio_set_validclean().  Nothing uses it any longer.
2009-05-31 20:18:02 +00:00
Konstantin Belousov
1358a7957d Unbreak the build. Add missed probes.
Reviewed by:	rwatson
Pointy hat to:	me
2009-05-31 20:16:06 +00:00
Konstantin Belousov
0449e6e1eb Eliminate code duplication in vn_fullpath1() around the cache lookups
and calls to vn_vptocnp() by moving more of the common code to
vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that
cache is locked around the call.

Do not track buffer position by both the pointer and offset, use only
buflen to record the start of the free space.

Export vn_vptocnp() for external consumers as a wrapper around
vn_vptocnp_locked() that locks the cache and handles hold counts.

Tested by:	pho
2009-05-31 14:57:43 +00:00
Dmitry Chagin
4202e1be20 Split native socketpair() syscall onto kern_socketpair() which should
be used by kernel consumers and socketpair() itself.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-31 12:12:38 +00:00
Marko Zec
feb08d06b9 Introduce an interm userland-kernel API for creating vnets and
assigning ifnets from one vnet to another.  Deletion of vnets is not
yet supported.

The interface is implemented as an ioctl extension so that no syscalls
had to be introduced.  This should be acceptable given that the new
interface will be used for a short / interim period only, until the
new jail management framwork gains the capability of managing vnets.
This method for managing vimages / vnets has been in use for the past
7 years without any observable issues.

The userland tool to be used in conjunction with the interim API can be
found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and
will most probably never get commited to svn.

While here, bump copyright notices in kern_vimage.c and vimage.h to
cover work done in year 2009.

Approved by:	julian (mentor)
Discussed with:	bz, rwatson
2009-05-31 12:10:04 +00:00
Nathan Whitehorn
f436f17508 Provide a new CPU device driver ivar to report the nominal speed of the
CPU, if available. This is meant to solve the issue of cpufreq misreporting
speeds on CPUs that boot in a reduced power mode and have only relative
speed control.
2009-05-31 08:59:15 +00:00
Attilio Rao
faef64cc39 Remove the now invalid (and possibly unused) debug.mpsafevfs
sysctl/tunable.

Reviewed by:	emaste
Sponsored by:	Sandvine Incorporated
2009-05-30 23:52:23 +00:00
Edward Tomasz Napierala
c97fcdba57 Add VOP_ACCESSX, which can be used to query for newly added V*
permissions, such as VWRITE_ACL.  For a filsystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.

Reviewed by:	rwatson@
2009-05-30 13:59:05 +00:00
Jamie Gritton
76ca6f88da Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex.  Jails may
have their own host information, or they may inherit it from the
parent/system.  The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL.  The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by:	bz (mentor)
2009-05-29 21:27:12 +00:00
Alan Cox
623469c996 Modify vm_hold_load_pages() to allocate pages using VM_ALLOC_NOOBJ rather
than using the kernel object.  This allows the elimination of page queues
locking from vm_hold_free_pages().
2009-05-29 18:35:51 +00:00
Robert Watson
1191932a65 Minor style tweak. 2009-05-29 14:25:51 +00:00
Robert Watson
82b7a39cb3 Since sched_pin() and sched_unpin() are already inlined, don't manually
inline in rmlocks.
2009-05-29 14:20:10 +00:00
John Baldwin
a571ad41ae Remove extra cpu_spinwait() invocations. This should really only be used
in tight spin loops, not in these edge cases where we restart a much
larger loop only a few times.

Reviewed by:	attilio
2009-05-29 14:03:34 +00:00
John Baldwin
fa29f0236f Tweak a few comments on adaptive spinning. 2009-05-29 13:56:34 +00:00
Robert Watson
1a109c1cb0 Make the rmlock(9) interface a bit more like the rwlock(9) interface:
- Add rm_init_flags() and accept extended options only for that variation.
- Add a flags space specifically for rm_init_flags(), rather than borrowing
  the lock_init() flag space.
- Define flag RM_RECURSE to use instead of LO_RECURSABLE.
- Define flag RM_NOWITNESS to allow an rmlock to be exempt from WITNESS
  checking; this wasn't possible previously as rm_init() always passed
  LO_WITNESS when initializing an rmlock's struct lock.
- Add RM_SYSINIT_FLAGS().
- Rename embedded mutex in rmlocks to make it more obvious what it is.
- Update consumers.
- Update man page.
2009-05-29 10:52:37 +00:00
Dag-Erling Smørgrav
32bf7cdf5a Let vfs_lookup() return ENOTDIR if the path has a trailing slash and
the last component is a symlink to something that isn't a directory.

We introduce a new namei flag, TRAILINGSLASH, which is set by lookup()
if the last component is followed by a slash.  The trailing slash is
then stripped, as before.  If the final component is a symlink,
lookup() will return to namei(), which will expand the symlink and
call lookup() with the new path.  When all symlinks have been
resolved, lookup() checks if the TRAILINGSLASH flag is set, and if it
is, and the vnode it ended up with is not a directory, it returns
ENOTDIR.

PR:		kern/21768
Submitted by:	Eygene Ryabinkin <rea-fbsd@codelabs.ru>
MFC after:	3 weeks
2009-05-29 10:02:44 +00:00
Dag-Erling Smørgrav
b181c8aac6 Fix misleading comment.
MFC after:	1 week
2009-05-29 09:52:13 +00:00
Attilio Rao
e31d083357 The patch for r193011 was partially rejected when applied, complete it. 2009-05-29 08:01:48 +00:00
Ed Schouten
c5e30cc02b Last minute TTY API change: remove mutex argument from tty_alloc().
I don't want people to override the mutex when allocating a TTY. It has
to be there, to keep drivers like syscons happy. So I'm creating a
tty_alloc_mutex() which can be used in those cases. tty_alloc_mutex()
should eventually be removed.

The advantage of this approach, is that we can just remove a function,
without breaking the regular API in the future.
2009-05-29 06:41:23 +00:00
Attilio Rao
1ae1c2a3bd Reverse the logic for ADAPTIVE_SX option and enable it by default.
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.

Additively implements adaptive spininning for sx held in shared mode.
The spinning limit can be handled through sysctls in order to be tuned
while the code doesn't reach the release, after which time they should
be dropped probabilly.

This change has made been necessary by recent benchmarks where it does
improve concurrency of workloads in presence of high contention
(ie. ZFS).

KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.

Requested by:	jeff, kmacy
Reviewed by:	jeff
Tested by:	pho
2009-05-29 01:49:27 +00:00
Zachary Loafman
cfeb7489c2 fail(9) support:
Add support for kernel fault injection using KFAIL_POINT_* macros and
fail_point_* infrastructure. Add example fail point in vfs_bio.c to
simulate VM buf pressure.

Approved by:        dfr (mentor)
2009-05-27 16:36:54 +00:00
Jamie Gritton
0304c73163 Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
Stacey Son
00a5db46de Add the ksyms(4) pseudo driver. The ksyms driver allows a process to
get a quick snapshot of the kernel's symbol table including the symbols
from any loaded modules (the symbols are all merged into one symbol
table).  Unlike like other implementations, this ksyms driver maps
memory in the process memory space to store the snapshot at the time
/dev/ksyms is opened.  It also checks to see if the process has already
a snapshot open and won't allow it to open /dev/ksyms it again until it
closes first.  This prevents kernel and process memory from being
exhausted.  Note that /dev/ksyms is used by the lockstat(1) command.

Reviewed by:	gallatin kib (freebsd-arch)
Approved by:	gnn (mentor)
2009-05-26 21:39:09 +00:00
Stacey Son
a5aedd68b4 Add the OpenSolaris dtrace lockstat provider. The lockstat provider
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.

Reviewed by:	attilio jhb jb
Approved by:	gnn (mentor)
2009-05-26 20:28:22 +00:00
Ed Schouten
8b0d29d858 Get rid of M_TEMP. 2009-05-26 18:33:36 +00:00
Pawel Jakub Dawidek
ce332f1e67 Add missing socket options. 2009-05-26 09:19:21 +00:00
Konstantin Belousov
8af54d4cfc The advisory lock may be activated or activated and removed during the
sleep waiting for conditions when the lock may be granted.
To prevent lf_setlock() from accessing possibly freed memory, add reference
counting to the struct lockf_entry. Bump refcount around the sleep.

Make lf_free_lock() return non-zero when structure was freed, and use
this after the sleep to return EINTR to the caller. The error code might
need a clarification, but we cannot return success to usermode, since
the lock is not owned anymore.

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:39:38 +00:00
Konstantin Belousov
9727972e2c In lf_purgelocks(), assert that state->ls_pending is empty after we
weeded out threads, and clean ls_active instead of ls_pending.

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:37:55 +00:00
Konstantin Belousov
b33d617717 In lf_advlockasync(), recheck for doomed vnode after the state->ls_lock
is acquired. In the lf_purgelocks(), assert that vnode is doomed and set
*statep to NULL before clearing ls_pending list. Otherwise, we allow for
the thread executing lf_advlockasync() to put new pending entry after
state->ls_lock is dropped in lf_purgelocks().

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:33:16 +00:00
Ed Schouten
47e6a3971f Block when initially opening a TTY multiple times.
In the original MPSAFE TTY code, I changed the behaviour by returning
EBUSY. I thought this made more sense, because it's basically a race to
see who gets the TTY first.

It turns out this is not a good change, because it also causes EBUSY to
be returned when another process is closing the TTY. This can happen
during startup, when /etc/rc (or one of its children) is still busy
draining its data and /sbin/init is attempting to open the TTY to spawn
a getty.

Reported by:	bz
Tested by:	bz
2009-05-24 12:32:03 +00:00
Konstantin Belousov
8aec91b5e8 Replace the while statement with the if for clarity. The loop body
cannot be executed more then once.

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:28:38 +00:00
Marko Zec
37f17770e0 V_irtualize the if_clone framework, thus allowing for clonable ifnets
to optionally have overlapping unit numbers if attached in different
vnets.

At this stage if_loop is the only clonable ifnet class that has been
extended to allow for such overlapping allocation of unit numbers, i.e.
in each vnet it is possible to have a lo0 interface.  Other clonable ifnet
classes remain to operate with traditional semantics, i.e. each instance
of a clonable ifnet will be assigned a globally unique unit number,
regardless in which vnet such an ifnet becomes instantiated.

While here, garbage collect unused _lo_list field in struct vnet_net,
as well as improve indentation for #defines in sys/net/vnet.h.

The layout of struct vnet_net has changed, therefore bump
__FreeBSD_version.

This change has no functional impact on nooptions VIMAGE kernel builds.

Reviewed by:	bz, brooks
Approved by:	julian (mentor)
2009-05-23 21:43:44 +00:00
Jamie Gritton
1e2a13e62a Delay an error message until the variable it uses gets initialized.
Found with:	Coverity Prevent(tm)
CID:		4316
Reported by:	trasz
Approved by:	bz (mentor)
2009-05-23 16:13:26 +00:00
Marko Zec
e0c14af9b3 Introduce the if_vmove() function, which will be used in the future
for reassigning ifnets from one vnet to another.

if_vmove() works by calling a restricted subset of actions normally
executed by if_detach() on an ifnet in the current vnet, and then
switches to the target vnet and executes an appropriate subset of
if_attach() actions there.

if_attach() and if_detach() have become wrapper functions around
if_attach_internal() and if_detach_internal(), where the later
variants have an additional argument, a flag indicating whether a
full attach or detach sequence is to be executed, or only a
restricted subset suitable for moving an ifnet from one vnet to
another.  Hence, if_vmove() will not call if_detach() and if_attach()
directly, but will call the if_detach_internal() and
if_attach_internal() variants instead, with the vmove flag set.

While here, staticize ifnet_setbyindex() since it is not referenced
from outside of sys/net/if.c.

Also rename ifccnt field in struct vimage to ifcnt, and do some minor
whitespace garbage collection where appropriate.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz, rwatson, brooks?
Approved by:	julian (mentor)
2009-05-22 22:09:00 +00:00
Edward Tomasz Napierala
ae1add4e55 Make 'struct acl' larger, as required to support NFSv4 ACLs. Provide
compatibility interfaces in both kernel and libc.

Reviewed by:	rwatson
2009-05-22 15:56:43 +00:00
Ed Schouten
52f542a8e4 Enable secure TTY input buffer flushing by default.
I'm leaving the sysctl there. If people really notice a slowdown, they
can revert to the old behaviour.

Discussed with:	kib
2009-05-21 16:48:06 +00:00
Ed Schouten
770c15f60f Add a new sysctl: kern.tty_inq_flush_secure.
When enabled all TTY input queue buffers are zeroed when flushing or
closing the TTY. Because TTY input queues are also used to store filled
in passwords, this may be an interesting switch to enable for security
minded people.
2009-05-21 16:19:54 +00:00
John Baldwin
d422da9a0a Only use the ABI compat shim for vfs.bufspace if the old buffer is smaller
than a long.

PR:		amd64/134786
Submitted by:	Emil Mikulic  emikulic| gmail
MFC after:	3 days
2009-05-21 16:18:45 +00:00
Attilio Rao
9995e57b01 Move the M_WAITOK flag in notify() into an M_NOWAIT one in order to match
the behaviour alredy present with the further malloc() call in
devctl_notify().
This fixes a bug in the CAM layer where the camisr handler finished to
call camperiphfree() (and subsequently destroy_dev() resulting in a new
dev notify) while the xpt lock is held.

PR:		kern/130330
Tested by:	Riccardo Torrini <riccardo dot torrini at esaote dot com>
2009-05-21 13:22:07 +00:00
John Baldwin
6ca33ea345 Set the umask in a new file descriptor table earlier in fdcopy() to remove
two lock operations.
2009-05-20 18:42:04 +00:00
John Baldwin
583220dc4c Remove an obsolete assertion. We always wake up all waiters when unlocking
a mutex and never set the lock cookie == MTX_CONTESTED.
2009-05-20 18:29:14 +00:00
John Baldwin
4ab9c8af92 Fix a typo. 2009-05-20 17:19:30 +00:00
Warner Losh
248343f9d1 We no longer need to use d_thread_t for portability here, switch to
struct thread *.
2009-05-20 16:58:16 +00:00
Kip Macy
126f8425c3 Add minimal ZFS lock hierarchy 2009-05-20 02:51:48 +00:00
Robert Watson
56a3c6d4a7 With SMPng, DEVICE_POLLING uses its own idle threads, rather than the
system idle loop, to run ether_poll(), so make ether_poll() static.

MFC after:	1 week
2009-05-19 19:21:25 +00:00
Andriy Gapon
51ca6cd6df sysctl_rman: report shared resources to devinfo
shared uses of a resource are recorded on a sub-list hanging off
a main resource object on a main resource list;
without this change a shared resource (e.g. irq) is reported only
once by devinfo -r/-u;
with this change the resource is reported for each driver that
allocates it (which is even more than what vmstat -i -a reports).

Approved by:	jhb (mentor)
2009-05-19 14:08:21 +00:00
Robert Watson
e84bcd8494 Binding interrupts to a CPU consists of two parts: setting up CPU
affinity for the interrupt thread, and requesting that underlying
hardware direct interrupts to the CPU.  For software interrupt
threads, implement a no-op interrupt event binder that returns
success, so that the interrupt management code will just set the
ithread's affinity and succeed.

Reviewed by:	jhb
MFC after:	1 week
2009-05-18 14:02:55 +00:00
Ed Schouten
c383c2211b Mark the clock sysctls as MPSAFE.
These sysctls don't need any form of locking. At least cp_times is used
by powerd very often, which means I get 50% less calls to non-MPSAFE
sysctls on my system. The other 50% is consumed by dev.cpu.0.freq, but
this seems to need Giant for Newbus.
2009-05-18 12:03:43 +00:00
Alan Cox
1be5269359 Several changes to vfs_bio_clrbuf():
Provide a more descriptive comment.

Eliminate dead code.  The page cannot possibly have PG_ZERO set.

Eliminate unnecessary blank lines.

Reviewed by:	tegge
2009-05-17 23:25:53 +00:00