Commit Graph

11128 Commits

Author SHA1 Message Date
attilio
e05714ba70 Reverse the logic for ADAPTIVE_SX option and enable it by default.
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.

Additionally, implement adaptive spinning for sx locks held in shared mode.
The spinning limit can be tuned through sysctls until the code reaches
a release, after which they should probably be dropped.

This change has been made necessary by recent benchmarks, where it
improves concurrency of workloads in the presence of high contention
(e.g. ZFS).

KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.

Requested by:	jeff, kmacy
Reviewed by:	jeff
Tested by:	pho
2009-05-29 01:49:27 +00:00
zml
b5e46da5a4 fail(9) support:
Add support for kernel fault injection using KFAIL_POINT_* macros and
fail_point_* infrastructure. Add example fail point in vfs_bio.c to
simulate VM buf pressure.

Approved by:        dfr (mentor)
2009-05-27 16:36:54 +00:00
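
The idea of a fail point can be sketched in user space. The following Python model is only an illustration and not the fail(9) API: the `FailPoint` class, its `set`/`clear`/`check` methods, the `buf_pressure` instance, and `alloc_buf()` are hypothetical names. It shows a named site that normally does nothing but can be armed to inject an error into an allocation path.

```python
import errno


class FailPoint:
    """Hypothetical stand-in for a named fault-injection site."""

    def __init__(self, name):
        self.name = name
        self.error = None  # None means the fail point is disarmed

    def set(self, error):
        """Arm the fail point so that later checks report `error`."""
        self.error = error

    def clear(self):
        """Disarm the fail point."""
        self.error = None

    def check(self):
        """Return the injected error code, or 0 when disarmed."""
        return self.error if self.error is not None else 0


# Hypothetical fail point simulating buffer pressure.
buf_pressure = FailPoint("buf_pressure")


def alloc_buf(size):
    """Allocate a buffer unless the fail point injects a failure."""
    err = buf_pressure.check()
    if err:
        raise OSError(err, "injected failure at " + buf_pressure.name)
    return bytearray(size)
```

With the fail point disarmed, `alloc_buf()` behaves normally; after `buf_pressure.set(errno.ENOMEM)` every call raises until `buf_pressure.clear()` is called.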
jamie
a013e0afcb Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
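
The hierarchy rules described above (dot-separated names, visibility to parents, and children never being less restricted than their parents) can be modelled with a small sketch. This is a toy Python model, not the kernel's data structures: the `Jail` class, its methods, and the permission names are assumptions made for illustration.

```python
class Jail:
    """Hypothetical model of one jail in a hierarchy."""

    def __init__(self, name, allowed, parent=None):
        if parent is not None and not set(allowed) <= parent.allowed:
            # A child may be restricted more than its parent, never less.
            raise PermissionError("child cannot exceed parent permissions")
        self.name = name
        self.parent = parent
        self.allowed = frozenset(allowed)

    def path(self):
        """Dot-separated name as seen from the root of this model."""
        parts = []
        jail = self
        while jail.parent is not None:  # the root contributes no component
            parts.append(jail.name)
            jail = jail.parent
        return ".".join(reversed(parts))

    def can_see(self, other):
        """True when `other` is a descendant of this jail."""
        jail = other.parent
        while jail is not None:
            if jail is self:
                return True
            jail = jail.parent
        return False


# Usage: a root jail with a child and a grandchild.
root = Jail("prison0", {"mount", "raw_sockets", "set_hostname"})
web = Jail("web", {"raw_sockets", "set_hostname"}, parent=root)
db = Jail("db", {"set_hostname"}, parent=web)
```

In this model `db.path()` yields `web.db`, both `root` and `web` can see `db`, and creating a child of `web` that asks for `mount` is rejected.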
sson
527dd57555 Add the ksyms(4) pseudo driver. The ksyms driver allows a process to
get a quick snapshot of the kernel's symbol table including the symbols
from any loaded modules (the symbols are all merged into one symbol
table).  Unlike other implementations, this ksyms driver maps
memory in the process memory space to store the snapshot at the time
/dev/ksyms is opened.  It also checks whether the process already has
a snapshot open and won't allow it to open /dev/ksyms again until it
closes the first one.  This prevents kernel and process memory from being
exhausted.  Note that /dev/ksyms is used by the lockstat(1) command.

Reviewed by:	gallatin kib (freebsd-arch)
Approved by:	gnn (mentor)
2009-05-26 21:39:09 +00:00
sson
c0d5996eb6 Add the OpenSolaris dtrace lockstat provider. The lockstat provider
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.

Reviewed by:	attilio jhb jb
Approved by:	gnn (mentor)
2009-05-26 20:28:22 +00:00
ed
a3ca9cc9a8 Get rid of M_TEMP. 2009-05-26 18:33:36 +00:00
pjd
d86eeb0dbe Add missing socket options. 2009-05-26 09:19:21 +00:00
kib
862b0fc4e3 The advisory lock may be activated, or activated and removed, during
the sleep waiting for conditions when the lock may be granted.
To prevent lf_setlock() from accessing possibly freed memory, add reference
counting to the struct lockf_entry. Bump refcount around the sleep.

Make lf_free_lock() return non-zero when structure was freed, and use
this after the sleep to return EINTR to the caller. The error code might
need a clarification, but we cannot return success to usermode, since
the lock is not owned anymore.

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:39:38 +00:00
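
The pattern described above, holding an extra reference across a sleep and using the return value of the free routine to detect that the entry went away, can be sketched as follows. This is a simplified Python model under stated assumptions: `LockEntry`, `free_lock()` and `set_lock()` are illustrative names, and the `sleep` callback stands in for whatever other threads do while the caller is blocked.

```python
import errno


class LockEntry:
    """Hypothetical reference-counted lock entry."""

    def __init__(self):
        self.refs = 1       # reference held by the lock lists
        self.freed = False


def free_lock(entry):
    """Drop one reference; return True when the entry was actually freed."""
    entry.refs -= 1
    if entry.refs == 0:
        entry.freed = True
        return True
    return False


def set_lock(entry, sleep):
    """Wait for the lock; `sleep` models the blocking wait."""
    entry.refs += 1         # keep the entry alive across the sleep
    sleep(entry)            # others may drop their reference meanwhile
    if free_lock(entry):    # ours was the last reference: lock was removed
        return errno.EINTR
    return 0
```

If nothing happens during the sleep, `set_lock()` returns 0 and the count is back to 1. If the entry's other reference is dropped during the sleep, the final `free_lock()` frees it and the caller gets `EINTR` instead of touching freed memory.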
kib
c54d127bf0 In lf_purgelocks(), assert that state->ls_pending is empty after we
weeded out threads, and clean ls_active instead of ls_pending.

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:37:55 +00:00
kib
fec8f2f845 In lf_advlockasync(), recheck for doomed vnode after the state->ls_lock
is acquired. In the lf_purgelocks(), assert that vnode is doomed and set
*statep to NULL before clearing ls_pending list. Otherwise, we allow for
the thread executing lf_advlockasync() to put new pending entry after
state->ls_lock is dropped in lf_purgelocks().

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:33:16 +00:00
ed
cb56810df2 Block when initially opening a TTY multiple times.
In the original MPSAFE TTY code, I changed the behaviour by returning
EBUSY. I thought this made more sense, because it's basically a race to
see who gets the TTY first.

It turns out this is not a good change, because it also causes EBUSY to
be returned when another process is closing the TTY. This can happen
during startup, when /etc/rc (or one of its children) is still busy
draining its data and /sbin/init is attempting to open the TTY to spawn
a getty.

Reported by:	bz
Tested by:	bz
2009-05-24 12:32:03 +00:00
kib
56e5e587c8 Replace the while statement with an if for clarity. The loop body
cannot be executed more than once.

Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2009-05-24 12:28:38 +00:00
zec
48f748dc29 V_irtualize the if_clone framework, thus allowing for clonable ifnets
to optionally have overlapping unit numbers if attached in different
vnets.

At this stage if_loop is the only clonable ifnet class that has been
extended to allow for such overlapping allocation of unit numbers, i.e.
in each vnet it is possible to have a lo0 interface.  Other clonable ifnet
classes continue to operate with traditional semantics, i.e. each instance
of a clonable ifnet will be assigned a globally unique unit number,
regardless of the vnet in which such an ifnet is instantiated.

While here, garbage collect unused _lo_list field in struct vnet_net,
as well as improve indentation for #defines in sys/net/vnet.h.

The layout of struct vnet_net has changed, therefore bump
__FreeBSD_version.

This change has no functional impact on nooptions VIMAGE kernel builds.

Reviewed by:	bz, brooks
Approved by:	julian (mentor)
2009-05-23 21:43:44 +00:00
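
The difference between per-vnet and global unit numbering can be illustrated with a short sketch. This Python model is an assumption-laden toy, not the if_clone code: `UnitAllocator`, `Cloner`, the `per_vnet` switch, and the `tun` class used in the usage note are hypothetical.

```python
class UnitAllocator:
    """Hands out the lowest free unit number."""

    def __init__(self):
        self.used = set()

    def alloc(self):
        unit = 0
        while unit in self.used:
            unit += 1
        self.used.add(unit)
        return unit


class Cloner:
    """Hypothetical clonable interface class."""

    def __init__(self, name, per_vnet):
        self.name = name
        self.per_vnet = per_vnet
        self.global_units = UnitAllocator()
        self.vnet_units = {}  # vnet identifier -> UnitAllocator

    def create(self, vnet):
        """Create an interface in `vnet` and return its name."""
        if self.per_vnet:
            # Each vnet gets its own unit number space.
            alloc = self.vnet_units.setdefault(vnet, UnitAllocator())
        else:
            # Traditional semantics: units are unique across all vnets.
            alloc = self.global_units
        return f"{self.name}{alloc.alloc()}"
```

With `Cloner("lo", per_vnet=True)`, creating an interface in two different vnets gives `lo0` in each. With `per_vnet=False`, the second vnet receives the next globally unique unit instead.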
jamie
e446b3e48f Delay an error message until the variable it uses gets initialized.
Found with:	Coverity Prevent(tm)
CID:		4316
Reported by:	trasz
Approved by:	bz (mentor)
2009-05-23 16:13:26 +00:00
zec
363a644ce6 Introduce the if_vmove() function, which will be used in the future
for reassigning ifnets from one vnet to another.

if_vmove() works by calling a restricted subset of actions normally
executed by if_detach() on an ifnet in the current vnet, and then
switches to the target vnet and executes an appropriate subset of
if_attach() actions there.

if_attach() and if_detach() have become wrapper functions around
if_attach_internal() and if_detach_internal(), where the latter
variants have an additional argument, a flag indicating whether a
full attach or detach sequence is to be executed, or only a
restricted subset suitable for moving an ifnet from one vnet to
another.  Hence, if_vmove() will not call if_detach() and if_attach()
directly, but will call the if_detach_internal() and
if_attach_internal() variants instead, with the vmove flag set.

While here, staticize ifnet_setbyindex() since it is not referenced
from outside of sys/net/if.c.

Also rename ifccnt field in struct vimage to ifcnt, and do some minor
whitespace garbage collection where appropriate.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz, rwatson, brooks?
Approved by:	julian (mentor)
2009-05-22 22:09:00 +00:00
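
The wrapper arrangement described above can be sketched as follows. The function names mirror those in the commit message, but the bodies are a hypothetical Python model: vnets are plain sets, and the logged step names ("allocate driver state" and so on) are invented placeholders for whatever the full and restricted sequences actually do.

```python
class Ifnet:
    """Hypothetical interface object that records the steps applied to it."""

    def __init__(self, name):
        self.name = name
        self.vnet = None
        self.log = []


def if_attach_internal(ifp, vnet, vmove):
    if not vmove:
        ifp.log.append("allocate driver state")  # full attach only
    vnet.add(ifp.name)
    ifp.vnet = vnet
    ifp.log.append("link into vnet")


def if_detach_internal(ifp, vmove):
    ifp.vnet.discard(ifp.name)
    ifp.vnet = None
    ifp.log.append("unlink from vnet")
    if not vmove:
        ifp.log.append("free driver state")      # full detach only


def if_attach(ifp, vnet):
    """Public wrapper: full attach sequence."""
    if_attach_internal(ifp, vnet, vmove=False)


def if_detach(ifp):
    """Public wrapper: full detach sequence."""
    if_detach_internal(ifp, vmove=False)


def if_vmove(ifp, new_vnet):
    """Move between vnets using only the restricted subsets."""
    if_detach_internal(ifp, vmove=True)
    if_attach_internal(ifp, new_vnet, vmove=True)
```

Moving an interface this way relinks it into the target vnet without repeating the steps that belong only to a full attach or detach.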
trasz
fb57d2691e Make 'struct acl' larger, as required to support NFSv4 ACLs. Provide
compatibility interfaces in both kernel and libc.

Reviewed by:	rwatson
2009-05-22 15:56:43 +00:00
ed
e309996070 Enable secure TTY input buffer flushing by default.
I'm leaving the sysctl there. If people really notice a slowdown, they
can revert to the old behaviour.

Discussed with:	kib
2009-05-21 16:48:06 +00:00
ed
77fae8a219 Add a new sysctl: kern.tty_inq_flush_secure.
When enabled, all TTY input queue buffers are zeroed when flushing or
closing the TTY. Because TTY input queues are also used to store typed
passwords, this may be an interesting switch to enable for
security-minded people.
2009-05-21 16:19:54 +00:00
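
What the switch changes can be shown with a small model. This Python sketch is illustrative only: `InputQueue` and its `flush_secure` argument are hypothetical stand-ins for the TTY input queue and the sysctl, not the kernel implementation.

```python
class InputQueue:
    """Hypothetical fixed-size TTY input queue."""

    def __init__(self, size, flush_secure=True):
        self.buf = bytearray(size)
        self.length = 0
        self.flush_secure = flush_secure  # models the sysctl setting

    def write(self, data):
        """Append as much of `data` as fits; return the byte count stored."""
        n = min(len(data), len(self.buf) - self.length)
        self.buf[self.length:self.length + n] = data[:n]
        self.length += n
        return n

    def flush(self):
        """Discard queued input, zeroing the storage when asked to."""
        if self.flush_secure:
            self.buf[:] = bytes(len(self.buf))
        self.length = 0
```

Without the zeroing step, a flush only resets the length, so a previously typed password would remain in the buffer's memory until overwritten. The zeroing costs one pass over the buffer per flush.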
jhb
35ba8bce41 Only use the ABI compat shim for vfs.bufspace if the old buffer is smaller
than a long.

PR:		amd64/134786
Submitted by:	Emil Mikulic  emikulic| gmail
MFC after:	3 days
2009-05-21 16:18:45 +00:00
attilio
68353e273f Change the M_WAITOK flag in notify() to M_NOWAIT in order to match
the behaviour already present with the further malloc() call in
devctl_notify().
This fixes a bug in the CAM layer where the camisr handler ended up
calling camperiphfree() (and subsequently destroy_dev(), resulting in a new
dev notify) while the xpt lock is held.

PR:		kern/130330
Tested by:	Riccardo Torrini <riccardo dot torrini at esaote dot com>
2009-05-21 13:22:07 +00:00
jhb
ebdd571432 Set the umask in a new file descriptor table earlier in fdcopy() to remove
two lock operations.
2009-05-20 18:42:04 +00:00
jhb
53d01dc702 Remove an obsolete assertion. We always wake up all waiters when unlocking
a mutex and never set the lock cookie == MTX_CONTESTED.
2009-05-20 18:29:14 +00:00
jhb
033485e00c Fix a typo. 2009-05-20 17:19:30 +00:00
imp
d339e0404f We no longer need to use d_thread_t for portability here, switch to
struct thread *.
2009-05-20 16:58:16 +00:00
kmacy
879984a728 Add minimal ZFS lock hierarchy 2009-05-20 02:51:48 +00:00
rwatson
9c1bc41813 With SMPng, DEVICE_POLLING uses its own idle threads, rather than the
system idle loop, to run ether_poll(), so make ether_poll() static.

MFC after:	1 week
2009-05-19 19:21:25 +00:00
avg
89d59b82b3 sysctl_rman: report shared resources to devinfo
shared uses of a resource are recorded on a sub-list hanging off
a main resource object on a main resource list;
without this change a shared resource (e.g. irq) is reported only
once by devinfo -r/-u;
with this change the resource is reported for each driver that
allocates it (which is even more than what vmstat -i -a reports).

Approved by:	jhb (mentor)
2009-05-19 14:08:21 +00:00
rwatson
d9e163e093 Binding interrupts to a CPU consists of two parts: setting up CPU
affinity for the interrupt thread, and requesting that underlying
hardware direct interrupts to the CPU.  For software interrupt
threads, implement a no-op interrupt event binder that returns
success, so that the interrupt management code will just set the
ithread's affinity and succeed.

Reviewed by:	jhb
MFC after:	1 week
2009-05-18 14:02:55 +00:00
ed
a6c06ba89c Mark the clock sysctls as MPSAFE.
These sysctls don't need any form of locking. At least cp_times is used
by powerd very often, which means I get 50% fewer calls to non-MPSAFE
sysctls on my system. The other 50% is consumed by dev.cpu.0.freq, but
this seems to need Giant for Newbus.
2009-05-18 12:03:43 +00:00
alc
c8b00d493e Several changes to vfs_bio_clrbuf():
Provide a more descriptive comment.

Eliminate dead code.  The page cannot possibly have PG_ZERO set.

Eliminate unnecessary blank lines.

Reviewed by:	tegge
2009-05-17 23:25:53 +00:00
alc
dc942dabcf Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This
eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

In collaboration with:	tegge
2009-05-17 20:26:00 +00:00
ed
44192767f8 Print an extra newline when not at the first column already.
This makes siginfo output look a lot better when pressing it the first
time when in sh(1), for example:

	$ load: 0.00  cmd: sh 1945 [ttyin] 3.94r 0.00u 0.00s 0% 1960k
	load: 0.00  cmd: sh 1945 [ttyin] 4.19r 0.00u 0.00s 0% 1960k

will now become:

	$
	load: 0.00  cmd: sh 1945 [ttyin] 3.94r 0.00u 0.00s 0% 1960k
	load: 0.00  cmd: sh 1945 [ttyin] 4.19r 0.00u 0.00s 0% 1960k
2009-05-17 16:17:48 +00:00
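
The column check behind this change can be sketched in a few lines. This is a hypothetical Python model (`Tty`, `write()` and `siginfo()` are illustrative names), assuming the only state needed is the current output column.

```python
class Tty:
    """Hypothetical terminal that tracks the current output column."""

    def __init__(self):
        self.out = []
        self.column = 0

    def write(self, text):
        self.out.append(text)
        for ch in text:
            # A newline returns to column 0; any other character advances.
            self.column = 0 if ch == "\n" else self.column + 1

    def siginfo(self, line):
        """Print a status line, starting on a fresh line if needed."""
        if self.column != 0:
            self.write("\n")
        self.write(line + "\n")
```

After a prompt such as `$ ` has been written, the first status line is preceded by a newline; later status lines start at column 0 already and get no extra newline.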
ed
69acbef6ca Several cleanups to tty_info(), better known as Ctrl-T.
- Only pick up PROC_LOCK once, which means we can drop the PGRP_LOCK
  right after picking up PROC_LOCK for the first time.

- Print the process real time, making it consistent with tools like
  time(1).

- Use `p' and `td' to reference the process/thread we are going to
  print. Only use pick-variables inside the loops. We already did this
  for the threads, but not the processes.
2009-05-17 12:30:25 +00:00
des
9919b017d4 Remove do-nothing code that was required to dirty the old buffer on Alpha.
Coverity ID:	838
Approved by:	jhb, alc
2009-05-15 21:34:58 +00:00
kib
b8162aa0c9 Revert r192094. The revision caused problems for sysctl(3) consumers
that expect that oldlen is filled with required buffer length even when
supplied buffer is too short and returned error is ENOMEM.

Redo the fix for kern.proc.filedesc, by reverting the req->oldidx when
remaining buffer space is too short for the current kinfo_file structure.
Also, only ignore ENOMEM. We have to convert ENOMEM to no error condition
to keep existing interface for the sysctl, though.

Reported by:	ed, Florian Smeets <flo kasimir com>
Tested by:	pho
2009-05-15 14:41:44 +00:00
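
The interaction between the generic copy-out and the per-record rollback can be modelled as follows. This Python sketch is a simplification under stated assumptions: `Request`, `old_user()` and `export_records()` are illustrative names, and the partial-copy behaviour of the generic routine is modelled, not quoted from the kernel source.

```python
import errno


class Request:
    """Hypothetical sysctl request with a fixed-size user buffer."""

    def __init__(self, oldlen):
        self.oldbuf = bytearray(oldlen)
        self.oldidx = 0


def old_user(req, record):
    """Generic copy-out: copy what fits, always advance oldidx."""
    room = max(len(req.oldbuf) - req.oldidx, 0)
    n = min(room, len(record))
    req.oldbuf[req.oldidx:req.oldidx + n] = record[:n]
    req.oldidx += len(record)  # lets callers learn the required length
    return errno.ENOMEM if n < len(record) else 0


def export_records(req, records):
    """Pack whole records only, rolling back a record that does not fit."""
    for rec in records:
        saved = req.oldidx
        err = old_user(req, rec)
        if err == errno.ENOMEM:
            req.oldidx = saved  # drop the half-written record
            return 0            # a short buffer is not an error here
        if err:
            return err
    return 0
```

The generic routine keeps reporting the full required length on a short buffer, while the record exporter guarantees that `oldidx` always ends on a record boundary, so an iterator in user space never sees a truncated structure.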
jhb
cbf4ebe5a3 - Use a separate sx lock to try to limit the number of concurrent userland
  sysctl requests to avoid wiring too much user memory.  Only grab this
  lock if the user's old buffer is larger than a page as a tradeoff to
  allow more concurrency for common small requests.
- Just use a shared lock on the sysctl tree for user sysctl requests now.

MFC after:	1 week
2009-05-14 22:01:32 +00:00
kib
7361db2279 Do not advance req->oldidx when sysctl_old_user returns an
error due to copyout failure or short buffer.

The latter breaks the usermode iterators of the sysctl results that pack
arbitrary number of variable-sized structures. Iterator expects that
kernel filled exactly oldlen bytes, and tries to interpret half-filled
or garbage structure at the end of the buffer. In particular,
kinfo_getfile(3) segfaulted.

Reported and tested by:	pho
MFC after:	3 weeks
2009-05-14 10:54:57 +00:00
jeff
20397e6431 - Implement a lockless file descriptor lookup algorithm in
   fget_unlocked().
 - Save old file descriptor tables created on expansion until
   the entire descriptor table is freed so that pointers may be
   followed without regard for expanders.
 - Mark the file zone as NOFREE so we may attempt to reference
   potentially freed files.
 - Convert several fget_locked() users to fget_unlocked().  This
   requires us to manage reference counts explicitly but reduces
   locking overhead in the common case.
2009-05-14 03:24:22 +00:00
alc
82da6bfdea Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does.  Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies.  Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by:	tegge
2009-05-13 05:39:39 +00:00
trasz
e6d7976851 Add missing 'break' statement.
Found with:	Coverity Prevent(tm)
CID:		3919
2009-05-12 17:05:40 +00:00
kib
60c4168558 Prevent overflow of uio_resid.
Noted by:	jhb
MFC after:	3 days
2009-05-11 19:58:03 +00:00
attilio
c639aa3d25 Fix a kernel compilation error, introduced after r191990, by defining
thread with curthread in the AUDIT case.

Reported by:	dchagin
2009-05-11 16:32:58 +00:00
attilio
1dcb84131b Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and related parts no longer take the
context, since it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavily changed by this commit, so third-party modules need
to be recompiled.  Bump __FreeBSD_version in order to signal this
situation.
2009-05-11 15:33:26 +00:00
alc
123b385c44 Revert CVS revision 1.94 (svn r16840). Current pmap implementations don't
suffer from the race condition that motivated revision 1.94.  Consequently,
the work-around that was implemented by revision 1.94 is no longer needed.
Moreover, reverting this work-around eliminates the need for
vfs_busy_pages() to acquire the page queues lock when preparing a buffer
for read.

Reviewed by:	tegge
2009-05-11 05:16:57 +00:00
imp
9a12159016 Spell NULL properly, use (void) rather than () for functions with no
parameters.  Mark two items as static that aren't used elsewhere...
2009-05-09 19:08:22 +00:00
imp
66ca9cb573 Retire kern.vm.kmem.size. It was marked as obsolete prior to 5.2, so
it can go.
2009-05-09 19:00:47 +00:00
kan
7b57a857b7 Do not embed struct ucred into larger netcred parent structures.
Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use separate
reference-counted ucred allocated separately instead.

While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.

Reported by:	John Hickey
PR:		kern/133439
Reviewed by:	dfr, kib
2009-05-09 18:09:17 +00:00
zec
b31e199a10 A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by:	bz
Approved by:	julian (mentor) (an earlier version of the diff)
2009-05-08 14:34:25 +00:00
zec
639797b2e6 Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg.  Struct vprocg is likely to be replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-05-08 14:11:06 +00:00
jamie
267ea54b44 Move the per-prison Linux MIB from a private one-off pointer to the new
OSD-based jail extensions.  This allows the Linux MIB to be accessed via
jail_set and jail_get, and serves as a demonstration of adding jail support
to a module.

Reviewed by:	dchagin, kib
Approved by:	bz (mentor)
2009-05-07 18:36:47 +00:00