Commit Graph

10016 Commits

Author SHA1 Message Date
Jeff Roberson
40acdeabab Commit 5/14 of sched_lock decomposition.
- Protect the cp_time tick counts with atomics instead of a global lock.
   There will only be one atomic per tick and this allows all processors
   to execute softclock concurrently.
 - In softclock, protect access to rusage and td_*tick data with the
   thread_lock(), expanding the scope of the thread lock over the whole
   function.
 - Do some creative re-arranging in hardclock() to avoid excess locking.
 - Protect the p_timer fields with the per-process spinlock.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:53:06 +00:00
Jeff Roberson
a54e85fdbf Commit 4/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
   sychronization.
 - Use the per-process spinlock rather than the sched_lock for per-process
   scheduling synchronization.
 - Move some common code into thread_suspend_switch() to handle the
   mechanics of suspending a thread.  The locking here is incredibly
   convoluted and should be simplified.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:52:24 +00:00
Jeff Roberson
2502c107ba Commit 3/14 of sched_lock decomposition.
- Add a per-turnstile spinlock to solve potential priority propagation
   deadlocks that are possible with thread_lock().
 - The turnstile lock order is defined as the exact opposite of the
   lock order used with the sleep locks they represent.  This allows us
   to walk in reverse order in priority_propagate and this is the only
   place we wish to multiply acquire turnstile locks.
 - Use the turnstile_chain lock to protect assigning mutexes to turnstiles.
 - Change the turnstile interface to pass back turnstile pointers to the
   consumers.  This allows us to reduce some locking and makes it easier
   to cancel turnstile assignment while the turnstile chain lock is held.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:51:44 +00:00
Jeff Roberson
d72e80f09a Commit 2/14 of sched_lock decomposition.
- Adapt sleepqueues to the new thread_lock() mechanism.
 - Delay assigning the sleep queue spinlock as the thread lock until after
   we've checked for signals.  It is illegal for a thread to return in
   mi_switch() with any lock assigned to td_lock other than the scheduler
   locks.
 - Change sleepq_catch_signals() to do the switch if necessary to simplify
   the callers.
 - Simplify timeout handling now that locking a sleeping thread has the
   side-effect of locking the sleepqueue.  Some previous races are no
   longer possible.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:56 +00:00
Jeff Roberson
7b20fb19fb Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
   similar to solaris's container locking.
 - A per-process spinlock is now used to protect the queue of threads,
   thread count, suspension count, p_sflags, and other process
   related scheduling fields.
 - The new thread lock is actually a pointer to a spinlock for the
   container that the thread is currently owned by.  The container may
   be a turnstile, sleepqueue, or run queue.
 - thread_lock() is now used to protect access to thread related scheduling
   fields.  thread_unlock() unlocks the lock and thread_set_lock()
   implements the transition from one lock to another.
 - A new "blocked_lock" is used in cases where it is not safe to hold the
   actual thread's lock yet we must prevent access to the thread.
 - sched_throw() and sched_fork_exit() are introduced to allow the
   schedulers to fix-up locking at these points.
 - Add some minor infrastructure for optionally exporting scheduler
   statistics that were invaluable in solving performance problems with
   this patch.  Generally these statistics allow you to differentiate
   between different causes of context switches.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-04 23:50:30 +00:00
Attilio Rao
b4b7081961 Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)
2007-06-04 21:45:18 +00:00
Attilio Rao
6759608248 Rework the PCPU_* (MD) interface:
- Rename PCPU_LAZY_INC into PCPU_INC
- Add the PCPU_ADD interface which just does an add on the pcpu member
  given a specific value.

Note that for most architectures PCPU_INC and PCPU_ADD are not safe.
This is a point that needs some discussions/work in the next days.

Reviewed by: alc, bde
Approved by: jeff (mentor)
2007-06-04 21:38:48 +00:00
David Malone
041b706b2f Despite several examples in the kernel, the third argument of
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.

Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.

In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported.  In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.
2007-06-04 18:25:08 +00:00
David Malone
df82ff50ed Add a function for exporting 64 bit types. 2007-06-04 18:14:28 +00:00
Kris Kennaway
cdcc788a7e Revert some debugging KTRs that were added during development. 2007-06-03 18:24:31 +00:00
Konstantin Belousov
7a31868ed0 Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file:
part 2. Convert calls missed in the first big commit.

Noted by:	rwatson
Pointy hat to:	kib
2007-06-01 14:33:11 +00:00
Jeff Roberson
1c4bcd050a - Move rusage from being per-process in struct pstats to per-thread in
td_ru.  This removes the requirement for per-process synchronization in
   statclock() and mi_switch().  This was previously supported by
   sched_lock which is going away.  All modifications to rusage are now
   done in the context of the owning thread.  reads proceed without locks.
 - Aggregate exiting threads rusage in thread_exit() such that the exiting
   thread's rusage is not lost.
 - Provide a new routine, rufetch() to fetch an aggregate of all rusage
   structures from all threads in a process.  This routine must be used
   in any place requiring a rusage from a process prior to it's exit.  The
   exited process's rusage is still available via p_ru.
 - Aggregate tick statistics only on demand via rufetch() or when a thread
   exits.  Tick statistics are kept in the thread and protected by sched_lock
   until it exits.

Initial patch by:	attilio
Reviewed by:		attilio, bde (some objections), arch (mostly silent)
2007-06-01 01:12:45 +00:00
Attilio Rao
2feb50bf7d Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
Paolo Pisati
3401f2c1df In some particular cases (like in pccard and pccbb), the real device
handler is wrapped in a couple of functions - a filter wrapper and an
ithread wrapper. In this case (and just in this case), the filter
wrapper could ask the system to schedule the ithread and mask the
interrupt source if the wrapped handler is composed of just an ithread
handler: modify the "old" interrupt code to make it support
this situation, while the "new" interrupt code is already ok.

Discussed with: jhb
2007-05-31 19:25:35 +00:00
Konstantin Belousov
9e223287c0 Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by:	jhb
Reviewed by:	daichi (unionfs)
Approved by:	re (kensmith)
2007-05-31 11:51:53 +00:00
Robert Watson
049c3b6cdf Now that sx(9) locks support an interruptible lock acquire primitive,
properly observe the SB_NOINTR flag in sblock.  This restores the
required behavior that lock acquisition be interruptible on the socket
buffer I/O serialization lock to allow threads waiting for I/O to be
signaled even if they aren't the thread currently holding the I/O lock.
With this change, the sblock regression test is again passed.

Reported by:		alfred
sx(9) handiwork:	attilio
2007-05-31 11:51:22 +00:00
Attilio Rao
f9819486e5 Add functions sx_xlock_sig() and sx_slock_sig().
These functions are intended to do the same actions of sx_xlock() and
sx_slock() but with the difference to perform an interruptible sleep, so
that sleep can be interrupted by external events.
In order to support these new featueres, some code renstruction is needed,
but external API won't be affected at all.

Note: use "void" cast for "int" returning functions in order to avoid tools
like Coverity prevents to whine.

Requested by: rwatson
Tested by: rwatson
Reviewed by: jhb
Approved by: jeff (mentor)
2007-05-31 09:14:48 +00:00
Attilio Rao
2c7289cbfa style(9) fixes for sx locks.
Approved by: jeff (mentor)
2007-05-29 19:46:37 +00:00
Attilio Rao
acf840c4bd Add a small fix for lock profiling in sx locks.
"0" cannot be a correct value since when the function is entered at least
one shared holder must be present and since we want the last one "1" is
the correct value.
Note that lock_profiling for sx locks is far from being perfect.
Expect further fixes for that.

Approved by: jeff (mentor)
2007-05-29 19:34:32 +00:00
Attilio Rao
02b0a160dc Fix some problems introduced with the last descriptors tables locking
patch:
- Do the correct test for ldt allocation
- Drop dt_lock just before to call kmem_free (since it acquires blocking
  locks inside)
- Solve a deadlock with smp_rendezvous() where other CPU will wait
  undefinitively for dt_lock acquisition.
- Add dt_lock in the WITNESS list of spinlocks

While applying these modifies, change the requirement for user_ldt_free()
making that returning without dt_lock held.

Tested by: marcus, tegge
Reviewed by: tegge
Approved by: jeff (mentor)
2007-05-29 18:55:41 +00:00
Robert Watson
03c96c3176 Add DDB "show unpcb" command, allowing DDB to print out many pertinent
details from UNIX domain socket protocol layer state.
2007-05-29 12:36:00 +00:00
Ed Maste
911d16b8cd Revert 1.197 and instead avoid calling kdb_enter() if the KDB_UNATTENDED
option is in use.
2007-05-28 21:50:54 +00:00
Warner Losh
cfa7a8beea Simplify the kernel configuration file return code.
Reviewed by: wkoszek
2007-05-28 20:41:10 +00:00
Ed Maste
1e62d77c09 Eliminate explicit kdb_enter in the software watchdog handler (which
produced incorrect behaviour with the KDB_UNATTENDED option) and call
panic in both the KDB and non-KDB cases.  This change is consistent
with rwatson's current kdb/ddb work.
2007-05-28 19:51:12 +00:00
Robert Watson
dede2ab3b2 In kern_kevent(), unconditionally fdrop() fp once fget() has succeeded,
as we never have an opportunity to set it to NULL.

Found with:	Coverity Prevent(tm)
CID:		2161
2007-05-28 17:15:05 +00:00
Robert Watson
e1e8f51b85 Universally adopt most conventional spelling of acquire. 2007-05-27 20:50:23 +00:00
Robert Watson
87066f04c6 Select a more appealing spelling for the word acquire. 2007-05-27 19:24:00 +00:00
Robert Watson
1c293049d9 Add parens around *free in *free++ in mbp_count() so that mbp_count()
actually works.  mbp_count() turns out only to be used in debugging code
in if_patm_intr.c, so this bug did not affect much in practice.

Found with:	Coverity Prevent(tm)
CID:		1943
2007-05-27 17:38:36 +00:00
Robert Watson
097e1ea87f Remove amountpipes counter for pipes -- this replicates the function of
existing UMA statistics for pipes, and allows us to get rid of both the
per-pipe dtor and two atomic operations per pipe required to maintain
the counter.
2007-05-27 17:33:10 +00:00
Robert Watson
e4e80aa713 Remove #if 0'd check for 0-size allocations, which if enabled, called
kdb_enter().
2007-05-27 13:13:46 +00:00
Pawel Jakub Dawidek
6e042171bd To avoid a deadlock when handling .. directory during a lookup, we unlock
parent vnode and relock it after locking child vnode. The problem was that
we always relock it exclusively, even when it was share-locked.

Discussed with:	jeff
2007-05-25 22:23:38 +00:00
Pawel Jakub Dawidek
b4c85af977 We no longer need to put namecache entries onto temporary mplist.
It was useful in revision 1.86, but should have been removed in 1.89.
2007-05-25 22:19:49 +00:00
Pawel Jakub Dawidek
950afe9972 The cache_leaf_test() function seems to be unused, so remove it. 2007-05-25 22:16:17 +00:00
Sam Leffler
3c86b7cdad fix comment typo 2007-05-23 17:28:21 +00:00
Robert Watson
4dec0e67ea Comment that tdsignal() may be entered from the debugger. 2007-05-23 17:27:42 +00:00
Robert Watson
63d69d2592 Initialize time_lock before calling cpu_initclocks(). This corrects a
race condition in which hardclock fires before the mutex is initialized
leading to a "corrupt spinlock" panic.

Submitted by:	attilio
2007-05-23 17:27:01 +00:00
Olivier Houchard
302e130edc Remove duplicate includes.
Submitted by:   Cyril Nguyen Huu <cyril ci0 org>
2007-05-23 13:36:02 +00:00
Pawel Jakub Dawidek
f013ccb768 - Remove redundant initialization.
- Compare pointer with NULL.
2007-05-22 23:05:48 +00:00
Diomidis Spinellis
fdbe5babe4 Increase precision of time values in the process accounting
structure, while maintaining backward compatibility with legacy
file and record formats.
2007-05-22 06:51:38 +00:00
Jeff Roberson
8b98fec903 - Move clock synchronization into a seperate clock lock so the global
scheduler lock is not involved.  sched_lock still protects the sched_clock
   call.  Another patch will remedy this.

Contributed by:	Attilio Rao <attilio@FreeBSD.org>
Tested by:	kris, jeff
2007-05-20 22:11:50 +00:00
John Baldwin
7ec137e5b0 Rename the macros for assertion flags passed to sx_assert() from SX_* to
SA_* to match mutexes and rwlocks.  The old flags still exist for
backwards compatiblity.

Requested by:	attilio
2007-05-19 21:26:05 +00:00
Andre Oppermann
1c9dbd1567 In kern_sendfile() adjust byte accounting of the file sending loop to
ignore the size of any headers that were passed with the sendfile(2)
system call.  Otherwise the file sent will be truncated by the header
size if the nbytes parameter was provided.  The bug doesn't show up
when either nbytes is zero, meaning send the whole file, or no header
iovec is provided.

Resolve a potential error aliasing of errors from the VM and sf_buf
parts and the protocol send parts where an error of the latter over-
writes one of the former.

Update comments.

The byte accounting bug wasn't seen in earlier because none of the popular
sendfile(2) consumers, Apache, lighttpd and our ftpd(8) use it in modes
that trigger it.  The varnish HTTP proxy makes full use of it and exposed
the problem.

Bug found by:	phk
Tested by:	phk
2007-05-19 20:50:59 +00:00
John Baldwin
71a95881d4 Expose sx_xholder() as a public macro. It returns a pointer to the thread
that holds the current exclusive lock, or NULL if no thread holds an
exclusive lock.

Requested by:	pjd
2007-05-19 20:18:12 +00:00
John Baldwin
46a8b9cbc9 Oops, didn't include SX_ADAPTIVESPIN in the list of valid flags for the
assert in sx_init_flags().

Submitted by:	attilio
2007-05-19 18:34:24 +00:00
John Baldwin
b0d673254d Add a new SX_RECURSE flag to make support for recursive exclusive locks
conditional.  By default, sx(9) locks are back to not supporting recursive
exclusive locks.

Submitted by:	attilio
2007-05-19 16:35:27 +00:00
Alexander Kabaev
ee9f46615e Add kern.arnd sysctl. SSP code uses it to initialize the stack guard
magic value.

Submitted by: Jeremie Le Hen <jeremie@le-hen.org>
2007-05-19 04:53:14 +00:00
Robert Watson
eefc497257 Remove unnecessary assignment.
CID:		2227
Found with:	Coverity Prevent(tm)
2007-05-18 21:10:08 +00:00
John Baldwin
da7d0d1e24 Fix a comment. 2007-05-18 15:05:41 +00:00
John Baldwin
c91fcee75d Move lock_profile_object_{init,destroy}() into lock_{init,destroy}(). 2007-05-18 15:04:59 +00:00
Konstantin Belousov
d413d21071 Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.
2007-05-18 13:02:13 +00:00
Jeff Roberson
222d01951f - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
Jeff Roberson
2b7e2ee7a5 - Convert turnstiles and sleepqueus to use UMA. This provides a modest
speedup and will be more useful after each gains a spinlock in the
   impending thread_lock() commit.
 - Move initialization and asserts into init/fini routines.  fini routines
   are only needed in the INVARIANTS case for now.

Submitted by:	Attilio Rao <attilio@FreeBSD.org>
Tested by:	kris, jeff
2007-05-18 06:32:24 +00:00
Peter Wemm
c6b342f820 Eliminate a micro-optimization that hasn't had any effect for 15+ years. 2007-05-17 15:31:14 +00:00
Warner Losh
3627f73782 Don't export a kern.conftxt sysctl, except when INCLUDE_CONF_FILE is
defined.  This restores the old behavior, and eliminates the
dependency on the kernconf.tmpl when INCLUDE_CONFIG_FILE isn't
included in the kernel config.  There were many people in the terminal
room that had almost, but not quite, up-to-date config files that this
helps.  I don't know if this is the result of skew among the cvsup
servers, or some other more subtle problem.  However, this fix should
work for any config of recent vintage (I tested with the latest, and
one before the recent changes, and eye-balled the intermediate
versions).

Reviewed by: the terminal room crew
2007-05-17 05:05:12 +00:00
Robert Watson
d19e16a72c Generally migrate to ANSI function headers, and remove 'register' use. 2007-05-16 20:41:08 +00:00
Wojciech A. Koszek
5f9974ae57 Handle !INCLUDE_CONFIG_FILE entirely in the kernel. This should make some
developers happy, since it will let them to use old config(8) with newer
kernels.

Reviewed by:	imp
Approved by:	imp
2007-05-16 16:08:04 +00:00
John Baldwin
19059a13ed Rework the support for ABIs to override resource limits (used by 32-bit
processes under 64-bit kernels).  Previously, each 32-bit process overwrote
its resource limits at exec() time.  The problem with this approach is that
the new limits affect all child processes of the 32-bit process, including
if the child process forks and execs a 64-bit process.  To fix this, don't
ovewrite the resource limits during exec().  Instead, sv_fixlimits() is
now replaced with a different function sv_fixlimit() which asks the ABI to
sanitize a single resource limit.  We then use this when querying and
setting resource limits.  Thus, if a 32-bit process sets a limit, then
that new limit will be inherited by future children.  However, if the
32-bit process doesn't change a limit, then a future 64-bit child will
see the "full" 64-bit limit rather than the 32-bit limit.

MFC is tentative since it will break the ABI of old linux.ko modules (no
other modules are affected).

MFC after:	1 week
2007-05-14 22:40:04 +00:00
John Baldwin
1bba2a940b Move cpu_exit() earlier in exit1() to close a race between
SIGCHLD/kevent(2) notification of process termination and wait().  Now
we no longer drop locks between sending the notification and marking
the process as a zombie.  Previously, if another process attempted to do
a wait() with W_NOHANG after receiving a SIGCHLD or kevent and locked
the process while the exiting thread was in cpu_exit(), then wait() would
fail to find the process, which is quite astonishing to the process
calling wait().

MFC after:	3 days
2007-05-14 22:21:58 +00:00
Kirk McKusick
2f79c90982 Update entries for building tags. 2007-05-13 18:21:54 +00:00
Wojciech A. Koszek
744b947ef8 Improve INCLUDE_CONFIG_FILE support.
This change will let us to have full configuration of a running kernel
available in sysctl:

	sysctl -b kern.conftxt

The same configuration is also contained within the kernel image. It can be
obtained with:

	config -x <kernelfile>

Current functionality lets you to quickly recover kernel configuration, by
simply redirecting output from commands presented above and starting kernel
build procedure. "include" statements are also honored, which means options
and devices from included files are also included.

Please note that comments from configuration files are not preserved by
default. In order to preserve them, you can use -C flag for config(8). This
will bring configuration file and included files literally; however,
redirection to a file no longer works directly.

This commit was followed by discussion, that took place on freebsd-current@.
For more details, look here:

	http://lists.freebsd.org/pipermail/freebsd-current/2007-March/069994.html
	http://lists.freebsd.org/pipermail/freebsd-current/2007-May/071844.html

Development of this patch took place in Perforce, hierarchy:

	//depot/user/wkoszek/wkoszek_kconftxt/

Support from:	freebsd-current@ (links above)
Reviewed by:	imp@
Approved by:	imp@
2007-05-12 19:38:18 +00:00
Andre Oppermann
0489b64c5e Make the TCP timer callout obtain Giant if the network stack is marked
as non-mpsafe.

This change is to be removed when all protocols are mp-safe.
2007-05-11 20:52:47 +00:00
Robert Watson
08d73f1370 Remove more one more stale comment regarding unpcb type-safety. 2007-05-11 12:28:45 +00:00
Robert Watson
d7924b7086 Clarify and update quite a few comments to reflect locking optimizations,
the addition of unpcb refcounts, and bug fixes.  Some of these fixes are
appropriate for MFC.

MFC after:	3 days
2007-05-11 12:10:45 +00:00
John Baldwin
0026c92c3e Add destroyed cookie values for sx locks and rwlocks as well as extra
KASSERTs so that any lock operations on a destroyed lock will panic or
hang.
2007-05-08 21:51:37 +00:00
John Baldwin
c0bfd70306 Teach 'show lock' to properly handle a destroyed mutex. 2007-05-08 21:50:46 +00:00
John Baldwin
9fa7ce0f23 Fix a potential LOR with sx_sleep() and cv_wait() with sx locks by
1) adding the thread to the sleepq via sleepq_add() before dropping the
lock, and 2) dropping the sleepq lock around calls to lc_unlock() for
sleepable locks (i.e. locks that use sleepq's in their implementation).
2007-05-08 21:49:59 +00:00
Pyun YongHyeon
ccd8d954f3 Add missing socket buffer unlock before returning to userland.
Reviewed by:    rwatson
2007-05-08 12:34:14 +00:00
Paolo Pisati
bafe5a3118 Bring in the reminaing bits to make interrupt filtering work:
o push much of the i386 and amd64 MD interrupt handling code
  (intr_machdep.c::intr_execute_handlers()) into MI code
  (kern_intr.c::ithread_loop())
o move filter handling to kern_intr.c::intr_filter_loop()
o factor out the code necessary to mask and ack an interrupt event
  (intr_machdep.c::intr_eoi_src() and intr_machdep.c::intr_disab_eoi_src()),
  and make them part of 'struct intr_event', passing them as arguments to
  kern_intr.c::intr_event_create().
o spawn a private ithread per handler (struct intr_handler::ih_thread)
  with filter and ithread functions.

Approved by: re (implicit?)
2007-05-06 17:02:50 +00:00
Wojciech A. Koszek
9e2894466a Don't acquire Giant unconditionally.
Reviewed by:	rwatson
2007-05-06 12:00:38 +00:00
Konstantin Belousov
5c76452f8f Mark the filedescriptor table entries with VOP_OPEN being performed for them
as UF_OPENING. Disable closing of that entries. This should fix the crashes
caused by devfs_open() (and fifo_open()) dereferencing struct file * by
index, while the filedescriptor is closed by parallel thread.

Idea by:	tegge
Reviewed by:	tegge (previous version of patch)
Tested by:	Peter Holm
Approved by:	re (kensmith)
MFC after:	3 weeks
2007-05-04 14:23:29 +00:00
Robert Watson
7abab91135 sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags
on each socket buffer with the socket buffer's mutex.  This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.

This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.

While here, fix two historic bugs:

(1) a bug allowing I/O to be occasionally interlaced during long I/O
    operations (discovere by Isilon).

(2) a bug in which failed non-blocking acquisition of the socket buffer
    I/O serialization lock might be ignored (discovered by sam).

SCTP portion of this patch submitted by rrs.
2007-05-03 14:42:42 +00:00
Alan Cox
fa75abb0d2 Remove unneeded include files. 2007-05-01 06:35:54 +00:00
John-Mark Gurney
ebf750a9fd Complete removal of restriction about overlaps to rman_manage_region:
remove comment and man page verbage...

Document return values for rman_init and rman_manage_region..

MFC after:	1 week
2007-04-28 07:37:49 +00:00
John Baldwin
06e043fb20 Avoid a lot of code duplication by using kern_open() to open /dev/null
in fdcheckstd() instead of a stripped down version of kern_open()'s code.

MFC after:	1 week
Reviewed by:	cperciva
2007-04-26 18:01:19 +00:00
Konstantin Belousov
e5ea32c290 Allow the dounmount() to proceed even for doomed coveredvp.
In dounmount(), before or while vn_lock(coveredvp) is called, coveredvp
vnode may be VI_DOOMED due to one of the following:
- other thread finished unmount and vput()ed it, and vnode was chosen
  for recycling, while vn_lock() slept;
- forced unmount of the coveredvp->v_mount fs.
In the first case, next check for changed v_mountedhere or mnt_gen counter
would be successfull. In the second case, the unmount shall be allowed.

Submitted by:	sobomax
MFC after:	2 weeks
2007-04-26 08:56:56 +00:00
Konstantin Belousov
8e68f804a7 Disable nesting of BOP_BDFLUSH(). VOP_FSYNC() call in bdwrite() could
result in bdwrite() being reentered, thus causing infinite recursion.

Reported and tested by:	Peter Holm
Reviewed by:	tegge
MFC after:	2 weeks
2007-04-24 10:59:21 +00:00
Pawel Jakub Dawidek
8c804c7c98 Correct typo. 2007-04-23 12:53:00 +00:00
Robert Watson
c14d15ae3e Remove MAC Framework access control check entry points made redundant with
the introduction of priv(9) and MAC Framework entry points for privilege
checking/granting.  These entry points exactly aligned with privileges and
provided no additional security context:

- mac_check_sysarch_ioperm()
- mac_check_kld_unload()
- mac_check_settime()
- mac_check_system_nfsd()

Add mpo_priv_check() implementations to Biba and LOMAC policies, which,
for each privilege, determine if they can be granted to processes
considered unprivileged by those two policies.  These mostly, but not
entirely, align with the set of privileges granted in jails.

Obtained from:	TrustedBSD Project
2007-04-22 15:31:22 +00:00
Stephane E. Potvin
0e5179e441 Add support for specifying a minimal size for vm.kmem_size in the loader via
vm.kmem_size_min. Useful when using ZFS to make sure that vm.kmem size will
be at least 256mb (for example) without forcing a particular value via vm.kmem_size.

Approved by: njl (mentor)
Reviewed by: alc
2007-04-21 01:14:48 +00:00
Pawel Jakub Dawidek
eed20b37f5 Don't reinvent vm_page_grab().
Reviewed by:	ups
2007-04-20 19:49:20 +00:00
Kip Macy
fb1e3ccd7e Schedule the ithread on the same cpu as the interrupt
Tested by: kmacy
Submitted by: jeffr
2007-04-20 05:45:46 +00:00
Joseph Koshy
382d30cdd8 Fix witness(4) warnings about mutex use.
Group mutexes used in hwpmc(4) into 3 "types" in the sense of
witness(4):

 - leaf spin mutexes---only one of these should be held at a time,
   so these mutexes are specified as belonging to a single witness
   type "pmc-leaf".

 - `struct pmc_owner' descriptors are protected by a spin mutex of
   witness type "pmc-owner-proc".  Since we call wakeup_one() while
   holding these mutexes, the witness type of these mutexes needs
   to dominate that of "sleepq chain" mutexes.

 - logger threads use a sleep mutex, of type "pmc-sleep".

Submitted by:	wkoszek (earlier patch)
2007-04-19 08:02:51 +00:00
Pawel Jakub Dawidek
fb1daf8164 Fix a bug in sendfile(2) when files larger than page size and nbytes=0.
When nbytes=0, sendfile(2) should use file size. Because of the bug, it
was sending half of a file. The bug is that 'off' variable can't be used
for size calculation, because it changes inside the loop, so we should
use uap->offset instead.
2007-04-19 05:54:45 +00:00
Nate Lawson
0ae62c18a0 Bump the interrupt storm detection counter to 1000. My slow fileserver
gets a bogus irq storm detected when periodic daily kicks off at 3 am
and disconnects the disk.  Change the print logic to print once per second
when the storm is occurring instead of only once.  Otherwise, it appeared
that something else was causing the errors each night at 3 am since the
print only occurred the first time.

Reviewed by:	jhb
MFC after:	1 week
2007-04-19 01:24:32 +00:00
Pawel Jakub Dawidek
7760d8409f Export vfs_mount_alloc() as it is used in ZFS. 2007-04-17 21:14:06 +00:00
John Baldwin
2248f68064 - Add a 'show rman <rm>' DDB command to dump the resources in a resource
manager similar to 'devinfo -u'.
- Add a 'show allrman' DDB command that effectively does 'show rman' on all
  resource managers in the system.
2007-04-16 21:09:03 +00:00
Kip Macy
21ee3e7aff remove now invalid check from m_sanity
panic on m_sanity check failure with INVARIANTS
2007-04-14 20:19:16 +00:00
Pawel Jakub Dawidek
24b0502ee0 Fix jails and jail-friendly file systems handling:
- We need to allow for PRIV_VFS_MOUNT_OWNER inside a jail.
- Move security checks to vfs_suser() and deny unmounting and updating
  for jailed root from different jails, etc.

OK'ed by:	rwatson
2007-04-13 23:54:22 +00:00
Pawel Jakub Dawidek
6bc3ab2574 When we are running low on vnodes, there is currently no way to ask other
subsystems to release some vnodes. Implement backpressure based on
vfs_lowvnodes event (similar to vm_lowmem for memory).
2007-04-13 08:38:48 +00:00
Robert Watson
94b94b2b49 Remove now-obsolete comment regarding mqueue privileges in jail. 2007-04-11 16:22:59 +00:00
Robert Watson
4b08405682 Allow PRIV_NETINET_REUSEPORT in jail. 2007-04-10 15:59:49 +00:00
Robert Watson
9956b3f5e4 Do allow POSIX mqueue unlink privilege inside a jail, as we all all
other POSIX mqueue privileges inside a jail.
2007-04-10 15:40:27 +00:00
Pawel Jakub Dawidek
08be819487 Minor style cleanups (mostly removal of trailing whitespaces). 2007-04-10 15:29:37 +00:00
Pawel Jakub Dawidek
21ff8c6715 Correct typos. 2007-04-10 15:22:40 +00:00
Nate Lawson
a363f67a81 Restore the locking for the sleep/wakeup to avoid waiting an extra 1 sec
if a race was lost.  We're still single-threaded at this point, but just
be safe for the future.
2007-04-09 21:10:04 +00:00
Nate Lawson
6b1e469ea5 Clean up the root mount and mount wait code. No mutexes are needed here
since a spurious wakeup() is the only possible outcome and this is fine in
the BSD programming model.
2007-04-09 19:23:52 +00:00
Pawel Jakub Dawidek
82068fe7a9 Add kern.hostuuid sysctl, which will be used to keep host's UUID.
Reviewed by:	mlaier, rink, brooks, rwatson
2007-04-09 19:18:09 +00:00
Pawel Jakub Dawidek
2eb68d493f Add root_mounted() function that returns true if the root file system is
already mounted.
2007-04-08 23:54:01 +00:00
Pawel Jakub Dawidek
c2cda60911 prison_free() can be called with a mutex held. This wasn't a problem until
I converted allprison_mtx mutex to allprison_lock sx lock. To fix this LOR,
move prison removal to prison_complete() entirely. To ensure that noone
will reference this prison before it's beeing removed from the list skip
prisons with 'pr_ref == 0' in prison_find() and assert that pr_ref has to
greater than 0 in prison_hold().

Reported by:	kris
OK'ed by:	rwatson
2007-04-08 10:46:23 +00:00
Pawel Jakub Dawidek
b63b0c6529 Only use prison mutex to protect the fields that need to be protected by it. 2007-04-08 10:21:38 +00:00