Commit Graph

9944 Commits

Author SHA1 Message Date
Pawel Jakub Dawidek
695919ad9a Make vfs_mount_destroy() and vfs_freeopts() non-static, I'd like to use them. 2007-03-31 22:44:45 +00:00
Robert Watson
e92d773fbc Rather than ignoring any error return from getnewvnode() in nameiinit(),
explicitly test and panic.  This should not ever happen, but if it does,
this is a preferred failure mode to a NULL pointer dereference in kernel.

Coverity CID:	1716
Found with:	Coverity Prevent(tm)
2007-03-31 16:08:50 +00:00
John Baldwin
b80ad3eea1 - Drop memory barriers in rw_try_upgrade(). We don't need an 'acq' memory
barrier here as the earlier rw_rlock() already contained one.
- Comment fix.
2007-03-30 18:08:55 +00:00
John Baldwin
ab2dab1680 - Use lock_init/lock_destroy() to setup the lock_object inside of lockmgr.
We can now use LOCK_CLASS() as a stronger check in lockmgr_chain() as a
  result.  This required putting back lk_flags as lockmgr's use of flags
  conflicted with other flags in lo_flags otherwise.
- Tweak 'show lock' output for lockmgr to match sx, rw, and mtx.
2007-03-30 18:07:24 +00:00
Wojciech A. Koszek
2404c938e6 vm_map_delete should be used only internally, by the VM subsystem. Replace
it with vm_map_remove, which not only embeds an additional check, but also
takes care of locking.

Reviewed by:	alc
Approved by:	alc, cognet (mentor)
2007-03-29 13:26:13 +00:00
John Baldwin
4649e92b4e Align 'struct thread' on 16 byte boundaries so that the lower 4 bits are
always 0.  Previously we aligned threads on a minimum of 8-byte boundaries.

Note: This changes the uma zone to no longer cache align threads.  We
really want the uma zone to align threads to MAX(16, cache line size)
but there currently isn't a good way to express that to uma.

Submitted by:	attilio
2007-03-27 16:51:34 +00:00
Marcel Moolenaar
f3ea971bf0 PowerPC is the only architecture with mpsafe_vfs=0. This is now
broken. Rudimentary tests show that PowerPC can run with
mpsafe_vfs=1. Make it so...
2007-03-27 05:29:41 +00:00
Nate Lawson
0d4ac62a35 Add an interface for drivers to be notified of changes to CPU frequency.
cpufreq_pre_change is called before the change, giving each driver a chance
to revoke the change.  cpufreq_post_change provides the results of the
change (success or failure).  cpufreq_levels_changed gives the unit number
of the cpufreq device whose number of available levels has changed.  Hook
in all the drivers I could find that needed it.

* TSC: update TSC frequency value.  When the available levels change, take the
highest possible level and notify the timecounter set_cputicker() of that
freq.  This gets rid of the "calcru: runtime went backwards" messages.
* identcpu: updates the sysctl hw.clockrate value
* Profiling: if profiling is active when the clock changes, let the user
know the results may be inaccurate.

Reviewed by:	bde, phk
MFC after:	1 month
2007-03-26 18:03:29 +00:00
Ed Maste
caa8943810 Avoid manipulating semu_list outside of the scope of SEMUNDO_LOCK(). This
would lead to an occasional hang with a cycle in semu_list.

X-Discussed-On: hackers@
2007-03-26 17:41:14 +00:00
Robert Watson
8c799760e1 Following movement of functions from uipc_socket2.c to uipc_socket.c and
uipc_sockbuf.c, clean up and update comments.
2007-03-26 17:05:09 +00:00
Robert Watson
20d9e5e87c Complete removal of uipc_socket2.c by moving the last few functions to
other C files:

- Move sbcreatecontrol() and sbtoxsockbuf() to uipc_sockbuf.c.  While
  sbcreatecontrol() is really an mbuf allocation routine, it does its work
  with awareness of the layout of socket buffer memory.

- Move pru_*() protocol switch stubs to uipc_socket.c where the non-stub
  versions of several of these functions live.  Likewise, move socket state
  transition calls (soisconnecting(), etc) to uipc_socket.c.  Move
  sodupsockaddr() and sotoxsocket().
2007-03-26 08:59:03 +00:00
Kris Kennaway
0c9c08dd9c Correct a comment typo 2007-03-25 10:07:23 +00:00
Kris Kennaway
bd37fd7220 Update a comment: we usually call exec_vmspace_new with Giant not held,
but sometimes it is.
2007-03-25 10:05:44 +00:00
Ed Maste
13b762a304 Stop setting ki_ocomm (thread name) to the proc name by default, as nothing
in the base system relies on this any longer.
2007-03-23 04:01:08 +00:00
John Baldwin
cd6e6e4e11 - Simplify the #ifdef's for adaptive mutexes and rwlocks by conditionally
defining a macro earlier in the file.
- Add NO_ADAPTIVE_RWLOCKS option to disable adaptive spinning for rwlocks.
2007-03-22 16:09:23 +00:00
Gleb Smirnoff
cd68a3f706 Move the dom_dispose and pru_detach calls in sofree() earlier. Only after
calling pru_detach can we be absolutely sure that we don't have any
references to the socket in the stack.

This closes a race between lockless sbdestroy() and data arriving on a socket.

Reviewed by:	rwatson
2007-03-22 13:21:24 +00:00
John Baldwin
8f27b08e87 Rename the cv_*wait*() functions to _cv_*wait*() and change their second
argument from a mutex to a lock_object.  Add cv_*wait*() wrapper macros
that accept either a mutex, rwlock, or sx lock as the second argument and
convert it to a lock_object and then call _cv_*wait*().  Basically, the
visible difference is that you can now use rwlocks and sx locks with
condition variables using the same API as with mutexes.
2007-03-21 22:22:13 +00:00
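
A minimal sketch of the API described in the entry above: a condition variable paired directly with an rwlock rather than a mutex. The foo_* names and surrounding logic are invented for illustration; only the cv_*() and rw_*() calls are the kernel API the commit refers to.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rwlock.h>
    #include <sys/condvar.h>

    static struct rwlock foo_lock;      /* hypothetical lock */
    static struct cv     foo_cv;        /* hypothetical condition variable */
    static int           foo_ready;     /* protected by foo_lock */

    static void
    foo_consumer(void)
    {
        rw_wlock(&foo_lock);
        while (foo_ready == 0)
            cv_wait(&foo_cv, &foo_lock);    /* rwlock accepted directly */
        foo_ready = 0;
        rw_wunlock(&foo_lock);
    }

    static void
    foo_producer(void)
    {
        rw_wlock(&foo_lock);
        foo_ready = 1;
        cv_broadcast(&foo_cv);
        rw_wunlock(&foo_lock);
    }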
John Baldwin
aa89d8cd52 Rename the 'mtx_object', 'rw_object', and 'sx_object' members of mutexes,
rwlocks, and sx locks to 'lock_object'.
2007-03-21 21:20:51 +00:00
John Baldwin
503916a7c1 Don't use cv_wait_unlock() to implement cv_wait(). Instead, implement
cv_wait() fully and add missing KTRACE context switch traces.
2007-03-21 20:46:26 +00:00
John Baldwin
ecd8246189 If vn_open() fails during kern_open(), don't fdrop() the new file object
until after the call to fdclose().  This closes an obscure race that
could result in the later call to fdclose() actually closing a different
file descriptor if another thread close()'s the file descriptor being
opened before fdrop() is called: the fdrop() in kern_open() frees the
file object, then a second thread (or a third) creates a new file
descriptor that reuses both the same index and the same file pointer,
thus tricking fdclose() in the first thread into thinking that the
original file was still open.

MFC after:	1 week
2007-03-21 19:32:08 +00:00
John Baldwin
6d257b6e70 Handle the case when a thread is blocked on a lockmgr lock with LK_DRAIN
in DDB's 'show sleepchain'.

MFC after:	3 days
2007-03-21 19:28:20 +00:00
Andre Oppermann
4e02375908 Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream
sockets.

Adjust tcp_output() to make use of it.

Tested by:	gallatin
2007-03-19 18:35:13 +00:00
Pawel Jakub Dawidek
9a2fd584b4 Don't deny unmounting file systems for jailed processes immediately; instead, allow
prison_priv_check() to decide what to do.

This change is not supposed to change current (security) behaviour
in any way.

This change is similar to the change of PRIV_VFS_MOUNT in the previous revision.
2007-03-18 02:39:19 +00:00
Jeff Roberson
52bc574cc7 - Handle the case where slptime == runtime.
Submitted by:	Antoine Brodin
2007-03-17 23:32:48 +00:00
Jeff Roberson
4499aff6ec - Cast the intermediate value in priority computation back down to
unsigned char.  Weirdly, casting the 1 constant to u_char still produces
   a signed integer result that is then used in the % computation.  This
   avoids that mess all together and causes a 0 pri to turn into 255 % 64
   as we expect.

Reported by:	kkenn (about 4 times, thanks)
2007-03-17 18:13:32 +00:00
John Baldwin
3076ca6720 Just use 'fdrop()' instead of 'FILE_LOCK(); fdrop_locked()' in
dupfdopen().  While I'm at it, move the second fdrop() out from under the
filedesc lock.
2007-03-15 21:19:21 +00:00
Pawel Jakub Dawidek
7533652025 Don't deny mounting for jailed processes immediately; instead, allow
prison_priv_check() to decide what to do.

This change is not supposed to change current (security) behaviour
in any way.

Reviewed by:	rwatson
2007-03-14 13:09:59 +00:00
Pawel Jakub Dawidek
f7d4e990c7 White space nits. 2007-03-14 12:54:10 +00:00
Konstantin Belousov
71d49316cc Busy filesystem around call of VFS_QUOTACTL() vfs op.
Tested by:	Peter Holm
Reviewed by:	tegge
Approved by:	re (kensmith)
2007-03-14 08:45:55 +00:00
John Baldwin
c1f2a5334d Print readers count as unsigned in ddb 'show lock'.
Submitted by:	attilio
2007-03-13 16:51:27 +00:00
Tor Egge
61b9d89ff0 Make insmntque() externally visible and allow it to fail (e.g. during
late stages of unmount).  On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque().  Previously,
embryonic vnodes were put onto the list of vnodes belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode.  The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup.  Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by:	re (kensmith)
Reviewed by:	kib
In collaboration with:	kib
2007-03-13 01:50:27 +00:00
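
A rough sketch of the vnode-creation pattern the entry above describes, using an invented "myfs" filesystem; the cleanup callback, field names, and error handling are condensed placeholders, not the actual code of any filesystem.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/lock.h>
    #include <sys/proc.h>
    #include <sys/mount.h>
    #include <sys/vnode.h>

    extern struct vop_vector myfs_vnodeops;     /* hypothetical */

    static void
    myfs_insmntque_dtr(struct vnode *vp, void *arg)
    {
        /* File-system specific cleanup before the vnode is recycled. */
        vp->v_data = NULL;
        vput(vp);
    }

    int
    myfs_vget(struct mount *mp, void *node_data, struct vnode **vpp)
    {
        struct vnode *vp;
        int error;

        error = getnewvnode("myfs", mp, &myfs_vnodeops, &vp);
        if (error != 0)
            return (error);
        vp->v_data = node_data;
        /* Lock and set up the vnode before placing it on the mount's list. */
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);
        error = insmntque1(vp, mp, myfs_insmntque_dtr, NULL);
        if (error != 0)
            return (error);     /* the callback already cleaned up vp */
        *vpp = vp;
        return (0);
    }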
John Baldwin
4b493b1a6d Fix a typo. 2007-03-12 20:10:29 +00:00
John Baldwin
7568503421 - Use m_gethdr(), m_get(), and m_clget() instead of the macros in
sosend_copyin().
- Use M_WAITOK instead of M_TRYWAIT in sosend_copyin().
- Don't check for NULL from M_WAITOK and return ENOBUFS.
  M_WAITOK/M_TRYWAIT allocations don't fail with NULL.

Reviewed by:	andre
Requested by:	andre (2)
2007-03-12 19:27:36 +00:00
Robert Watson
6e2faa2444 In uipc_close(), we no longer always free the unpcb, as the last reference
may be dropped later.  In this case, always unlock the unpcb so as not to
leak the lock.

Found by:	kris (BugMagnet)
2007-03-12 14:52:00 +00:00
John Baldwin
6caa5f40a2 Use sx_sleep() in the main loop of the accounting kthread. 2007-03-09 23:29:31 +00:00
John Baldwin
e7573e7ad7 Allow threads to atomically release rw and sx locks while waiting for an
event.  Locking primitives that support this (mtx, rw, and sx) now each
include their own foo_sleep() routine.
- Rename msleep() to _sleep() and change its 'struct mtx' object to a
  'struct lock_object' pointer.  _sleep() uses the recently added
  lc_unlock() and lc_lock() function pointers for the lock class of the
  specified lock to release the lock while the thread is suspended.
- Add wrappers around _sleep() for mutexes (mtx_sleep()), rw locks
  (rw_sleep()), and sx locks (sx_sleep()).  msleep() still exists and
  is now identical to mtx_sleep(), but it is deprecated.
- Rename SLEEPQ_MSLEEP to SLEEPQ_SLEEP.
- Rewrite much of sleep.9 to not be msleep(9) centric.
- Flesh out the 'RETURN VALUES' section in sleep.9 and add an 'ERRORS'
  section.
- Add __nonnull(1) to _sleep() and msleep_spin() so that the compiler will
  warn if you try to pass a NULL wait channel.  The functions already have
  a KASSERT to that effect.
2007-03-09 22:41:01 +00:00
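
A hedged sketch of one of the new wrappers in use: sleeping on a wait channel while atomically releasing an sx lock. The acct_* names are invented; only sx_sleep(), sx_xlock(), and sx_xunlock() are the kernel API the commit describes.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/lock.h>
    #include <sys/sx.h>

    static struct sx acct_sx;       /* hypothetical sx lock */
    static int       acct_work;     /* protected by acct_sx */

    static void
    acct_worker(void)
    {
        sx_xlock(&acct_sx);
        while (acct_work == 0) {
            /* Drops acct_sx while asleep, reacquires it on wakeup. */
            sx_sleep(&acct_work, &acct_sx, 0, "acctwt", 0);
        }
        acct_work = 0;
        sx_xunlock(&acct_sx);
    }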
John Baldwin
6e21afd40c Add two new function pointers 'lc_lock' and 'lc_unlock' to lock classes.
These functions are intended to be used to drop a lock and then reacquire
it when doing a sleep such as msleep(9).  Both functions accept a
'struct lock_object *' as their first parameter.  The 'lc_unlock' function
returns an integer that is then passed as the second parameter to the
subsequent 'lc_lock' function.  This can be used to communicate state.
For example, sx locks and rwlocks use this to indicate if the lock was
share/read locked vs exclusive/write locked.

Currently, spin mutexes and lockmgr locks do not provide working lc_lock
and lc_unlock functions.
2007-03-09 16:27:11 +00:00
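
A simplified sketch (not the actual _sleep() implementation) of how the new hooks let generic code drop and reacquire an arbitrary lock through its lock class; the function name is invented.

    #include <sys/param.h>
    #include <sys/lock.h>

    /*
     * Release 'lock' around a blocking operation and then reacquire it,
     * using only the generic lock_object interface.
     */
    static void
    blocking_op_with_lock_dropped(struct lock_object *lock)
    {
        struct lock_class *class;
        int how;

        class = LOCK_CLASS(lock);
        how = class->lc_unlock(lock);   /* returns saved state, e.g. read vs. write */
        /* the blocking operation runs here with the lock released */
        class->lc_lock(lock, how);      /* restore the previous lock state */
    }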
John Baldwin
3ff6d22988 Use C99-style struct member initialization for lock classes. 2007-03-09 16:19:34 +00:00
John Baldwin
ae8dde30c2 Use C99-style struct member initialization for lock classes. 2007-03-09 16:04:44 +00:00
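
An illustrative sketch of what the change in the two entries above amounts to, using an invented lock class; the struct lock_class member names are recalled from that era of the interface, so treat them as an approximation rather than the exact committed code.

    #include <sys/param.h>
    #include <sys/lock.h>

    /* Stub methods for an invented lock class, just to show the initializer style. */
    static void foo_lock_method(struct lock_object *lock, int how) { }
    static int  foo_unlock_method(struct lock_object *lock) { return (0); }
    static void foo_ddb_show(struct lock_object *lock) { }

    /*
     * C99 designated initializers replace the old positional form, so the
     * table stays correct if members are reordered or new ones are added.
     */
    static struct lock_class lock_class_foo = {
        .lc_name = "foo lock",
        .lc_flags = LC_SLEEPLOCK,
        .lc_ddb_show = foo_ddb_show,
        .lc_lock = foo_lock_method,
        .lc_unlock = foo_unlock_method,
    };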
Pawel Jakub Dawidek
2709e8904f Minor simplification. 2007-03-09 05:22:10 +00:00
Mohan Srinivasan
f9bb753844 Over NFS, an open() call could result in multiple over-the-wire
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for cto consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.
2007-03-09 04:02:38 +00:00
Julian Elischer
486a941418 Instead of doing comparisons using the pcpu area to see if
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permanent and never changes for the life of the
system so it doesn't need locking.
2007-03-08 06:44:34 +00:00
Pawel Jakub Dawidek
9e5dcf7b21 White space nits. 2007-03-07 21:24:51 +00:00
John Baldwin
ddb38a1f3d Fix some nits in lock profiling for rwlocks:
- Properly note when a read lock is released.
- Always note when we contest on a read lock.
- Only note success of obtaining read locks for the first reader to match
  the behavior of sx(9).

Reviewed by:	kmacy
2007-03-07 20:48:48 +00:00
Julian Elischer
1d820a15b8 After the last change to KSE threading a bug was introduced where
all threads were counted against the count of upcall capable threads.
This changes the way we do this accounting.
2007-03-07 20:17:41 +00:00
Olivier Houchard
aed12d5ff8 Back out rev 1.17: msleep() can't be used with a spinlock.
Pointy hat to:	cognet
2007-03-06 12:08:38 +00:00
Robert Watson
b5368498b5 Replay minor system call comment cleanup applied to kern_acl.c in a race
with repo-copy of kern_acl.c to vfs_acl.c.
2007-03-05 13:26:07 +00:00
Robert Watson
e6f5470468 Recognize repo-copy of kern_acl.c to vfs_acl.c, remove kern_acl.c,
remove kern_acl.c from the build, connect vfs_acl.c to the build.

Thanks to:	joe
2007-03-05 13:24:01 +00:00
Robert Watson
873fbcd776 Further system call comment cleanup:
- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
  "syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
2007-03-05 13:10:58 +00:00
Wojciech A. Koszek
59f65a4ba6 Change these descriptions of memory types used in malloc(9), as their
current, rather long strings make output from vmstat -m look unpleasant.

Approved by:	cognet (mentor)
2007-03-05 00:21:40 +00:00
Wojciech A. Koszek
d348f4d384 Use msleep(9) instead of tsleep(9) surrounded by lock acquisition and
release.

Approved by:	cognet (mentor)
2007-03-04 23:40:35 +00:00
Robert Watson
0c14ff0eb5 Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
2007-03-04 22:36:48 +00:00
Robert Watson
1a5d072b76 Move to ANSI C function headers. Re-wrap some comments. 2007-03-04 17:50:46 +00:00
John Baldwin
e41bcf3cfc - Don't do the interrupt storm protection stuff for software interrupt
handlers.
- Use pause() when throttling during an interrupt storm.

Reported by:	kris (1)
2007-03-02 17:01:45 +00:00
Kip Macy
c66d760608 lock stats updates need to be protected by the lock 2007-03-02 07:21:20 +00:00
Pawel Jakub Dawidek
bb531912ff Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better
describe the privilege.

OK'ed by:	rwatson
2007-03-01 20:47:42 +00:00
Bruce M Simpson
6f7ca813c4 Do not dispatch SIGPIPE from the generic write path for a socket; with
this patch the code behaves according to the comment on the line above.

Without this patch, a socket could cause SIGPIPE to be delivered to its
process, once with SO_NOSIGPIPE set, and twice without.

With this patch, the kernel now passes the sigpipe regression test.

Tested by:	Anton Yuzhaninov
MFC after:	1 week
2007-03-01 19:20:25 +00:00
Kip Macy
a5bceb77f2 Evidently I've overestimated gcc's ability to peek inside inline functions
and optimize away unused stack values. The 48 bytes that the lock_profile_object
adds to the stack evidently have a measurable performance impact on certain workloads.
2007-03-01 09:35:48 +00:00
Robert Watson
ede6e136f8 Remove two simultaneous acquisitions of multiple unpcb locks from
uipc_send in cases where only a global read lock is held by breaking
them out and avoiding the unpcb lock acquire in the common case.  This
avoids deadlocks which manifested with X11, and should also marginally
further improve performance.

Reported by:	sepotvin, brooks
2007-03-01 09:00:42 +00:00
Robert Watson
3592fd4de5 Lock unp2 after checking for a non-NULL unp2 pointer in uipc_send() on
datagram UNIX domain sockets, not before.
2007-02-28 08:08:50 +00:00
John Baldwin
1a4435ee0e Print tid's rather than thread pointers in KTR_PROC traces. 2007-02-27 18:46:07 +00:00
John Baldwin
4d70511ac3 Use pause() rather than tsleep() on stack variables and function pointers. 2007-02-27 17:23:29 +00:00
John Baldwin
84d37a463a Use pause() rather than tsleep() on explicit global dummy variables. 2007-02-27 17:22:30 +00:00
Paolo Pisati
f2d619c8b1 Do not execute filter only handlers in ithread_execute_handlers():
this fixes the panics when filter only and ithread only handlers were
sharing the same irq.
2007-02-27 17:09:20 +00:00
Kip Macy
f183910b97 Further improvements to LOCK_PROFILING:
- Fix missing initialization in kern_rwlock.c causing bogus times to be collected
 - Move updates to the lock hash to after the lock is released for spin mutexes,
   sleep mutexes, and sx locks
 - Add new kernel build option LOCK_PROFILE_FAST - only update lock profiling
   statistics when an acquisition is contended. This reduces the overhead of
   LOCK_PROFILING to a 20%-25% increase in system time, which on
   "make -j8 kernel-toolchain" on a dual woodcrest is unmeasurable in terms
   of wall-clock time. Contrast this to enabling lock profiling without
   LOCK_PROFILE_FAST and I see a 5x-6x slowdown in wall-clock time.
2007-02-27 06:42:05 +00:00
Robert Watson
e7c33e29ed Revise locking strategy used for UNIX domain sockets in order to improve
concurrency:

- Add per-unpcb mutexes protecting unpcb connection state, fields, etc.

- Replace global UNP mutex with a global UNP rwlock, which will protect the
  UNIX domain socket connection topology, v_socket, and be acquired
  exclusively before acquiring more than per-unpcb at a time in order to
  avoid lock order issues.

In performance measurements involving MySQL, this change has little or no
overhead on UP (+/- 1%), but leads to a significant (5%-30%) improvement in
multi-processor measurements using the sysbench and supersmack benchmarks.

Much testing by:	kris
Approved by:		re (kensmith)
2007-02-26 20:47:52 +00:00
John Baldwin
c0e767f9dd Use NULL rather than 0 for various pointer constants. 2007-02-26 19:28:18 +00:00
Robert Watson
8525230afd Add rw_wowned() interface to rwlock(9), allowing a kernel thread to
determine if it holds an exclusive rwlock reference or not.  This is
non-ideal, but recursion scenarios in the network stack currently
require it.

Approved by:	jhb
2007-02-26 19:05:13 +00:00
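
A small sketch of the recursion-tolerant pattern the new primitive enables; the foo_lock name and the worker function are made up.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rwlock.h>

    static struct rwlock foo_lock;      /* hypothetical */

    static void
    foo_do_locked_work(void)
    {
        int owned;

        /* Only take the write lock if this thread doesn't already own it. */
        owned = rw_wowned(&foo_lock);
        if (!owned)
            rw_wlock(&foo_lock);
        /* work that requires the exclusive lock goes here */
        if (!owned)
            rw_wunlock(&foo_lock);
    }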
John Baldwin
59800afcb5 Mark the kernel linker file as linked so that it is visible to the various
kld*() syscalls.

Tested by:	piso
2007-02-26 16:48:14 +00:00
John Baldwin
4a0f58d25b Fix a comment. 2007-02-26 16:36:48 +00:00
Ruslan Ermilov
fac61393b9 Don't block on the socket zone limit during the socket()
call which can easily lock up a system otherwise; instead,
return ENOBUFS as documented in a manpage, thus reverting
us to the FreeBSD 4.x behavior.

Reviewed by:	rwatson
MFC after:	2 weeks
2007-02-26 10:45:21 +00:00
Kip Macy
fe68a91631 general LOCK_PROFILING cleanup
- only collect timestamps when a lock is contested - this reduces the overhead
  of collecting profiles from 20x to 5x

- remove unused function from subr_lock.c

- generalize cnt_hold and cnt_lock statistics to be kept for all locks

- NOTE: rwlock profiling generates invalid statistics (and most likely always has);
  someone familiar with that should review it
2007-02-26 08:26:44 +00:00
Xin LI
1ad9ee8603 Close race conditions between fork() and [sg]etpriority()'s
PRIO_USER case, and possibly other places that dereference
p_ucred.

In the past, we insert a new process into the allproc list right
after PID allocation, and release the allproc_lock sx.  Because
most content in new proc's structure is not yet initialized,
this could lead to undefined results if we do not handle PRS_NEW
with care.

The problem with PRS_NEW state is that it does not provide fine
grained information about how much initialization is done for a
new process.  By definition, after PRIO_USER setpriority(), all
processes that belong to a given user should have their nice value
set to the specified value.  Therefore, if p_{start,end}copy
section was done for a PRS_NEW process, we can not safely ignore
it because p_nice is in this area.  On the other hand, we should
be careful on PRS_NEW processes because we do not allow non-root
users to lower their nice values, and without a successful copy
of the copy section, we can get stale values that are inherited
from the uninitialized area of the process structure.

This commit tries to close the race condition by grabbing the proc
mutex *before* we release the allproc_lock xlock, and doing the copy as
well as the zeroing immediately after the allproc_lock xunlock.  This
guarantees that the new process will have its p_copy and p_zero
sections, as well as its user credential information, initialized.  In the
getpriority() case, instead of grabbing PROC_LOCK for a PRS_NEW
process, we just skip the process in question, because it does
not affect the final result of the call, as the p_nice value
would be copied from its parent, and we will see it during
allproc traverse.

Other potential solutions are still under evaluation.

Discussed with:	davidxu, jhb, rwatson
PR:		kern/108071
MFC after:	2 weeks
2007-02-26 03:38:09 +00:00
Scott Long
04f0ce213f Fix a case in rman_manage_region() where the resource list would get missorted.
This would in turn confuse rman_reserve_resource().  This was only seen for
MSI resources that can get allocated and deallocated after boot.
2007-02-23 22:53:56 +00:00
John Baldwin
498eccc919 Drop the global kernel linker lock while executing the sysinit's for a
freshly-loaded kernel module.  To avoid various unload races, hide linker
files whose sysinit's are being run from userland so that they can't be
kldunloaded until after all the sysinit's have finished.

Tested by:	gallatin
2007-02-23 19:46:59 +00:00
John Baldwin
37e80fcac2 Add a new kernel sleep function pause(9). pause(9) is for places that
want an equivalent of DELAY(9) that sleeps instead of spins.  It accepts
a wmesg and a timeout and is not interrupted by signals.  It uses a private
wait channel that should never be woken up by wakeup(9) or wakeup_one(9).

Glanced at by:	phk
2007-02-23 16:22:09 +00:00
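
A minimal usage sketch of the new primitive; the wmesg string, timeout, and function name are arbitrary.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>

    /*
     * Sleep for roughly 100 ms instead of spinning in DELAY(); the sleep is
     * not interrupted by signals and uses a private wait channel.
     */
    static void
    wait_a_bit(void)
    {
        pause("dvwait", hz / 10);
    }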
Paolo Pisati
ef544f6312 o break newbus api: add a new argument of type driver_filter_t to
bus_setup_intr()

o add an int return code to all fast handlers

o retire INTR_FAST/IH_FAST

For more info: http://docs.freebsd.org/cgi/getmsg.cgi?fetch=465712+0+current/freebsd-current

Reviewed by: many
Approved by: re@
2007-02-23 12:19:07 +00:00
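
A hedged sketch of the reworked bus_setup_intr() with a filter-only handler; the mydev_* driver and its softc layout are invented, and the flag choice is only illustrative.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bus.h>
    #include <sys/interrupt.h>

    struct mydev_softc {
        struct resource *irq_res;       /* hypothetical */
        void            *irq_cookie;
        volatile int     intr_pending;  /* hypothetical status flag */
    };

    static int
    mydev_filter(void *arg)
    {
        struct mydev_softc *sc = arg;

        /* Runs in primary interrupt context: may not sleep or use sleep locks. */
        if (!sc->intr_pending)
            return (FILTER_STRAY);
        sc->intr_pending = 0;
        return (FILTER_HANDLED);
    }

    static int
    mydev_setup_intr(device_t dev, struct mydev_softc *sc)
    {
        /* The filter is the new fourth argument; no ithread handler here. */
        return (bus_setup_intr(dev, sc->irq_res,
            INTR_TYPE_MISC | INTR_MPSAFE,
            mydev_filter, NULL, sc, &sc->irq_cookie));
    }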
Xin LI
74f094f6a4 Use LIST_EMPTY() instead of unrolled version (LIST_FIRST() [!=]= NULL) 2007-02-22 14:52:59 +00:00
Robert Watson
6fac927ccc Add an additional MAC check to the UNIX domain socket connect path:
check that the subject has read/write access to the vnode using the
vnode MAC check.

MFC after:	3 weeks
Submitted by:	Spencer Minear <spencer_minear at securecomputing dot com>
Obtained from:	TrustedBSD Project
2007-02-22 09:37:44 +00:00
Robert Watson
7ee76f9d4e Remove unnecessary privilege and privilege check for WITNESS sysctl.
Head nod:	jhb
2007-02-20 23:49:31 +00:00
Robert Watson
5b950deabc Break introductory comment into two paragraphs to separate material on the
garbage collection complications from general discussion of UNIX domain
sockets.

Staticize unp_addsockcred().

Remove XXX comment regarding Giant and v_socket -- v_socket is protected
by the global UNIX domain socket lock.
2007-02-20 10:50:02 +00:00
Robert Watson
95420afea4 Remove unused PRIV_IPC_EXEC. Renumbers System V IPC privilege. 2007-02-20 00:12:52 +00:00
Robert Watson
2390d78f74 Sync up PRIV_IPC_{ADMIN,READ,WRITE} priv checks in ipcperm() with
kern_jail.c: allow jailed root these privileges.  This only has an
effect if System V IPC is administratively enabled for the jail.
2007-02-20 00:06:59 +00:00
Robert Watson
b12c55ab92 Restore sysv_ipc.c:1.30, which was backed out due to interactions with
System V shared memory, now believed fixed in sysv_shm.c:1.109:

  date: 2006/11/06 13:42:01;  author: rwatson;  state: Exp;  lines: +65 -37
  Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
  specific privilege names to a broad range of privileges.  These may
  require some future tweaking.

  Sponsored by:           nCircle Network Security, Inc.
  Obtained from:          TrustedBSD Project
  Discussed on:           arch@
  Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                          Alex Lyashkov <umka at sevcity dot net>,
                          Skip Ford <skip dot ford at verizon dot net>,
                          Antoine Brodin <antoine dot brodin at laposte dot net>

This restores fine-grained privilege support to System V IPC.

PR:	106078
2007-02-19 22:59:23 +00:00
Robert Watson
3d50b06b8e Remove call to ipcperm() in shmget_existing(). The flags argument is
ignored on other systems I investigated when accessing an existing
memory segment rather than creating a new one.  This call to ipcperm()
is the only one to pass in a complete mode flag to the permission
checks rather than a simple access request mask, and caused problems
for the revised ipcperm() based on the priv(9) interface, which can
now be restored.

PR:	106078
2007-02-19 22:56:10 +00:00
Robert Watson
95b091d2f2 Rename three quota privileges from the UFS privilege namespace to the
VFS privilege namespace: exceedquota, getquota, and setquota.  Leave
UFS-specific quota configuration privileges in the UFS name space.

This renumbers VFS and UFS privileges, so requires rebuilding modules
if you are using security policies aware of privilege identifiers.
This is likely no one at this point since none of the committed MAC
policies use the privilege checks.
2007-02-19 13:33:10 +00:00
Robert Watson
e82d0201bd Limit quota privileges in jail to PRIV_UFS_GETQUOTA and
PRIV_UFS_SETQUOTA.
2007-02-19 13:26:39 +00:00
Robert Watson
ea04d82da8 Do allow privilege to create over-sized messages on System V IPC
message queues in jail.
2007-02-19 13:23:45 +00:00
Robert Watson
86138fc742 Use priv_check(9) instead of suser(9) for checking the privilege to
set real-time priority on a thread.  It looks like this suser(9)
call was introduced after my first pass through replacing superuser
checks with named privilege checks.
2007-02-19 13:22:36 +00:00
Robert Watson
c3c1b5e62a For now, reflect practical reality that Audit system calls aren't
allowed in Jail: return a privilege error.
2007-02-19 13:10:29 +00:00
Konstantin Belousov
9b2f1a0740 Remove the union_dircheckp hook; it is not needed by the new unionfs code anymore.
As a consequence, getdirentries() no longer needs to drop/reacquire the
directory vnode lock, which would allow the vnode to be reclaimed in between.

Reported and tested by:	Peter Holm
Approved by:		rodrigc (unionfs)
MFC after:		1 week
2007-02-19 10:56:09 +00:00
Pawel Jakub Dawidek
2c7b0f41ec Remove VFS_VPTOFH entirely. The API is already broken and it is a good time to
do it.

Suggested by:	rwatson
2007-02-16 17:32:41 +00:00
Pawel Jakub Dawidek
10bcafe9ab Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should have been a VOP in the first place, but was made a VFS
operation to keep the interface as compatible as possible with Sun's VFS.
BTW, Solaris now also implements vnode-to-file-handle as a VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by:	mckusick
Discussed with:	many (on IRC)
Tested with:	ufs, msdosfs, cd9660, nullfs and zfs
2007-02-15 22:08:35 +00:00
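
A hedged sketch of what a per-filesystem vop_vptofh implementation could look like under the new interface; the myfs/mynode structures are invented and the vop_vptofh_args details are an approximation of the interface the commit describes.

    #include <sys/param.h>
    #include <sys/mount.h>
    #include <sys/vnode.h>

    /* Hypothetical in-memory node and file handle layouts. */
    struct mynode {
        uint32_t n_ino;
    };

    struct myfid {
        u_short  myfid_len;
        u_short  myfid_pad;
        uint32_t myfid_ino;
    };

    static int
    myfs_vptofh(struct vop_vptofh_args *ap)
    {
        struct mynode *np = ap->a_vp->v_data;
        struct myfid *fhp = (struct myfid *)ap->a_fhp;

        fhp->myfid_len = sizeof(struct myfid);
        fhp->myfid_ino = np->n_ino;
        return (0);
    }

    /* Hooked up in the filesystem's vop_vector instead of a VFS-level method: */
    struct vop_vector myfs_vnodeops = {
        .vop_default = &default_vnodeops,
        .vop_vptofh  = myfs_vptofh,
    };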
Luigi Rizzo
33d5497079 Cleanup and document the implementation of firmware(9) based on
a version that I posted earlier on the -current mailing list,
and subsequent feedback received.

The core of the change is just in sys/firmware.h and kern/subr_firmware.c,
while other files are just adaptation of the clients to the ABI change
(const-ification of some parameters and hiding of internal info,
so this is fully compatible at the binary level).

In detail:
- reduce the amount of information exported to clients in struct firmware,
  and constify the pointer;

- internally, document and simplify the implementation of the various
  functions, and make sure error conditions are dealt with properly.

The diffs are large, but the code is really straightforward now (I hope).

Note also that there is a subtle issue with the implementation of
firmware_register(): currently, as in the previous version, we just
store a reference to the 'imagename' argument, but we should rather
copy it because there is no guarantee that this is a static string.
I realised this while testing this code, but I prefer to fix it in
a later commit -- there is no regression with respect to the past.

Note, too, that the version in RELENG_6 has various bugs including
missing locks around the module release calls, mishandling of modules
loaded by /boot/loader, and so on, so an MFC is absolutely necessary
there.  I was just postponing it until this cleanup to avoid doing
things twice.

MFC after: 1 week
2007-02-15 17:21:31 +00:00
Robert Watson
780a98ad1f Catch up file descriptor printing function in DDB to the addition of kqueues
and POSIX message queues.
2007-02-15 10:55:43 +00:00
Robert Watson
442f65e958 Break file descriptor printing logic out of db_show_files() into
db_print_file(), and add a new "show file <ptr>" DDB command, which can
be used to print out file descriptors referenced in stack traces.
2007-02-15 10:50:48 +00:00
Robert Watson
f58dd47091 Rename somaxconn_sysctl() to sysctl_somaxconn() so that I will be able to
claim that sofoo() functions all accept a socket as their first argument.
2007-02-15 10:11:00 +00:00
Konstantin Belousov
478a8db4ce If both ISDOTDOT and NOCROSSMOUNT are set then lookup() might break out
of the special handling for ".." and perform an ISDOTDOT VOP_LOOKUP()
for a filesystem root vnode. Handle this case inside lookup().

Submitted by:	tegge
PR:		92785
MFC after:	1 week
2007-02-15 09:53:49 +00:00
Robert Watson
c3b162d54e Teach DDB how to print sockets, socket buffers, protosw's, and domain
structures given pointers to them.
2007-02-15 01:28:22 +00:00
Robert Watson
aea52f1bf8 Minor rearrangement of global variables, comments, etc, in UNIX domain
sockets.
2007-02-14 15:05:40 +00:00
Robert Watson
46a1d9bfe8 Change unp_mtx to support recursion, and do not drop the unp_mtx over
sonewconn() in unp_connect().  This avoids a race that occurs due to
v_socket being an uncounted reference, as the lock was being released in
order to call sonewconn(), which otherwise recurses into the UNIX domain
socket code via pru_attach, as well as holding the lock over a sleeping
memory allocation in uipc_attach().  Switch to a non-sleeping memory
allocation during UNIX domain socket attach.

This fix is non-ideal in that it requires enabling recursion, but is a much
smaller change than moving to using true references for v_socket.  The
reported panic occurs in unp_connect() following the return of
sonewconn().

Update copyright year.

Panic reported by:      jhb
2007-02-14 12:22:11 +00:00
Robert Watson
05102f04d5 Set UNP_CONNECTING when committing to moving ahead in unp_connect().
This logic was lost when merging the remainder of these changes in
1.178.
2007-02-13 21:00:57 +00:00
Olivier Houchard
38cc2a5caa Make vfs_getopts() set *error to ENOENT if the option wasn't found, so that
consumers don't have to check for both error and the return value (some of
them actually don't do it).

MFC After:	1 week
2007-02-13 01:28:48 +00:00
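
A short sketch of what the new contract allows on the consumer side; the myfs function and the "from" option are illustrative only.

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <sys/mount.h>

    /*
     * With *error now set to ENOENT when the option is absent, a consumer can
     * distinguish "not supplied" from real errors with a single check.
     */
    static int
    myfs_parse_from(struct vfsoptlist *opts, char **fspecp)
    {
        int error;

        *fspecp = vfs_getopts(opts, "from", &error);
        if (error == ENOENT) {
            *fspecp = NULL;     /* optional option simply missing */
            return (0);
        }
        return (error);
    }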
Mike Pritchard
51fd6380c5 Do not do a vn_close for all references to the ktraced file if we are
doing a CLEARFILE option.  Do a vrele instead.  This prevents
a panic later due to v_writecount being negative when the vnode
is taken off the freelist.

Submitted by:	jhb
2007-02-13 00:20:13 +00:00
Mike Pritchard
87aabdc126 Add a VNASSERT to vn_close to detect if v_writecount is going
to become negative.  This will detect the underflow when it
happens, instead of having it discovered when the vnode is
taken off the freelist, long after the offending process is long
gone.
2007-02-12 22:53:01 +00:00
Craig Rodrigues
d139ce67c0 Makefile changes to reflect moving sys/isofs/cd9660 to sys/fs/cd9660.
Continue to install userland include files in /usr/include/isofs/cd9660
so as not to break userland applications such as libstand.
2007-02-11 14:01:32 +00:00
Xin LI
d60226bd43 Report which signal the caller has attempted to deliver when panicking. 2007-02-09 17:48:28 +00:00
Jeff Roberson
ed0e8f2fe9 - Change types for recent runq additions to u_char rather than int.
- Fix these types in ULE as well.  This fixes bugs in priority index
   calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.

Reported by:	kkenn using pho's stress2.
2007-02-08 01:52:25 +00:00
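
A standalone userland C demonstration of the signedness point made above ((int)-1 % 64 != (uint)-1 % 64); the constant 64 mirrors the run-queue size, everything else is illustrative.

    #include <stdio.h>

    int
    main(void)
    {
        int           spri = -1;                    /* what the signed math produced */
        unsigned char upri = (unsigned char)-1;     /* wraps to 255 */

        /* Prints "-1 63": only the unsigned form is a valid queue index. */
        printf("%d %d\n", spri % 64, upri % 64);
        return (0);
    }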
Alan Cox
0e2056ee7f Remove the vm page queue free mutex from the CDEV order. 2007-02-07 05:43:31 +00:00
Robert Watson
1f837c4753 Push UNIX domain socket locking further into uipc_ctloutput() in order to
avoid holding the UNIX domain socket subsystem lock over soooptcopyin()
and sooptcopyout().  This problem was introduced when LOCAL_CREDS and
LOCAL_CONNWAIT support were added.

Reviewed by:	mdodd
2007-02-06 14:31:37 +00:00
Mike Pritchard
af7a34173d The change to the vm_page_queue_freelist lock from a spin lock to a
sleep lock missed the witness code, and the system will panic
immediately on boot if WITNESS is enabled.

Changed the witness definition to the new type.
2007-02-06 05:51:55 +00:00
Max Laier
38d4db193b Add a small informative printf under bootverbose to firmware_register to
track problems when loading firmware from loader.
2007-02-03 16:01:46 +00:00
Bruce M Simpson
7dc8d021ea Diff reduction with RELENG_6, style(9):
Remove unnecessary brace; && should be on end of line.
No functional changes.
2007-02-03 03:57:45 +00:00
Bruce M Simpson
217f71d80c Use int instead of u_int for the 'extra' argument to the
clone_create() KPI.
This fixes a signedness bug in unit number comparisons.

Submitted by:	imp, Landon Fuller
PR:		kern/105228
MFC after:	2 weeks
2007-02-02 22:27:45 +00:00
Konstantin Belousov
e6a4f4cd40 Record kqueue -> struct mount mtx -> vnode interlock lock order to
catch the places where reverse lock order is instantiated.

OKed by:	jeff
2007-02-02 09:02:18 +00:00
Julian Elischer
c6226eea4c Move the setting of the idle_mask bits to a place where they
can't be wrong.
Also use the IDLETD bit in the thread mask to test if it's an idle thread
rather than doing a PCPU access.
2007-02-02 05:14:22 +00:00
Andre Oppermann
6a37f331d7 Generic socket buffer auto sizing support, header defines, flag inheritance.
MFC after:	1 month
2007-02-01 17:53:41 +00:00
Max Laier
191c2cea1c In case we are supplied with an imagename that matches a module, but not a
firmware in that module (even though this is a programming error), drop the
reference to the module again.

Submitted by:	Benjamin Close
MFC after:	3 days
2007-01-27 19:52:08 +00:00
Jeff Roberson
fc3a97dcb7 - Implement much more intelligent ipi sending. This algorithm tries to
minimize IPIs and rescheduling when scheduling like tasks while keeping
   latency low for important threads.
   1) An idle thread is running.
   2) The current thread is worse than realtime and the new thread is
      better than realtime.  Realtime to realtime doesn't preempt.
   3) The new thread's priority is less than the threshold.
2007-01-25 23:51:59 +00:00
Jeff Roberson
1461899028 - Get rid of the unused DIDRUN flag. This was really only present to
support sched_4bsd.
 - Rename the KTR level for non schedgraph parsed events.  They take event
   space from things we'd like to graph.
 - Reset our slice value after we sleep.  The slice is simply there to
   prevent starvation among equal priorities.  A thread which had almost
   exhausted its slice and then slept doesn't need to be rescheduled a
   tick after it wakes up.
 - Set the maximum slice value to a more conservative 100ms now that it is
   more accurately enforced.
2007-01-25 19:14:11 +00:00
Mohan Srinivasan
6c125b8df6 Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@
2007-01-25 01:05:23 +00:00
Jeff Roberson
9a93305a2e - With a sleep time over 2097 seconds hzticks and slptime could end up
negative.  Use unsigned integers for sleep and run time so this doesn't
   disturb sched_interact_score().  This should fix the invalid interactive
   priority panics reported by several users.
2007-01-24 18:18:43 +00:00
Randall Stewart
6dbde03086 Fixes MSG_PEEK for sctp_generic_recvmsg(): the msg_flags
were not being copied in properly, so PEEK and any other
msg_flags input operations were not being performed correctly.
Approved by:	gnn
2007-01-24 12:59:56 +00:00
Konstantin Belousov
2cc7d26f7f Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then the kernel would panic due to recursive buffer lock
acquisition.

Avoid dealing with buffers in bdwrite() that are on the other side of the
snaplock divisor in the lock order than the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by:	tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by:	Peter Holm
X-MFC after:	3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
Jeff Roberson
7a5e5e2a59 - Catch up to setrunqueue/choosethread/etc. api changes.
- Define our own maybe_preempt() as sched_preempt().  We want to be able
   to preempt idlethread in all cases.
 - Define our idlethread to require preemption to exit.
 - Get the cpu estimation tick from sched_tick() so we don't have to worry
   about errors from a sampling interval that differs from the time
   domain.  This was the source of sched_priority prints/panics and
   inaccurate pctcpu display in top.
2007-01-23 08:50:34 +00:00
Jeff Roberson
f0393f063a - Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty.  The few asserts and thread state
   setting were moved to the individual schedulers.  sched_add() was
   chosen to displace it for naming consistency reasons.
 - Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
   different on all three schedulers where it was only called in one place
   each.
 - Remove the long ifdef'd out remrunqueue code.
 - Remove the now redundant ts_state.  Inspect the thread state directly.
 - Don't set TSF_* flags from kern_switch.c, we were only doing this to
   support a feature in one scheduler.
 - Change sched_choose() to return a thread rather than a td_sched.  Also,
   rely on the schedulers to return the idlethread.  This simplifies the
   logic in choosethread().  Aside from the run queue links kern_switch.c
   mostly does not care about the contents of td_sched.

Discussed with:	julian

 - Move the idle thread loop into the per scheduler area.  ULE wants to
   do something different from the other schedulers.

Suggested by:	jhb

Tested on:	x86/amd64 sched_{4BSD, ULE, CORE}.
2007-01-23 08:46:51 +00:00
Craig Rodrigues
61e323a2fa When exiting vfs_export(), delete the "export" option from
the mount options list with vfs_deleteopt().  At this point, the export
information is saved in mp->mnt_export, so we can delete
the "export" mount option from mp->mnt_optnew and mp->mnt_opt.

This fixes read-write/read-only update mounts (mount -u -o rw, mount -u -o ro)
of NFS exported directories.

For some reason, I could only reproduce the problem with a configuration
supplied by Andre:
- "options QUOTA" enabled in kernel config
- "/ -maproot=root 10.0.1.105" in /etc/exports

Reported by:	kris, Andre Guibert de Bruet <andy siliconlandmark com>,
            	Andrzej Tobola <ato iem pw edu pl>
Tested by:	Andre Guibert de Bruet
2007-01-23 06:19:16 +00:00
Andre Oppermann
7c32173ba8 Unbreak writes of 0 bytes. Zero byte writes happen when only ancillary
control data but no payload data is passed.

Change m_uiotombuf() to return at least one empty mbuf if the requested
length was zero.  Add comment to sosend_dgram and sosend_generic().

Diagnosed by:		jhb
Regression test by:	rwatson
Pointy hat to:		andre
2007-01-22 14:50:28 +00:00
Konstantin Belousov
7f92c4ee02 Below is slightly edited description of the LOR by Tor Egge:
--------------------------
[Deadlock] is caused by a lock order reversal in vfs_lookup(), where
[some] process is trying to lock a directory vnode (that is, the parent
directory of the covered vnode) while holding an exclusive vnode lock on the
covering vnode.

A simplified scenario:

root fs					var fs
/    		A			/    (/var)	D
/var		B			/log (/var/log) E
vfs lock	C			vfs lock	F

Within each file system, the lock order is clear: C->A->B and F->D->E

When traversing across mounts, the system can choose between two lock orders,
but everything must then follow that lock order:

      L1: C->A->B
		|
	        +->F->D->E

      L2: F->D->E
	     |
             +->C->A->B

The lookup() process for namei("/var") mixes those two lock orders:

    VOP_LOOKUP() obtains B while A is held
    vfs_busy() obtains a shared lock on F while A and B are held (follows L1,
    violates L2)
    vput() releases lock on B
    VOP_UNLOCK() releases lock on A
    VFS_ROOT() obtains lock on D while shared lock on F is held
    vfs_unbusy() releases shared lock on F
    vn_lock() obtains lock on A while D is held (violates L1, follows L2)

dounmount() follows L1 (B is locked while F is drained).

Without unmount activity, vfs_busy() will always succeed without blocking
and the deadlock isn't triggered (the system behaves as if L2 is followed).

With unmount, you can get 4 processes in a deadlock:

     p1: holds D, want A (in lookup())
     p2: holds shared lock on F, want D (in VFS_ROOT())
     p3: holds B, want drain lock on F (in dounmount())
     p4: holds A, want B (in VOP_LOOKUP())

You can have more than one instance of p2.

The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and
MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs
servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode.

- Tor Egge

To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp
is actually not used by the callers of namei. Thus, a placeholder deadfs
vnode, vp_crossmp, is introduced and filled into ni_dvp.

Idea by:	ups
Reviewed by:	tegge, ups, jeff, rwatson (mac interaction)
Tested by:	Peter Holm
MFC after:	2 weeks
2007-01-22 11:25:22 +00:00
Jeff Roberson
5cea64d54f - Disable the long-term load balancer. I believe that steal_busy works
better and gives more predictable results.
2007-01-20 21:24:05 +00:00
Jeff Roberson
c95d2db298 - We do need to IPI the idlethread on some systems. It may be stuck in
a power saving mode otherwise.
 - If the thread is already bound in sched_bind() unbind it before
   re-binding it to a new cpu.  I don't like these semantics but they are
   expected by some code in the tree.  Patch by jkoshy.
2007-01-20 17:03:33 +00:00
Jeff Roberson
6b2f763f7c - In tdq_transfer() always set NEEDRESCHED when necessary regardless of
the ipi settings.  If NEEDRESCHED is set and an ipi is later delivered
   it will clear it rather than cause extra context switches.  However, if
   we miss setting it we can have terrible latency.
 - In sched_bind() correctly implement bind.  Also be slightly more
   tolerant of code which calls bind multiple times.  However, we don't
   change binding if another call is made with a different cpu.  This
   does not presently work with hwpmc which I believe should be changed.
2007-01-20 09:03:43 +00:00
Jeff Roberson
7b8bfa0de9 Major revamp of ULE's cpu load balancing:
- Switch back to direct modification of remote CPU run queues.  This added
   a lot of complexity with questionable gain.  It's easy enough to
   reimplement if it's shown to help on huge machines.
 - Re-implement the old tdq_transfer() call as tdq_pickidle().  Change
   sched_add() so we have selectable cpu choosers and simplify the logic
   a bit here.
 - Implement tdq_pickpri() as the new default cpu chooser.  This algorithm
   is similar to Solaris in that it tries to always run the threads with
   the best priorities.  It is actually slightly more complex than
   solaris's algorithm because we also tend to favor the local cpu over
   other cpus which has a boost in latency but also potentially enables
   cache sharing between the waking thread and the woken thread.
 - Add a bunch of tunables that can be used to measure effects of different
   load balancing strategies.  Most of these will go away once the
   algorithm is more definite.
 - Add a new mechanism to steal threads from busy cpus when we idle.  This
   is enabled with kern.sched.steal_busy and kern.sched.busy_thresh.  The
   threshold is the required length of a tdq's run queue before another
   cpu will be able to steal runnable threads.  This prevents most queue
   imbalances that contribute to long latencies.
2007-01-19 21:56:08 +00:00
Xin LI
4f506694bb Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form. 2007-01-17 14:58:53 +00:00
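
A small sketch of the macro in use (it expands to a LIST_FOREACH over the allproc list); the counting function is invented.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/sx.h>
    #include <sys/proc.h>

    static int
    count_processes(void)
    {
        struct proc *p;
        int n = 0;

        sx_slock(&allproc_lock);
        FOREACH_PROC_IN_SYSTEM(p)
            n++;
        sx_sunlock(&allproc_lock);
        return (n);
    }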
Suleiman Souhlal
e8ac01c56a Remove hptlock from the static witness table, now that it's a regular sleep
mutex.
2007-01-16 22:56:28 +00:00
Randall Stewart
9b3386570c Remove the useless (flags | ) KASSERT; the ^ one actually
does what we want.

Submitted by:	Li Xin delphij@delphij.net
Reviewed by:	rrs
Approved by:	gnn
2007-01-16 11:40:55 +00:00
Kip Macy
e440d8fff5 Fix warning by adding extra parentheses 2007-01-16 00:09:58 +00:00
Randall Stewart
b939bb368a Reviewed by: rwatson
Approved by:	gnn

Add a new function hashinit_flags() which allows NOT-waiting
for memory (or waiting). The old hashinit() function now
calls hashinit_flags(..., HASH_WAITOK);
2007-01-15 15:06:28 +00:00
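
A hedged sketch of the non-sleeping variant in use; HASH_NOWAIT is assumed as the counterpart of the HASH_WAITOK flag named in the commit, and the my_* names and table size are invented.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <sys/queue.h>

    static MALLOC_DEFINE(M_MYHASH, "myhash", "hypothetical hash table");

    struct myentry;                                     /* hypothetical element type */
    static LIST_HEAD(myhashhead, myentry) *my_hashtbl;
    static u_long my_hashmask;

    static int
    my_hash_init(void)
    {
        /* Non-sleeping allocation; may fail, unlike classic hashinit(). */
        my_hashtbl = hashinit_flags(64, M_MYHASH, &my_hashmask, HASH_NOWAIT);
        if (my_hashtbl == NULL)
            return (ENOMEM);
        return (0);
    }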
Robert Watson
b0c521e29c Re-wrap comments to wider margins now that they have been relocated from
within functions.
2007-01-12 22:01:03 +00:00
Warner Losh
fe18f3853e When ntp_gettime() was converted from a sysctl + wrapper to a system
call, its semantics were unintentionally changed.  It went from
returning the time state to returning 0 or -1.  Since 0 means time
normal, and non-zero effectively only shows up around leap seconds,
this went unnoticed until now.  At least unnoticed until someone was
trying to run a binary they didn't have source for and it was
misbehaving...

Submitted by: Judah Levine
MFC After: 2 weeks
2007-01-12 07:40:30 +00:00
John Baldwin
19c80b2652 Wrap propagate_priority() in a critical section to prevent unwanted
preemptions when adjusting the priority of a thread that is on a run
queue.  This was only observed when FULL_PREEMPTION was enabled.

Reported by:	kris
Diagnosed by:	ups
MFC after:	1 week
2007-01-11 19:13:27 +00:00
Robert Watson
ef08c42034 Sort copyrights together.
MFC after:	3 days
2007-01-08 20:37:02 +00:00
Robert Watson
fcdc50ebc1 Resort copyrights and licenses in kern_acct.c: per UCB letter,
the UCB license now excludes the advertising clause.  I'm not
interested in it either, so move my copyright.  This leaves
only a CGD copyright with the advertising clause.

MFC after:      3 days
2007-01-08 20:35:13 +00:00
Robert Watson
abdeb3b01f Canonicalize copyrights in some files I hold copyrights on:
- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.

MFC after:	3 days
2007-01-08 17:49:59 +00:00
Jeff Roberson
eddb4efacd - Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer
divide faults in roundup() later if it is able to return 0.  For some
   reason this bug only shows up on my laptop and not my testboxes.
2007-01-06 12:33:43 +00:00
Jeff Roberson
1e516cf534 - Fix the sched_priority() invalid priority bugs. Use roundup() instead
of max() when computing the divisor in SCHED_TICK_PRI().  This prevents
   cases where rounding down would allow the quotient to exceed
   SCHED_PRI_RANGE.
 - Garbage collect some unused flags and fields.
 - Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
   duplicated this functionality.
 - Re-enable the rebalancer by default and fix the sysctl so it can be
   modified.
2007-01-06 08:44:13 +00:00
Jeff Roberson
9330bbbb61 - Don't IPI unless we're going to interrupt something exiting in the kernel.
Otherwise we can afford the latency.  This makes a significant performance
   improvement.
2007-01-06 02:34:23 +00:00
Jeff Roberson
155b6ca12b - Fix a comparison in sched_choose() that caused cpus to be constantly
marked idle, thus breaking cpu load balancing.
 - Change sched_interact_update() to fix cases where the stored history
   has expanded significantly rather than handling them in the callers.  This
   fixes a case where sched_priority() could compute a bad value.
 - Add a sysctl to disable the global load balancer for experimentation.
2007-01-05 23:45:38 +00:00
John Baldwin
9ae328fc8f - Close a race between enumerating UNIX domain socket pcb structures via
sysctl and socket teardown by adding a reference count to the UNIX domain
  pcb object and fixing the sysctl that enumerates unpcbs to grab a
  reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
  file descriptor teardown (fdrop()) by adding a new garbage collection
  flag FWAIT.  unp_gc() sets FWAIT while it walks the message buffers
  in a UNIX domain socket looking for nested file descriptor references
  and clears the flag when it is finished.  fdrop() checks to see if the
  flag is set on a file descriptor whose refcount just dropped to 0 and
  waits for unp_gc() to clear the flag before completely destroying the
  file descriptor.

MFC after:	1 week
Reviewed by:	rwatson
Submitted by:	ups
Hopefully makes the panics go away:	mx1
2007-01-05 19:59:46 +00:00
Jeff Roberson
8ab80cf009 - ftick was initialized to -1 for init and any of its children. Fix this by
setting ftick = ltick = ticks in schedinit().
 - Update the priority when we are pulled off of the run queue and when we
   are inserted onto the run queue so that it more accurately reflects our
   present status.  This is important for efficient priority propagation
   functioning.
 - Move the frequency test into sched_pctcpu_update() so we don't repeat it
   each time we'd like to call it.
 - Put some temporary work-around code in sched_priority() in case the tick
   mechanism produces a bad priority.  Eventually this should revert to an
   assert again.
2007-01-05 08:50:38 +00:00