unsigned char. Weirdly, casting the 1 constant to u_char still produces
a signed integer result that is then used in the % computation. This
avoids that mess all together and causes a 0 pri to turn into 255 % 64
as we expect.
Reported by: kkenn (about 4 times, thanks)
late stages of unmount). On failure, the vnode is recycled.
Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.
Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.
Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.
Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.
Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib
sosend_copyin().
- Use M_WAITOK instead of M_TRYWAIT in sosend_copyin().
- Don't check for NULL from M_WAITOK and return ENOBUFS.
M_WAITOK/M_TRYWAIT allocations don't fail with NULL.
Reviewed by: andre
Requested by: andre (2)
event. Locking primitives that support this (mtx, rw, and sx) now each
include their own foo_sleep() routine.
- Rename msleep() to _sleep() and change it's 'struct mtx' object to a
'struct lock_object' pointer. _sleep() uses the recently added
lc_unlock() and lc_lock() function pointers for the lock class of the
specified lock to release the lock while the thread is suspended.
- Add wrappers around _sleep() for mutexes (mtx_sleep()), rw locks
(rw_sleep()), and sx locks (sx_sleep()). msleep() still exists and
is now identical to mtx_sleep(), but it is deprecated.
- Rename SLEEPQ_MSLEEP to SLEEPQ_SLEEP.
- Rewrite much of sleep.9 to not be msleep(9) centric.
- Flesh out the 'RETURN VALUES' section in sleep.9 and add an 'ERRORS'
section.
- Add __nonnull(1) to _sleep() and msleep_spin() so that the compiler will
warn if you try to pass a NULL wait channel. The functions already have
a KASSERT to that effect.
These functions are intended to be used to drop a lock and then reacquire
it when doing an sleep such as msleep(9). Both functions accept a
'struct lock_object *' as their first parameter. The 'lc_unlock' function
returns an integer that is then passed as the second paramter to the
subsequent 'lc_lock' function. This can be used to communicate state.
For example, sx locks and rwlocks use this to indicate if the lock was
share/read locked vs exclusive/write locked.
Currently, spin mutexes and lockmgr locks do not provide working lc_lock
and lc_unlock functions.
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for cto consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permenent and never chenges for the life of the
system so it doesn't need locking.
- Properly note when a read lock is released.
- Always note when we contest on a read lock.
- Only note success of obtaining read locks for the first reader to match
the behavior of sx(9).
Reviewed by: kmacy
- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.
Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
this patch the code behaves according to the comment on the line above.
Without this patch, a socket could cause SIGPIPE to be delivered to its
process, once with SO_NOSIGPIPE set, and twice without.
With this patch, the kernel now passes the sigpipe regression test.
Tested by: Anton Yuzhaninov
MFC after: 1 week
and optimize away unused stack values. The 48 bytes that the lock_profile_object
adds to the stack evidently has a measurable performance impact on certain workloads.
uipc_send in cases where only a global read lock is held by breaking
them out and avoiding the unpcb lock acquire in the common case. This
avoids deadlocks which manifested with X11, and should also marginally
further improve performance.
Reported by: sepotvin, brooks
- Fix missing initialization in kern_rwlock.c causing bogus times to be collected
- Move updates to the lock hash to after the lock is released for spin mutexes,
sleep mutexes, and sx locks
- Add new kernel build option LOCK_PROFILE_FAST - only update lock profiling
statistics when an acquisition is contended. This reduces the overhead of
LOCK_PROFILING to increasing system time by 20%-25% which on
"make -j8 kernel-toolchain" on a dual woodcrest is unmeasurable in terms
of wall-clock time. Contrast this to enabling lock profiling without
LOCK_PROFILE_FAST and I see a 5x-6x slowdown in wall-clock time.
concurrency:
- Add per-unpcb mutexes protecting unpcb connection state, fields, etc.
- Replace global UNP mutex with a global UNP rwlock, which will protect the
UNIX domain socket connection topology, v_socket, and be acquired
exclusively before acquiring more than per-unpcb at a time in order to
avoid lock order issues.
In performance measurements involving MySQL, this change has little or no
overhead on UP (+/- 1%), but leads to a significant (5%-30%) improvement in
multi-processor measurements using the sysbench and supersmack benchmarks.
Much testing by: kris
Approved by: re (kensmith)
determine if it holds an exclusive rwlock reference or not. This is
non-ideal, but recursion scenarios in the network stack currently
require it.
Approved by: jhb
call which can easily lock up a system otherwise; instead,
return ENOBUFS as documented in a manpage, thus reverting
us to the FreeBSD 4.x behavior.
Reviewed by: rwatson
MFC after: 2 weeks
- only collect timestamps when a lock is contested - this reduces the overhead
of collecting profiles from 20x to 5x
- remove unused function from subr_lock.c
- generalize cnt_hold and cnt_lock statistics to be kept for all locks
- NOTE: rwlock profiling generates invalid statistics (and most likely always has)
someone familiar with that should review
PRIO_USER case, possibly also other places that deferences
p_ucred.
In the past, we insert a new process into the allproc list right
after PID allocation, and release the allproc_lock sx. Because
most content in new proc's structure is not yet initialized,
this could lead to undefined result if we do not handle PRS_NEW
with care.
The problem with PRS_NEW state is that it does not provide fine
grained information about how much initialization is done for a
new process. By defination, after PRIO_USER setpriority(), all
processes that belongs to given user should have their nice value
set to the specified value. Therefore, if p_{start,end}copy
section was done for a PRS_NEW process, we can not safely ignore
it because p_nice is in this area. On the other hand, we should
be careful on PRS_NEW processes because we do not allow non-root
users to lower their nice values, and without a successful copy
of the copy section, we can get stale values that is inherted
from the uninitialized area of the process structure.
This commit tries to close the race condition by grabbing proc
mutex *before* we release allproc_lock xlock, and do copy as
well as zero immediately after the allproc_lock xunlock. This
guarantees that the new process would have its p_copy and p_zero
sections, as well as user credential informaion initialized. In
getpriority() case, instead of grabbing PROC_LOCK for a PRS_NEW
process, we just skip the process in question, because it does
not affect the final result of the call, as the p_nice value
would be copied from its parent, and we will see it during
allproc traverse.
Other potential solutions are still under evaluation.
Discussed with: davidxu, jhb, rwatson
PR: kern/108071
MFC after: 2 weeks
freshly-loaded kernel module. To avoid various unload races, hide linker
files whose sysinit's are being run from userland so that they can't be
kldunloaded until after all the sysinit's have finished.
Tested by: gallatin
want an equivalent of DELAY(9) that sleeps instead of spins. It accepts
a wmesg and a timeout and is not interrupted by signals. It uses a private
wait channel that should never be woken up by wakeup(9) or wakeup_one(9).
Glanced at by: phk
check that the subject has read/write access to the vnode using the
vnode MAC check.
MFC after: 3 weeks
Submitted by: Spencer Minear <spencer_minear at securecomputing dot com>
Obtained from: TrustedBSD Project
garbage collection complications from general discussion of UNIX domain
sockets.
Staticize unp_addsockcred().
Remove XXX comment regarding Giant and v_socket -- v_socket is protected
by the global UNIX domain socket lock.
System V shared memory, now believed fixed in sysv_shm.c:1.109:
date: 2006/11/06 13:42:01; author: rwatson; state: Exp; lines: +65 -37
Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
This restores fine-grained privilege support to System V IPC.
PR: 106078
ignored on other systems I investigated when accessing an existing
memory segment rather than creating a new one. This call to ipcperm()
is the only one to pass in a complete mode flag to the permission
checks rather than a simple access request mask, and caused problems
for the revised ipcperm() based on the priv(9) interface, which can
now be restored.
PR: 106078
VFS privilege namespace: exceedquota, getquota, and setquota. Leave
UFS-specific quota configuration privileges in the UFS name space.
This renumbers VFS and UFS privileges, so requires rebuilding modules
if you are using security policies aware of privilege identifiers.
This is likely no one at this point since none of the committed MAC
policies use the privilege checks.
set real-time priority on a thread. It looks like this suser(9)
call was introduced after my first pass through replacing superuser
checks with named privilege checks.
As consequence, getdirentries() no longer needs to drop/reacquire
directory vnode lock, that would allow it to be reclaimed in between.
Reported and tested by: Peter Holm
Approved by: rodrigc (unionfs)
MFC after: 1 week
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.
Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.
VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.
Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs
a version that i posted earlier on the -current mailing list,
and subsequent feedback received.
The core of the change is just in sys/firmware.h and kern/subr_firmware.c,
while other files are just adaptation of the clients to the ABI change
(const-ification of some parameters and hiding of internal info,
so this is fully compatible at the binary level).
In detail:
- reduce the amount of information exported to clients in struct firmware,
and constify the pointer;
- internally, document and simplify the implementation of the various
functions, and make sure error conditions are dealt with properly.
The diffs are large, but the code is really straightforward now (i hope).
Note also that there is a subtle issue with the implementation of
firmware_register(): currently, as in the previous version, we just
store a reference to the 'imagename' argument, but we should rather
copy it because there is no guarantee that this is a static string.
I realised this while testing this code, but i prefer to fix it in
a later commit -- there is no regression with respect to the past.
Note, too, that the version in RELENG_6 has various bugs including
missing locks around the module release calls, mishandling of modules
loaded by /boot/loader, and so on, so an MFC is absolutely necessary
there. I was just postponing it until this cleanup to avoid doing
things twice.
MFC after: 1 week
of the special handling for ".." and perform an ISDOTDOT VOP_LOOKUP()
for a filesystem root vnode. Handle this case inside lookup().
Submitted by: tegge
PR: 92785
MFC after: 1 week
sonewconn() in unp_connect(). This avoids a race that occurs due to
v_socket being an uncounted reference, as the lock was being released in
order to call sonewconn(), which otherwise recurses into the UNIX domain
socket code via pru_attach, as well as holding the lock over a sleeping
memory allocation in uipc_attach(). Switch to a non-sleeping memory
allocation during UNIX domain socket attach.
This fix non-ideal in that it requires enabling recursion, but is a much
smaller change than moving to using true references for v_socket. The
reported panic occurs in unp_connect() following the return of
sonewconn().
Update copyright year.
Panic reported by: jhb
doing a CLEARFILE option. Do a vrele instead. This prevents
a panic later due to v_writecount being negative when the vnode
is taken off the freelist.
Submitted by: jhb
to become negative. This will detect the underflow when it
happens, instead of having it discovered when the vnode is
taken off the freelist, long after the offending process is long
gone.
- Fix these types in ULE as well. This fixes bugs in priority index
calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.
Reported by: kkenn using pho's stress2.
avoid holding the UNIX domain socket subsystem lock over soooptcopyin()
and sooptcopyout(). This problem was introduced when LOCAL_CREDS, and
LOCAL_CONNWAIT support were added.
Reviewed by: mdodd
sleep lock missed the witness code, and the system will panic
immediately on boot if WITNESS is enabled.
Changed the witness definition to the new type.
firmware in that module (eventhough this is a programming error) - drop the
reference to the module again.
Submitted by: Benjamin Close
MFC after: 3 days
minimize IPIs and rescheduling when scheduling like tasks while keeping
latency low for important threads.
1) An idle thread is running.
2) The current thread is worse than realtime and the new thread is
better than realtime. Realtime to realtime doesn't preempt.
3) The new thread's priority is less than the threshold.
support sched_4bsd.
- Rename the KTR level for non schedgraph parsed events. They take event
space from things we'd like to graph.
- Reset our slice value after we sleep. The slice is simply there to
prevent starvation among equal priorities. A thread which had almost
exhausted it's slice and then slept doesn't need to be rescheduled a
tick after it wakes up.
- Set the maximum slice value to a more conservative 100ms now that it is
more accurately enforced.
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.
Reviewed by: andre@
negative. Use unsigned integers for sleep and run time so this doesn't
disturb sched_interact_score(). This should fix the invalid interactive
priority panics reported by several users.