UNIX domain socket garbage collection implementation, as that risks
holding the mutex over potentially sleeping operations (as well as
introducing some nasty lock order issues, etc). unp_gc() will hold
the lock long enough to do necessary deferal checks and set that it's
running, but then release it until it needs to reset the gc state.
RELENG_5 candidate.
Discussed with: alfred
buffers with kqueue filters is no longer required: the kqueue framework
will guarantee that the mutex is held on entering the filter, either
due to a call from the socket code already holding the mutex, or by
explicitly acquiring it. This removes the last of the conditional
socket locking.
We were obtaining different spin mutexes (which disable interrupts after
aquisition) and spin waiting for delivery. For example, KSE processes
do LDT operations which use smp_rendezvous, while other parts of the
system are doing things like tlb shootdowns with a different mutex.
This patch uses the common smp_rendezvous mutex for all MD home-grown
IPIs that spinwait for delivery. Having the single mutex means that
the spinloop to aquire it will enable interrupts periodically, thus
avoiding the cross-ipi deadlock.
Obtained from: dwhite, alc
Reviewed by: jhb
in the shutdown_final state if the RB_NOSYNC flag is set.
The specific motivation in this case is that a system panic in an
interrupt context results in a call to module_shutdown(), which
calls g_modevent(), which calls g_malloc(..., M_WAITOK), which
results in a second panic. While g_modevent() could be fixed to
not call malloc() for MOD_SHUTDOWN events (which it doesn't handle
in any case), it is probably also a good idea to entirely skip the
execution of the module shutdown handlers after a panic.
This may be a MFC candidate for RELENG_5.
shutdown_pre_sync state if the RB_NOSYNC flag is set. This is the
likely cause of hangs after a system panic that are keeping crash
dumps from being done.
This is a MFC candidate for RELENG_5.
MFC after: 3 days
sockets are connection-oriented for the purposes of kqueue
registration. Since UDP sockets aren't connection-oriented, this
appeared to break a great many things, such as RPC-based
applications and services (i.e., NFS). Since jmg isn't around I'm
backing this out before too many more feet are shot, but intend to
investigate the right solution with him once he's available.
Apologies to: jmg
Discussed with: imp, scottl
is an effective band-aid for at least some of the scheduler corruption seen
recently. The real fix will involve protecting threads while they are
inconsistent, and will come later.
Submitted by: julian
If the bioq is empty, NULL is returned. Otherwise the front element
is removed and returned.
This can simplify locking in many drivers from:
lock()
bp = bioq_first(bq);
if (bp == NULL) {
unlock()
return
}
bioq_remove(bp, bq)
unlock
to:
lock()
bp = bioq_takefirst(bq);
unlock()
if (bp == NULL)
return;
have been unified with that of msleep(9), further refine the sleepq
interface and consolidate some duplicated code:
- Move the pre-sleep checks for theaded processes into a
thread_sleep_check() function in kern_thread.c.
- Move all handling of TDF_SINTR to be internal to subr_sleepqueue.c.
Specifically, if a thread is awakened by something other than a signal
while checking for signals before going to sleep, clear TDF_SINTR in
sleepq_catch_signals(). This removes a sched_lock lock/unlock combo in
that edge case during an interruptible sleep. Also, fix
sleepq_check_signals() to properly handle the condition if TDF_SINTR is
clear rather than requiring the callers of the sleepq API to notice
this edge case and call a non-_sig variant of sleepq_wait().
- Clarify the flags arguments to sleepq_add(), sleepq_signal() and
sleepq_broadcast() by creating an explicit submask for sleepq types.
Also, add an explicit SLEEPQ_MSLEEP type rather than a magic number of
0. Also, add a SLEEPQ_INTERRUPTIBLE flag for use with sleepq_add() and
move the setting of TDF_SINTR to sleepq_add() if this flag is set rather
than sleepq_catch_signals(). Note that it is the caller's responsibility
to ensure that sleepq_catch_signals() is called if and only if this flag
is passed to the preceeding sleepq_add(). Note that this also removes a
sched_lock lock/unlock pair from sleepq_catch_signals(). It also ensures
that for an interruptible sleep, TDF_SINTR is always set when
TD_ON_SLEEPQ() is true.
lock is not held.
Rather than annotating that the lock is released after calls to
unp_detach() with a comment, annotate with an assertion.
Assert that the UNIX domain socket subsystem lock is not held when
unp_externalize() and unp_internalize() are called.
and can lead to two threads being granted exclusive access. Check that no one
has the same lock in exclusive mode before proceeding to acquire it.
The LK_WANT_EXCL and LK_WANT_UPGRADE bits act as mini-locks and can block
other threads. Normally this is not a problem since the mini locks are
upgraded to full locks and the release of the locks will unblock the other
threads. However if a thread reset the bits without obtaining a full lock
other threads are not awoken. Add missing wakeups for these cases.
PR: kern/69964
Submitted by: Stephan Uphoff <ups at tree dot com>
Very good catch by: Stephan Uphoff <ups at tree dot com>
before dereferencing sotounpcb() and checking its value, as so_pcb
is protected by protocol locking, not subsystem locking. This
prevents races during close() by one thread and use of ths socket
in another.
unp_bind() now assert the UNP lock, and uipc_bind() now acquires
the lock around calls to unp_bind().
- pipespace is now able to resize non-empty pipes; this allows
for many more resizing opportunities
- Backing is no longer pre-allocated for the reverse direction
of pipes. This direction is rarely (if ever) used, so this cuts the
amount of map space allocated to a pipe in half.
- Pipe growth is now much more dynamic; a pipe will now grow when
the total amount of data it contains and the size of the write are
larger than the size of pipe. Previously, only individual writes greater
than the size of the pipe would cause growth.
- In low memory situations, pipes will now shrink during both read
and write operations, where possible. Once the memory shortage
ends, the growth code will cause these pipes to grow back to an appropriate
size.
- If the full PIPE_SIZE allocation fails when a new pipe is created, the
allocation will be retried with SMALL_PIPE_SIZE. This helps to deal
with the situation of a fragmented map after a low memory period has
ended.
- Minor documentation + code changes to support the above.
In total, these changes increase the total number of pipes that
can be allocated simultaneously, drastically reducing the chances that
pipe allocation will fail.
Performance appears unchanged due to dynamic resizing.
Don't count busy buffers before the initial call to sync() and
don't skip the initial sync() if no busy buffers were called.
Always call sync() at least once if syncing is requested. This
defers the "Syncing disks, buffers remaining..." message until
after the initial sync() call and the first count of busy
buffers. This backs out changes in kern_shutdown 1.162.
Print a different message when there are no busy buffers after the
initial sync(), which is now the expected situation.
Print an additional message when syncing has completed successfully
in the unusual situation where the work of syncing was done by
boot().
Uppercase one message to make it consistent with all of the other
kernel shutdown messages.
Discussed with: bde (in a much earlier form, prior to 1.162)
Reviewed by: njl (in an earlier form)
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.
Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).
Reviewed by: green, rwatson (both earlier versions)
attempt to IPI other cpus when entering the debugger in order to stop
them while in the debugger. The default remains to issue the stop;
however, that can result in a hang if another cpu has interrupts disabled
and is spinning, since the IPI won't be received and the KDB will wait
indefinitely. We probably need to add a timeout, but this is a useful
stopgap in the mean time.
Reviewed by: marcel
threads consuming the result of pfind() will not need to check for a NULL
credential pointer or other signs of an incompletely created process.
However, this also means that pfind() cannot be used to test for the
existence or find such a process. Annotate pfind() to indicate that this
is the case. A review of curent consumers seems to indicate that this is
not a problem for any of them. This closes a number of race conditions
that could result in NULL pointer dereferences and related failure modes.
Other related races continue to exist, especially during iteration of the
allproc list without due caution.
Discussed with: tjr, green
connect to, re-check that the local UNIX domain socket hasn't been
closed while we slept, and if so, return EINVAL. This affects the
system running both with and without Giant over the network stack,
and recent ULE changes appear to cause it to trigger more
frequently than previously under load. While here, improve catching
of possibly closed UNIX domain sockets in one or two additional
circumstances. I have a much larger set of related changes in
Perforce, but they require more testing before they can be merged.
One debugging printf is left in place to indicate when such a race
takes place: this is typically triggered by a buggy application
that simultaenously connect()'s and close()'s a UNIX domain socket
file descriptor. I'll remove this at some point in the future, but
am interested in seeing how frequently this is reported. In the
case of Martin's reported problem, it appears to be a result of a
non-thread safe syslog() implementation in the C library, which
does not synchronize access to its logging file descriptor.
Reported by: mbr
migration. Use this in sched_prio() and sched_switch() to stop us from
migrating threads that are in short term sleeps or are runnable. These
extra migrations were added in the patches to support KSE.
- Only set NEEDRESCHED if the thread we're adding in sched_add() is a
lower priority and is being placed on the current queue.
- Fix some minor whitespace problems.
to allow dumping per-thread machine specific notes. On ia64 we use this
function to flush the dirty registers onto the backingstore before we
write out the PRSTATUS notes.
Tested on: alpha, amd64, i386, ia64 & sparc64
Not tested on: arm, powerpc
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL. Otherwise,
accept that another thread got there first, and release the local storage.
Discussed with: jmg
contributed to the transferable load count. This prevents any potential
problems with sched_pin() being used around calls to setrunqueue().
- Change the sched_add() load balancing algorithm to try to migrate on
wakeup. This attempts to place threads that communicate with each other
on the same CPU.
- Don't clear the idle counts in kseq_transfer(), let the cpus do that when
they call sched_add() from kseq_assign().
- Correct a few out of date comments.
- Make sure the ke_cpu field is correct when we preempt.
- Call kseq_assign() from sched_clock() to catch any assignments that were
done without IPI. Presently all assignments are done with an IPI, but I'm
trying a patch that limits that.
- Don't migrate a thread if it is still runnable in sched_add(). Previously,
this could only happen for KSE threads, but due to changes to
sched_switch() all threads went through this path.
- Remove some code that was added with preemption but is not necessary.