dereference curthread. It is called only from critical_{enter,exit}(),
which already dereferences curthread. This doesn't seem to affect SMP
performance in my benchmarks, but improves MySQL transaction throughput
by about 1% on UP on my Xeon.
Head nodding: jhb, bmilekic
an adaptive fashion when adaptive mutexes are enabled. The theory
behind non-adaptive Giant is that Giant will be held for long periods
of time, and therefore spinning waiting on it is wasteful. However,
in MySQL benchmarks which are relatively Giant-free, running Giant
adaptive makes an observable difference on SMP (5% transaction rate
improvement). As such, make adaptive behavior on Giant an option so
it can be more widely benchmarked.
- Push down Giant into shmexit(). (Giant is acquired only if the vmspace
contains shm segments.)
- Eliminate the acquisition of Giant from proc_rwmem().
- Reduce the scope of Giant in exit1(), uncovering the destruction of the
address space.
switch in fork_exit() to before anything else is done (but keep
schedlock for the deadthread check). This means one less
nasty bug if ever in the future whatever might have been called
before the update played with schedlock or critical sections.
Discussed with: tjr
the system" resource limit code: When checking if the caller has superuser
privileges, we should be checking the *real* user, not the *effective*
user. (In general, resource limiting is done based on the real user, in
order to avoid resource-exhaustion-by-setuid-program attacks.)
Now that a SUSER_RUID flag to suser_cred exists, use it here to return
this code to its correct behaviour.
Pointed out by: rwatson
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
the license from /usr/src/COPYRIGHT. Since cvs annotate shows that
this was written by jasone, julian, jhb, peter, bmilekic and obrien.
cvs log shows that many others may have contributed to this file. As
such, go ahead and use the author of 'FreeBSD Project' for this file.
If this is a problem, please notify me.
# this eliminates the last file in the kernel with an indirect reference
# to /usr/src/COPYRIGHT in the kernel. A few more in userland remain.
can't yet be referenced by other threads.
In microbenchmarks, this appears to reduce the cost of
pipe();close();close() on UP by 10%, and SMP by 7%. The vast majority
of the cost of allocating a pipe remains VM magic.
Suggested by: silby
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op. Don't acquire Giant in fstat() system call.
Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.
individual file object implementations can optionally acquire Giant if
they require it:
- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)
Notes:
Giant is still acquired in close() even when closing MPSAFE objects
due to kqueue requiring Giant in the calling closef() code.
Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
of pipe create/destroy pairs from user space with SMP compiled into
the kernel.
The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
test it extensively and so have left Giant over fo_close(). It can
probably be removed given some testing and review.
thread-local pointer, in practice that thread needs to be curthread. If
we're running with INVARIANTS, generate a warning if not. If we have
KDB compiled in, generate a stack trace. This doesn't fire at all in my
local test environment, but could be irritating if it fires frequently
for someone, so there will be motivation to fix things quickly when it
does.
(WITNESS) for code paths that always call uma_zalloc_arg() shortly
after where the check was, because uma_zalloc_arg() already does
a similar check.
No objections from Alfred. Thanks Alfred.
work very infrequently, and often results in a compound panic which
confuses debugging; locking/SMP have made the layering violation (and
risks) of this more obvious over time.
Discussed with: green, bde, et al.
to not get a page fault if he has not defined a dump device.
Panic can often not do a dump as it can hang forever in some cases.
The original PR was for amd64 only. This is a generalised version of
that change.
PR: amd64/67712
Submitted by: wjw@withagen.nl <Willen Jan Withagen>
improved chance of working despite pressure from running programs.
Instead of trying to throw a bunch of pages out to swap and hope for
the best, only a range that can potentially fulfill contigmalloc(9)'s
request will have its contents paged out (potentially, not forcibly)
at a time.
The new contigmalloc operation still operates in three passes, but it
could potentially be tuned to more or less. The first pass only looks
at pages in the cache and free pages, so they would be thrown out
without having to block. If this is not enough, the subsequent passes
page out any unwired memory. To combat memory pressure refragmenting
the section of memory being laundered, each page is removed from the
systems' free memory queue once it has been freed so that blocking
later doesn't cause the memory laundered so far to get reallocated.
The page-out operations are now blocking, as it would make little sense
to try to push out a page, then get its status immediately afterward
to remove it from the available free pages queue, if it's unlikely to
have been freed. Another change is that if KVA allocation fails, the
allocated memory segment will be freed and not leaked.
There is a sysctl/tunable, defaulting to on, which causes the old
contigmalloc() algorithm to be used. Nonetheless, I have been using
vm.old_contigmalloc=0 for over a month. It is safe to switch at
run-time to see the difference it makes.
A new interface has been used which does not require mapping the
allocated pages into KVA: vm_page.h functions vm_page_alloc_contig()
and vm_page_release_contig(). These are what vm.old_contigmalloc=0
uses internally, so the sysctl/tunable does not affect their operation.
When using the contigmalloc(9) and contigfree(9) interfaces, memory
is now tracked with malloc(9) stats. Several functions have been
exported from kern_malloc.c to allow other subsystems to use these
statistics, as well. This invalidates the BUGS section of the
contigmalloc(9) manpage.
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theorotically this would lead to runtime being allocated to the wrong
entity in some cases though it is not clear how often this actually happenned.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)
Reviewed by: peter
time now to break with the past: do not write the PID in the first note.
Rationale:
1. [impact of the breakage] Process IDs in core files serve no immediate
purpose to the debugger itself. They are only useful to relate a core
file to a process. This can provide context to the person looking at
the core file, provided one keeps track of this. Overall, not having
the PID in the core file is only in very rare occasions unfortunate.
2. [reason of the breakage] Having one PRSTATUS note contain the PID,
while all others contain the LWPID of the corresponding kernel thread
creates an irregularity for the debugger that cannot easily be worked
around. This is caused by libthread_db correlating user thread IDs to
kernel thread (aka LWP) IDs and thus aware of the actual LWPIDs.
Update comments accordingly.
that get certain types of control messages (ping6 and rtsol are
examples). This gets the new code closer to working:
1) Collect control mbufs for processing in the controlp ==
NULL case, so that they can be freed by externalize.
2) Loop over the list of control mbufs, as the externalize
function may not know how to deal with chains.
3) In the case where there is no externalize function,
remember to add the control mbuf to the controlp list so
that it will be returned.
4) After adding stuff to the controlp list, walk to the
end of the list of stuff that was added, incase we added
a chain.
This code can be further improved, but this is enough to get most
things working again.
Reviewed by: rwatson
NO_ADAPTIVE_MUTEXES. This option has been enabled by default on amd64 for
quite some time, and has been extensively tested on i386 and sparc64. It
shows measurable performance gains in many circumstances, and few negative
effects. It would be nice in t he future if adaptive mutexes actually went
to sleep after a certain amount of spinning, but that will require quite a
bit more testing.
earlier in unp_connect() so that vp->v_socket can't change between
our copying its value to a local variable and later use of that
variable. This may have been responsible for a panic during
shutdown that I experienced where simultaneous closing of a listen
socket by rpcbind and a new connection being made to rpcbind by
mountd.
values from either user land or from the kernel. Use them for
[gs]etsockopt and to clean up some calls to [gs]etsockopt in the
Linux emulation code that uses the stackgap.
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).
future:
rename ttyopen() -> tty_open() and ttyclose() -> tty_close().
We need the ttyopen() and ttyclose() for the new generic cdevsw
functions for tty devices in order to have consistent naming.
for unknown events.
A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
line as the announcement. Someone should probably update the "buffers
remaining" message since we now no longer should have any buffers remaining
at that point.
Until now the function has just returned zero for any event, but
that is downright wrong for MOD_UNLOAD and not very useful for any
future events we add where it may be crucial to be able to tell
if the event was unhandled or successful.
Change the function to return as follows:
MOD_LOAD -> 0
MOD_UNLOAD -> EBUSY
anything else -> EOPNOTSUPP
check to ensure that the caller is not prison root.
The intention is to fix file descriptor creation so that
prison root can not use the last remaining file descriptors.
This privilege should be reserved for non-jailed root users.
Approved by: bmilekic (mentor)
sched_add() rather than just doing it in sched_wakeup(). The old
ithread preemption code used to set NEEDRESCHED unconditionally if it
didn't preempt which masked this bug in SCHED_4BSD.
Noticed by: jake
Reported by: kensmith, marcel
Add a MOD_QUIESCE event for modules. This should return error (EBUSY)
of the module is in use.
MOD_UNLOAD should now only fail if it is impossible (as opposed to
inconvenient) to unload the module. Valid reasons are memory references
into the module which cannot be tracked down and eliminated.
When kldunloading, we abandon if MOD_UNLOAD fails, and if -force is
not given, MOD_QUIESCE failing will also prevent the unload.
For backwards compatibility, we treat EOPNOTSUPP from MOD_QUIESCE as
success.
Document that modules should return EOPNOTSUPP for unknown events.
1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
thread current user thread is running on. Add tm_dflags into
kse_thr_mailbox, the flags is written by debugger, it tells
UTS and kernel what should be done when the process is being
debugged, current, there two flags TMDF_SSTEP and TMDF_DONOTRUNUSER.
TMDF_SSTEP is used to tell kernel to turn on single stepping,
or turn off if it is not set.
TMDF_DONOTRUNUSER is used to tell kernel to schedule upcall
whenever possible, to UTS, it means do not run the user thread
until debugger clears it, this behaviour is necessary because
gdb wants to resume only one thread when the thread's pc is
at a breakpoint, and thread needs to go forward, in order to
avoid other threads sneak pass the breakpoints, it needs to remove
breakpoint, only wants one thread to go. Also, add km_lwp to
kse_mailbox, the lwp id is copied to kse_thr_mailbox at context
switch time when process is not being debugged, so when process
is attached, debugger can map kernel thread to user thread.
2. Add p_xthread to proc strcuture and td_xsig to thread structure.
p_xthread is used by a thread when it wants to report event
to debugger, every thread can set the pointer, especially, when
it is used in ptracestop, it is the last thread reporting event
will win the race. Every thread has a td_xsig to exchange signal
with debugger, thread uses TDF_XSIG flag to indicate it is reporting
signal to debugger, if the flag is not cleared, thread will keep
retrying until it is cleared by debugger, p_xthread may be
used by debugger to indicate CURRENT thread. The p_xstat is still
in proc structure to keep wait() to work, in future, we may
just use td_xsig.
3. Add TDF_DBSUSPEND flag, the flag is used by debugger to suspend
a thread. When process stops, debugger can set the flag for
thread, thread will check the flag in thread_suspend_check,
enters a loop, unless it is cleared by debugger, process is
detached or process is existing. The flag is also checked in
ptracestop, so debugger can temporarily suspend a thread even
if the thread wants to exchange signal.
4. Current, in ptrace, we always resume all threads, but if a thread
has already a TDF_DBSUSPEND flag set by debugger, it won't run.
Encouraged by: marcel, julian, deischen
1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
thread current user thread is running on. Add tm_dflags into
kse_thr_mailbox, the flags is written by debugger, it tells
UTS and kernel what should be done when the process is being
debugged, current, there two flags TMDF_SSTEP and TMDF_DONOTRUNUSER.
TMDF_SSTEP is used to tell kernel to turn on single stepping,
or turn off if it is not set.
TMDF_DONOTRUNUSER is used to tell kernel to schedule upcall
whenever possible, to UTS, it means do not run the user thread
until debugger clears it, this behaviour is necessary because
gdb wants to resume only one thread when the thread's pc is
at a breakpoint, and thread needs to go forward, in order to
avoid other threads sneak pass the breakpoints, it needs to remove
breakpoint, only wants one thread to go. Also, add km_lwp to
kse_mailbox, the lwp id is copied to kse_thr_mailbox at context
switch time when process is not being debugged, so when process
is attached, debugger can map kernel thread to user thread.
2. Add p_xthread to proc strcuture and td_xsig to thread structure.
p_xthread is used by a thread when it wants to report event
to debugger, every thread can set the pointer, especially, when
it is used in ptracestop, it is the last thread reporting event
will win the race. Every thread has a td_xsig to exchange signal
with debugger, thread uses TDF_XSIG flag to indicate it is reporting
signal to debugger, if the flag is not cleared, thread will keep
retrying until it is cleared by debugger, p_xthread may be
used by debugger to indicate CURRENT thread. The p_xstat is still
in proc structure to keep wait() to work, in future, we may
just use td_xsig.
3. Add TDF_DBSUSPEND flag, the flag is used by debugger to suspend
a thread. When process stops, debugger can set the flag for
thread, thread will check the flag in thread_suspend_check,
enters a loop, unless it is cleared by debugger, process is
detached or process is existing. The flag is also checked in
ptracestop, so debugger can temporarily suspend a thread even
if the thread wants to exchange signal.
4. Current, in ptrace, we always resume all threads, but if a thread
has already a TDF_DBSUSPEND flag set by debugger, it won't run.
Encouraged by: marcel, julian, deischen
pmap_remove_pages(). (The implementation of pmap_remove_pages() is
optional. If pmap_remove_pages() is unimplemented, the acquisition and
release of the page queues lock is unnecessary.)
Remove spl calls from the alpha, arm, and ia64 pmap_remove_pages().
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.
Approved by: alfred
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
will be used heavily in debugging KSE threads. This breaks libpthread
on IA64, but because libpthread was not in 5.2.1 release, I would like
to change it so we needn't to introduce another syscall.
tracing process to obtain information about the LWP that caused the
traced process to stop. Debuggers can use this information to select
the thread currently running on the LWP as the current thread.
The request has been made compatible with NetBSD for as much as
possible. This implementation differs from NetBSD in the following
ways:
1. The data argument is allowed to be smaller than the size of the
ptrace_lwpinfo structure known to the kernel, but not 0. This
is opposite to what NetBSD allows. The reason for this is that
we can extend the structure without affecting older binaries.
2. On NetBSD the tracing process is to set the pl_lwpid field to
the Id of the LWP it wants information of. We don't do that.
Our ptrace interface allows passing the LWP Id instead of the
PID. The tracing process is to set the PID to the LWP Id it
wants information of.
3. When the PID is actually the PID of the tracing process, this
request returns the information about the LWP that caused the
process to stop. This was the whole purpose of the request in
the first place.
When the traced process has exited, this request will return the
LWP Id 0, indicating that the process state is not the result of
an event specific to a LWP.
changing the backend from outside the KDB frontend. For example from
within a backend. Rewrite kdb_sysctl_current to make use of this
function as well.
in soreceive() after removing an MT_SONAME mbuf from the head of the
socket buffer.
When processing MT_CONTROL mbufs in soreceive(), first remove all of
the MT_CONTROL mbufs from the head of the socket buffer to a local
mbuf chain, then feed them into dom_externalize() as a set, which
both avoids thrashing the socket buffer lock when handling multiple
control mbufs, and also avoids races with other threads acting on
the socket buffer when the socket buffer mutex is released to enter
the externalize code. Existing races that might occur if the protocol
externalize method blocked during processing have also been closed.
Now that we synchronize socket buffer and stack state following
modifications to the socket buffer, turn the manual synchronization
that previously followed control mbuf processing with a set of
assertions. This can eventually be removed.
The soreceive() code is now substantially more MPSAFE.
the head of the mbuf chains in a socket buffer, re-synchronizes the
cache pointers used to optimize socket buffer appends. This will be
used by soreceive() before dropping socket buffer mutexes to make sure
a consistent version of the socket buffer is visible to other threads.
While here, update copyright to account for substantial rewrite of much
socket code required for fine-grained locking.
locking on 'nextrecord' and concerns regarding potentially inconsistent
or stale use of socket buffer or stack fields if they aren't carefully
synchronized whenever the socket buffer mutex is released. Document
that the high-level sblock() prevents races against other readers on
the socket.
Also document the 'type' logic as to how soreceive() guarantees that
it will only return one of normal data or inline out-of-band data.
name in the debug.kdb.current sysctl. All other dereferences are
properly guarded, but this one was overlooked.
Reported by: Morten Rodal (morten at rodal dot no)
associated with a PR_ADDR protocol, make sure to update the m_nextpkt
pointer of the new head mbuf on the chain to point to the next record.
Otherwise, when we release the socket buffer mutex, the socket buffer
mbuf chain may be in an inconsistent state.
o Make debugging code conditional upon KDB instead of DDB.
o s/WITNESS_DDB/WITNESS_KDB/g
o s/witness_ddb/witness_kdb/g
o Rename the debug.witness_ddb sysctl to debug.witness_kdb.
o Call kdb_backtrace() instead of backtrace().
o Call kdb_enter() instead Debugger().
o Assert kdb_active instead of db_active.
o Make debugging code conditional upon KDB instead of DDB.
o Call kdb_enter() instead of Debugger().
o Call kdb_backtrace() instead of db_print_backtrace() or backtrace().
kern_mutex.c:
o Replace checks for db_active with checks for kdb_active and make
them unconditional.
kern_shutdown.c:
o s/DDB_UNATTENDED/KDB_UNATTENDED/g
o s/DDB_TRACE/KDB_TRACE/g
o Save the TID of the thread doing the kernel dump so the debugger
knows which thread to select as the current when debugging the
kernel core file.
o Clear kdb_active instead of db_active and do so unconditionally.
o Remove backtrace() implementation.
kern_synch.c:
o Call kdb_reenter() instead of db_error().
in which multiple (presumably different) debugger backends can be
configured and which provides basic services to those backends.
Besides providing services to backends, it also serves as the single
point of contact for any and all code that wants to make use of the
debugger functions, such as entering the debugger or handling of the
alternate break sequence. For this purpose, the frontend has been
made non-optional.
All debugger requests are forwarded or handed over to the current
backend, if applicable. Selection of the current backend is done by
the debug.kdb.current sysctl. A list of configured backends can be
obtained with the debug.kdb.available sysctl. One can enter the
debugger by writing to the debug.kdb.enter sysctl.
Add copyiniov() which copies a struct iovec array in from userland into
a malloc'ed struct iovec. Caller frees.
Change uiofromiov() to malloc the uio (caller frees) and name it
copyinuio() which is more appropriate.
Add cloneuio() which returns a malloc'ed copy. Caller frees.
Use them throughout.
assigning a pointer to the list and then dereferencing the pointer as a
second step. When the first spin lock is acquired, curthread is not in
a critical section so it may be preempted and would end up using another
CPUs lock list instead of its own.
When this code was in witness_lock() this sequence was safe as curthread
was in a critical section already since witness_lock() is called after the
lock is acquired.
Tested by: Daniel Lang dl at leo.org
takes an argument to specify if it should preempt or not. Don't preempt
when sched_add_internal() is called from kseq_idled() or kseq_assign()
as in those cases we are about to call mi_switch() anyways. Also, doing
so during the first context switch on an AP leads to a NULL pointer deref
because curthread is NULL.
- Reenable preemption for ULE.
Submitted by: Taku YAMAMOTO taku at tackymt.homeip.net
When avoiding the zeroing of "bogus_page" when it appears in a buf,
be sure to advance the pointers into the data for successive pages.
The bug caused file corruption when read(2)ing from a "hole" in a
file where a previous page of the read block had already been faulted
in: fsx tripped up on this pretty quickly. The particular access
pattern is probably pretty unusual, so other applications probably
wouldn't have had problems, but you'd never know.
Reviewed By: alc@
Rebind the client socket when we experience a timeout. This fixes
the case where our IP changes for some reason.
Signal a VFS event when NFS transitions from up to down and vice
versa.
Add a placeholder vfs_sysctl where we will put status reporting
shortly.
Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.
hangs due to recent preemption changes. This change appears to remove
the panic that I was running into, but at the cost of increasing
ithread scheduling latency, and as such is a temporary band-aid until
jhb has a chance to resolve the ule<->preemption interaction that is
the source of the problem. If it doesn't fix the problem for others--
sorry!
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.
Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.
Tested by: phk
Speed up the syncer when shutting down by sleeping for a shorter
period of time instead of cranking up rushjob and using the
normal one second sleep.
Skip empty worklist slots when shutting down to avoid lengthy
intervals of inactivity.
Give I/O more time to complete between steps by not speeding the
syncer quite as much.
Terminate the syncer after one full pass through the worklist
plus one second with the worklist containing nothing but syncer
vnodes.
Print an indication of shutdown progress to the console.
Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.
around in the vnodes surroundings when we allocate a block.
Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.
Please email all such warnings to me.
generic filesystem events to userspace. Currently only mount and unmount
of filesystems are signalled. Soon to be added, up/down status of NFS.
Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.
Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
insmntqueue(vp, mp = NULL) -> KASSERT -> panic
Make getnewvnode() only call insmntqueue() if the mountpoint parameter
is not NULL.
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.
Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:
MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);
The code which takes vnodes off a mountpoint looks like this:
MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;
(Take a moment and try to spot the locking error before you read on.)
On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.
Fix:
Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.
Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.
Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
introduced a KSE_CAN_MIGRATE() invocation with one argument
missing (class). Either this is a genuine forget or it crept
in from JHB's repo where he may have modified it. If it's
the latter then it may require more attention. For now fix
the make depend.
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switch to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.
This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.
Approved by: scottl (with his re@ hat)
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to choose to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.
lock and sched_lock so they can be read with either lock held. Document
the locking as well. The one remaining bogosity is that pr_addr and
pr_ticks should be per-thread but profiling of multithreaded apps is
currently undefined.
stats structure and a vmspace as this should always be true rather
than checking the always true condition in an if statement.
- Remove never-false check: if ((ru = &pstats->p_ru) != NULL)
- Remove pstats variable that is only used once and inline its one use
instead.
pointer to the corresponding struct thread to the thread ID (lwpid_t)
assigned to that thread. The primary reason for this change is that
libthr now internally uses the same ID as the debugger and the kernel
when referencing to a kernel thread. This allows us to implement the
support for debugging without additional translations and/or mappings.
To preserve the ABI, the 1:1 threading syscalls, including the umtx
locking API have not been changed to work on a lwpid_t. Instead the
1:1 threading syscalls operate on long and the umtx locking API has
not been changed except for the contested bit. Previously this was
the least significant bit. Now it's the most significant bit. Since
the contested bit should not be tested by userland, this change is
not expected to be visible. Just to be sure, UMTX_CONTESTED has been
removed from <sys/umtx.h>.
Reviewed by: mtm@
ABI preservation tested on: i386, ia64
faster and iterate to over its work list a few times in an attempt
to empty the work list before the syncer terminates. This leaves
fewer dirty blocks to be written at the "syncing disks" stage and
keeps the the "giving up on N buffers" problem from being triggered
by the presence of a large soft updates work list at system shutdown
time. The downside is that the syncer takes noticeably longer to
terminate.
Tested by: "Arjan van Leeuwen" <avleeuwen AT piwebs DOT com>
Approved by: mckusick
devremoved events. This reduces the races around these events. We
now include the pnp info in both. This lets one do more interesting
thigns with devd on device insertion.
Submitted by: Bernd Walter
creation of the sysctl tree for the turnstile profiling stats until a
SI_SUB_LOCK sysinit. Doing it in init_turnstiles() is too early as it is
called before mi_startup().
hash tables used in the sleep queue and turnstile code. Each option adds
a sysctl tree under debug containing the maximum depth of any bucket in
the hash table as well as a separate node for each bucket (or chain)
containing the current depth and maximum depth for that bucket.
the queue has been removed from the global taskqueue_queues list. This
removes the need for the draining queue hack.
- Allow taskqueue_run() to be called with the taskqueue mutex held. It
can still be called without the lock for API compatiblity. In that case
it will acquire the lock internally.
- Don't lock the individual queue mutex in taskqueue_find() until after the
strcmp as the global queues mutex is sufficient for the strcmp.
- Simplify taskqueue_thread_loop() now that it can hold the lock across
taskqueue_run().
Submitted by: bde (mostly)
so_gencnt, numopensockets, and the per-socket field so_gencnt. Annotate
this this might be better done with atomic operations.
Annotate what accept_mtx protects.
associated with performing a wakeup on the socket buffer:
- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().
- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() which adjusting
back-pressure on the sockets.
For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.
The overhead of unconditionally allocating TIDs (and likewise,
unconditionally deallocating them), is amortized across multiple
thread creations by the way UMA makes it possible to have type-stable
storage.
Previously the cost was kept down by having threads created as part
of a fork operation use the process' PID as the TID. While this had
some nice properties, it also introduced complexity in the way TIDs
were allocated. Most importantly, by using the type-stable storage
that UMA gives us this was also unnecessary.
This change affects how core dumps are created and in particular how
the PRSTATUS notes are dumped. Since we don't have a thread with a
TID equalling the PID, we now need a different way to preserve the
old and previous behavior. We do this by having the given thread (i.e.
the thread passed to the core dump code in td) dump it's state first
and fill in pr_pid with the actual PID. All other threads will have
pr_pid contain their TIDs. The upshot of all this is that the debugger
will now likely select the right LWP (=TID) as the initial thread.
Credits to: julian@ for spotting how we can utilize UMA.
Thanks to: all who provided julian@ with test results.
copies.
No current line disciplines have a dynamically changing hotchar, and
expecting to receive anything sensible during a change in ldisc is
insane so no locking of the hotchar field is necessary.
we have to revert to TTYDISC which we know will successfully open
rather than try the previous ldisc which might also fail to open.
Do not let ldisc implementations muck about with ->t_line, and remove
code which checks for reopens, it should never happen.
Move ldisc->l_hotchar to tty->t_hotchar and have ldisc implementation
initialize it in their open routines. Reset to zero when we enter
TTYDISC. ("no" should really be -1 since zero could be a valid
hotchar for certain old european mainframe protocols.)
older API to list attributes on a file (zero-length attribute name)
to function. extattr_list_*() are now the only available APIs to
use when listing attributes.
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
the SS_NBIO flag from the parent socket to the child socket during an
accept() operation.
The file descriptor O_NONBLOCK flag would have been propagated already
by the fflag assignment, and therefore would have been inconsistent
with the underlying socket's so_state member.
This makes accept() more closely adhere to the API contract we effectively
outline in the manual page. Note also that Linux continues to differ here;
O_NONBLOCK is not propagated. The other BSDs do propagate the flag, as
does Solaris. The Single UNIX Specification does not offer specific
advice on this issue.
PR: kern/45733
Requested by: Jayanth Vijayaraghavan
Reviewed by: rwatson
* Obtain/release schedlock around calls to calcru.
* Sort switch cases which do not cascade per style(9).
* Sort local variables per style(9).
* Remove "superfluous" whitespace.
* Cleanup handling of NULL uap->tp in clock_getres(). It would probably
be better to return EFAULT like clock_gettime() does by passing the
pointer to copyout(), but I presume it was written to not fail on
purpose in the original code. I'll defer to -standards on this one.
Reported by: bde
This is not really used by the process but it's confusing to some
status readers to see zombie processes the "runnin" threads.
Pointed out by: Don Lewis <truckman@FreeBSD.org>
where it is known to detect a problem but the problem is not very easy
to fix. The warning became very common recently after a call to calcru()
was added to fill_kinfo_thread().
Another (much older) cause of "negative times" (actually non-monotonic
times) was fixed in rev.1.237 of kern_exit.c.
Print separate messages for non-monotonic and negative times.
from exit1(). sched_exit() must be called unconditionally from exit1().
It was called almost unconditionally because the only exits on system
shutdown if at all.
(2) Removed the comment that presumed to know what sched_exit() does.
sched_exit() does different things for the ULE case. The call became
essential when it started doing load average stuff, but its caller
should not know that.
(3) Didn't fix bugs caused by bitrot in the condition. The condition was
last correct in rev.1.208 when it was in wait1(). There p was spelled
curthread->td_proc and was for the waiting parent; now p is for the
exiting child. The condition was to avoid lowering init's priority.
It should be in sched_exit() itself. Lowering of priorities is broken
in other ways in at least the 4BSD scheduler, and doing it for init
causes less noticeable problems than doing it for for shells.
Noticed by: julian (1)
- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.
- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.
- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.
- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.
- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of
sbflush().
- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().
- Assert the socket buffer lock in SBLINKRECORD().
- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.
- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().
- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.
- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().
- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().
- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls
sbrelease_locked().
- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.
- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.
Some parts of this change were:
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
socket lock over pulling so_options and so_linger out of the socket
structure in order to retrieve a consistent snapshot. This may be
overkill if user space doesn't require a consistent snapshot.
resolved by socket locking: in particular, that we test the connection
state at the socket layer without locking, request that the protocol
begin listening, and then set the listen state on the socket
non-atomically, resulting in a non-atomic cross-layer test-and-set.
to vm_map_find() that is less likely to be outside of addressable memory
for 32-bit processes: just past the end of the largest possible heap.
This is the same hint that mmap() uses.
ki_childutime, and ki_emul. Also uses the timevaladd() routine to
correct the calculation of ki_childtime. That will correct the value
returned when ki_childtime.tv_usec > 1,000,000.
This also implements a new KERN_PROC_GID option for kvm_getprocs().
(there will be a similar update to lib/libkvm/kvm_proc.c)
Submitted by: Cyrille Lefevre
lock state. Convert tsleep() into msleep() with socket buffer mutex
as argument. Hold socket buffer lock over sbunlock() to protect sleep
lock state.
Assert socket buffer lock in sbwait() to protect the socket buffer
wait state. Convert tsleep() into msleep() with socket buffer mutex
as argument.
Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation. Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy. Assert the socket buffer mutex
strategically after code sections or at the beginning of loops. In
some cases, modify return code to ensure locks are properly dropped.
Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.
Drop some spl use.
NOTE: Some races exist in the current structuring of sosend() and
soreceive(). This commit only merges basic socket locking in this
code; follow-up commits will close additional races. As merged,
these changes are not sufficient to run without Giant safely.
Reviewed by: juli, tjr
output to permanently (not ephemerally) go to the console. It is also
sent to any other console specified by TIOCCONS as normal.
While I'm here, document the kern.log_console_output sysctl.
be suspended in thread_suspend_check, after they are resumed, all
threads will call thread_single, but only one can be success,
others should retry and will exit in thread_suspend_check.
rwatson_netperf:
Introduce conditional locking of the socket buffer in fifofs kqueue
filters; KNOTE() will be called holding the socket buffer locks in
fifofs, but sometimes the kqueue() system call will poll using the
same entry point without holding the socket buffer lock.
Introduce conditional locking of the socket buffer in the socket
kqueue filters; KNOTE() will be called holding the socket buffer
locks in the socket code, but sometimes the kqueue() system call
will poll using the same entry points without holding the socket
buffer lock.
Simplify the logic in sodisconnect() since we no longer need spls.
NOTE: To remove conditional locking in the kqueue filters, it would
make sense to use a separate kqueue API entry into the socket/fifo
code when calling from the kqueue() system call.
- Lock down low hanging fruit use of sb_flags with socket buffer
lock.
- Lock down low hanging fruit use of so_state with socket lock.
- Lock down low hanging fruit use of so_options.
- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
socket buffer lock.
- Annotate situations in which we unlock the socket lock and then
grab the receive socket buffer lock, which are currently actually
the same lock. Depending on how we want to play our cards, we
may want to coallesce these lock uses to reduce overhead.
- Convert a if()->panic() into a KASSERT relating to so_state in
soaccept().
- Remove a number of splnet()/splx() references.
More complex merging of socket and socket buffer locking to
follow.
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
other modules to explode. eg: snd_ich->snd_pcm and umass->usb.
The problem was that I was using the unified base address of the module
instead of finding the start address of the section in question.
success and a proper errno value on failure. This makes it
consistent with cv_timedwait(), and paves the way for the
introduction of functions such as sema_timedwait_sig() which can
fail in multiple ways.
Bump __FreeBSD_version and add a note to UPDATING.
Approved by: scottl (ips driver), arch
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:
SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)
Rename respectively to:
SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)
This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
SOCK_LOCK(so):
- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.
- Assert socket lock in MAC entry point implementations.
- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.
order definition for witness. Send lock before receive lock, and
socket locks after accept but before select:
filedesc -> accept -> so_snd -> so_rcv -> sellck
All routing locks after send lock:
so_rcv -> radix node head
All protocol locks before socket locks:
unp -> so_snd
udp -> udpinp -> so_snd
tcp -> tcpinp -> so_snd
reference count:
- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().
- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().
- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.
- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.
- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
protect fields in the socket buffer. Add accessor macros to use the
mutex (SOCKBUF_*()). Initialize the mutex in soalloc(), and destroy
it in sodealloc(). Add addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.
Submitted by: sam
Sponosored by: FreeBSD Foundation
Obtained from: BSD/OS
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
(time grows downward)
thread 1 thread 2
------------|------------
dec ref_cnt |
| dec ref_cnt <-- ref_cnt now zero
cmpset |
free all |
return |
|
alloc again,|
reuse prev |
ref_cnt |
| cmpset, read
| already freed
| ref_cnt
------------|------------
This should fix that by performing only a single
atomic test-and-set that will serve to decrement
the ref_cnt, only if it hasn't changed since the
earlier read, otherwise it'll loop and re-read.
This forces ordering of decrements so that truly
the thread which did the LAST decrement is the
one that frees.
This is how atomic-instruction-based refcnting
should probably be handled.
Submitted by: Julian Elischer
Add two new functions: ttyref() and ttyrel(). ttymalloc() creates a struct
tty with a reference count of one. when ttyrel sees the count go to zero,
struct tty is freed.
Hold references for open ttys and for ttys which are controlling terminal
for sessions.
Until drivers start using ttyrel(), this commit will make no difference.
when I reordered events in accept1() to allocate a file descriptor
earlier, I didn't properly update use of goto on exit to unwind for
cases where the file descriptor is now held, but wasn't previously.
The result was that, in the event of accept() on a non-blocking socket,
or in the event of a socket error, a file descriptor would be leaked.
This ended up being non-fatal in many cases, as the file descriptor
would be properly GC'd on process exit, so only showed up for processes
that do a lot of non-blocking accept() calls, and also live for a long
time (such as qmail).
This change updates the use of goto targets to do additional unwinding.
Eyes provided by: Brian Feldman <green@freebsd.org>
Feet, hands provided by: Stefan Ehmann <shoesoft@gmx.net>,
Dimitry Andric <dimitry@andric.com>
Arjan van Leeuwen <avleeuwen@piwebs.com>