eliminating a second set of identical mutex operations at the bottom.
This allows brief exceeding of the max sockets limit, but only by
sockets in the last stages of being torn down.
- If we fail to register the system call during MOD_LOAD, then note that
so that we don't try to deregister it or invoke the chained event handler
during the subsequent MOD_UNLOAD event. Doing the deregister when the
register failed could result in trashing system call entries.
- Add a SI_SUB_SYSCALLS just before starting up init and use that to
register syscall modules instead of SI_SUB_DRIVERS. Registering system
calls as late as possible increases the chances that any other module
event handlers or SYSINITs in a module are executed to initialize the
data in a kld before a syscall dependent on that data is able to be
invoked.
MFC after: 3 days
longer referenced by other threads (hence our freeing it), we don't need
to set the can't send and can't receive flags, wake up the consumers,
perform two levels of locking, etc. Implement a fast-path teardown,
sbdestroy(), which flushes and releases each socket buffer. A manual
dom_dispose of the receive buffer is still required explicitly to GC
any in-flight file descriptors, etc, before flushing the buffer.
This results in a 9% UP performance improvement and 16% SMP performance
improvement on a tight loop of socket();close(); in micro-benchmarking,
but will likely also affect CPU-bound macro-benchmark performance.
UNIX domain socket at the same time as the remote host is closing the
new connections as quickly as they open. Since the connect() and
send() paths are non-atomic with respect to another, it is possible
for the second thread's close() call to disconnect the two sockets
as connect() returns, leading to the consumer (which plans to send())
with a NULL kernel pointer to its proposed peer. As a result, after
acquiring the UNIX domain socket subsystem lock, we need to revalidate
the connection pointers even though connect() has technically succeed,
and reurn an error to say that there's no connection on which to
perform the send.
We might want to rethink the specific errno number, perhaps ECONNRESET
would be better.
PR: 100940
Reported by: Young Hyun <youngh at caida dot org>
MFC after: 2 weeks
MFC note: Some adaptation will be required
mark system calls as being MPSAFE:
- Stop conditionally acquiring Giant around system call invocations.
- Remove all of the 'M' prefixes from the master system call files.
- Remove support for the 'M' prefix from the script that generates the
syscall-related files from the master system call files.
- Don't explicitly set SYF_MPSAFE when registering nfssvc.
a count of all non-spin locks, not just lockmgr locks. This can give us a
much cheaper way to see if we have any locks held (such as when returning
to userland via userret()) without requiring WITNESS.
MFC after: 1 week
kern_fstatfs() so that it is still held when prison_enforce_statfs() is
called (since that function likes to poke and prod at the mountpoint
structure).
MFC after: 3 days
all other mtx_lock() operations to block. Previously, when the mutex was
destroyed, it would still have a valid value in mtx_lock(): either the
unowned cookie, which would allow a subsequent mtx_lock() to succeed, or a
pointer to the thread who destroyed the mutex if the mutex was locked when
it was destroyed.
MFC after: 3 days
kern_accept() and accept1(). If another thread closed the new file
descriptor and the first thread later got an error trying to copyout the
socket address, then it would attempt to close the wrong file object. To
fix, add a struct file ** argument to kern_accept(). If it is non-NULL,
then on success kern_accept() will store a pointer to the new file object
there and not release any of the references. It is up to the calling code
to drop the references appropriately (including a call to fdclose() in case
of error to safely handle the aforementioned race). While I'm at it, go
ahead and fix the svr4 streams code to not leak the accept fd if it gets an
error trying to copyout the streams structures.
have been invoked by uipc_close() or uipc_abort(), and the socket is in a
state of being torn down by the time we get to this point, so kqueue
state frobbed by soisdisconnected() is not available, so frobbing it will
result in a panic.
Reported by: Munehiro Matsuda <haro at h4 dot dion dot ne dot jp>
specific routines from uipc_socket2.c following repo-copy. We might
rethink the location of one or two at some point, but the division was
relatively clean. uipc_sockbuf.c is now the home of routines that
manipulate socket buffers.
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).
This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.
Architectural head nod: sam, gnn, wollman
the kernel malloc(9) state for vmstat -m. libmemstat is now used to
generate a machine-readable version which is converged by vmstat -m
into a human-readable version.
Not for MFC.
used to mark UNIX domain sockets as being in the process of binding or
connecting. Use these to prevent simultaneous bind or connect
operations by multiple threads or processes on the same socket at the
same time, which closes race conditions present in the UNIX domain
socket implementation since inception.
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.
This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.
Reviewed by: gnn
that the only remove hook operation that can occur while processing the
hooks is to remove the currently executing hook. This should be safe as
the existing code has assumed this already for a long time now.
Reviewed by: scottl
MFC after: 1 week
don't want to mix process and thread scheduling options together in these
functions, now the thread scheduling option is implemented in new thr
syscalls.
these syscalls are designed to set thread's scheduling parameters and
policy, because each syscall contains a size parameter, it is possible
to support future scheduling option, e.g SCHED_SPORADIC, this option
needs other fields in structure sched_param, current they are not
avaiblable.
install custom pager functions didn't actually happen in practice (they
all just used the simple pager and passed in a local quit pointer). So,
just hardcode the simple pager as the only pager and make it set a global
db_pager_quit flag that db commands can check when the user hits 'q' (or a
suitable variant) at the pager prompt. Also, now that it's easy to do so,
enable paging by default for all ddb commands. Any command that wishes to
honor the quit flag can do so by checking db_pager_quit. Note that the
pager can also be effectively disabled by setting $lines to 0.
Other fixes:
- 'show idt' on i386 and pc98 now actually checks the quit flag and
terminates early.
- 'show intr' now actually checks the quit flag and terminates early.
SCHED_OTHER, the same range as rtprio() is using. In old code,
it returns nice range -20 .. 20, nice should be treated as process
weight, it is really managed by getpriority() and setpriority()
syscalls, they are different.
using sorele() and the full tear-down path. Since protocol state
allocation failed, this is not required (and is arguably undesirable).
This matches the behavior of sonewconn() under the same circumstances.
locks and the unplock during uipc_rcvd() and uipc_send() by caching
certain values from one structure while its locks are held, and
applying them to a second structure while its locks are held. If
done carefully, this should be correct, and will reduce the amount
of work done with the global unp lock held.
Tested by: kris (earlier version)
ibcs2_getdents(), ibcs2_read(), ogetdirentries(), svr4_sys_getdents(),
and svr4_sys_getdents64() similar to that in getdirentries().
- Mark ibcs2_getdents(), ibcs2_read(), linux_getdents(), linux_getdents64(),
linux_readdir(), ogetdirentries(), svr4_sys_getdents(), and
svr4_sys_getdents64() MPSAFE.
done this in ptrace syscall, when a pid is large than PID_MAX, the syscall
will search a thread in current process. It permits 1:1 thread library to
get and set a thread's scheduler attributes.
unless current thread is realtime thread, in such case, we set a new zero
priority for it, notice we don't have per-thread nice, the priority
passed by userland is ignored here.
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat. Specifically, go ahead
and hard code UIO_USERSPACE in the uio as that's what all the callers
specify. In place, add a new uioseg to indicate what type of pointer
is in mp->msg_name. Previously it was always a userland address, but
ABI emulators may pass in kernel-side sockaddrs. Also, remove the
namelenp field and instead require the two places that used it to
explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.
that the 'data' pointer is already setup to point to a valid KVM buffer
or contains the copied-in data from userland as appropriate (ioctl(2)
still does this). kern_ioctl() takes care of looking up a file pointer,
implementing FIONCLEX and FIOCLEX, and calling fi_ioctl().
- Use kern_ioctl() to implement xenix_rdchk() instead of using the stackgap
and mark xenix_rdchk() MPSAFE.
mostly consists of pushing a few copyin's and copyout's up into
__semctl() as all the other callers were already doing the UIO_SYSSPACE
case. This also changes kern_semctl() to set the return value in a passed
in pointer to a register_t rather than td->td_retval[0] directly so that
callers can only set td->td_retval[0] if all the various copyout's succeed.
As a result of these changes, kern_semctl() no longer does copyin/copyout
(except for GETALL/SETALL) so simplify the locking to acquire the semakptr
mutex before the MAC check and hold it all the way until the end of the
big switch statement. The GETALL/SETALL cases have to temporarily drop it
while they do copyin/malloc and copyout. Also, simplify the SETALL case to
remove handling for a non-existent race condition.
to use the hinted child system. Bus drivers that use this need to
implmenet the bus_hinted_child method, where they actually add the
child to their bus, as they see fit. The bus is repsonsible for
getting the attribtues for the child, adding it in the right order,
etc. ISA hinting will be updated to use this method.
MFC After: 3 days
subr_acl_posix1e.c, leaving kern_acl.c containing only ACL system
calls and utility routines common across ACL types.
Add subr_acl_posix1e.c to the build.
Obtained from: TrustedBSD Project
not all known to be MPSAFE yet.
- Actually remove Giant from the kernel linker by taking it out of the
KLD_LOCK() and KLD_UNLOCK() macros.
Pointy hat to: jhb (2)
ibcs2_[gs]etgroups() rather than using the stackgap. This also makes
ibcs2_[gs]etgroups() MPSAFE. Also, it cleans up one bit of weirdness in
the old setgroups() where it allocated an entire credential just so it had
a place to copy the group list into. Now setgroups just allocates a
NGROUPS_MAX array on the stack that it copies into and then passes to
kern_setgroups().
a local 'semid' variable which was the array index and used uap->semid
as the original IPC id. During the kern_semctl() conversion those two
variables were collapsed into a single 'semid' variable breaking the
places that needed the original IPC ID. To fix, add a new 'semidx'
variable to hold the array index and leave 'semid' unmolested as the IPC
id. While I'm here, explicitly document that the (undocumented, at least
in semctl(2)) SEM_STAT command curiously expects an array index in the
'semid' parameter rather than an IPC id.
Submitted by: maxim
to a copied-in copy of the 'union semun' and a uioseg to indicate which
memory space the 'buf' pointer of the union points to. This is then used
in linux_semctl() and svr4_sys_semctl() to eliminate use of the stackgap.
- Mark linux_ipc() and svr4_sys_semsys() MPSAFE.
from going away. mount(2) is now MPSAFE.
- Expand the scope of Giant some in unmount(2) to protect the mp structure
(or rather, to handle concurrent unmount races) from going away.
umount(2) is now MPSAFE, as well as linux_umount() and linux_oldumount().
- nmount(2) and linux_mount() were already MPSAFE.
- For privileged processes safe two mutex operations.
We may want to consider if this is good idea to use SUSER_ALLOWJAIL here,
but for now I didn't wanted to change the original behaviour.
Reviewed by: rwatson
all of the module event handlers are MP safe yet, so always acquire Giant
for now when invoking module event handlers. Eventually we can add an
MPSAFE flag or some such and add appropriate locking to all module event
handlers.
in 1999, and there are changes to the sysctl names compared to PR,
according to that discussion. The description is in sys/conf/NOTES.
Lines in the GENERIC files are added in commented-out form.
I'll attach the test script I've used to PR.
PR: kern/14584
Submitted by: babkin
protect all linker-related data structures including the contents of
linker file objects and the any linker class data as well. Considering how
rarely the linker is used I just went with the simple solution of
single-threading the whole thing rather than expending a lot of effor on
something more fine-grained and complex. Giant is still explicitly
acquired while registering and deregistering sysctl's as well as in the
elf linker class while calling kmupetext(). The rest of the linker runs
without Giant unless it has to acquire Giant while loading files from a
non-MPSAFE filesystem.
- Add a new function linker_release_module() as a more intuitive complement
to linker_reference_module() that wraps linker_file_unload().
linker_release_module() can either take the module name and version info
passed to linker_reference_module() or it can accept the linker file
object returned by linker_reference_module().
file objects calling a user-specified predicate function on each object.
The iteration terminates either when the entire list has been iterated
over or the predicate function returns a non-zero value.
linker_file_foreach() returns the value returned by the last invocation
of the predicate function. It also accepts a void * context pointer that
is passed to the predicate function as well. Using an iterator function
avoids exposing linker internals to the rest of the kernel making locking
simpler.
- Use linker_file_foreach() instead of walking the list of linker files
manually to lookup ndis files in ndis(4).
- Use linker_file_foreach() to implement linker_hwpmc_list_objects().
in setsockopt so that they can be compared correctly against negative
values. Passing in a negative value had a rather negative effect
on our socket code, making it impossible to open new sockets.
PR: 98858
Submitted by: James.Juran@baesystems.com
MFC after: 1 week
It is similar to debug.kdb.trap, except for it tries to cause a page fault
via a call to an invalid pointer. This can highlight differences between
a fault on data access vs. a fault on code call some CPUs might have.
This appeared as a test for a work \
Sponsored by: RiNet (Cronyx Plus LLC)
basically always violated) invariannts of soreceive(), which assume
that the first mbuf pointer in a receive socket buffer can't change
while the SB_LOCK sleepable lock is held on the socket buffer,
which is precisely what these functions do. No current protocols
invoke these functions, and removing them will help discourage them
from ever being used. I should have removed them years ago, but
lost track of it.
MFC after: 1 week
Prodded almost by accident by: peter
frequency, quality and current value of each available time counter.
At the moment all of these are read-only, but it might make sense to
make some of these read-write in the future.
MFC after: 3 months
filesystem agnostic. We are not touching any file system specific functions
in this code path. Since we have a cache lock, there is really no need to
keep Giant around here.
This eliminates Giant acquisitions for any syscall which is auditing pathnames.
Discussed with: jeff
yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish cpu, the ULE and CORE schedulers have two internal run-
queues, a timesharing thread which calls yield() syscall should be
moved to inactive queue.
KASSERT(ke->ke_runq == NULL) panic when the sched_add is recursively
called by maybe_preempt.
Reported by: Wojciech A. Koszek < dunstan at freebsd dot czest dot pl >
we intend for the user to be able to unload them later via kldunload(2)
instead of calling linker_load_module() and then directly adjusting the
ref count on the linker file structure. This makes the resulting
consumer code simpler and cleaner and better hides the linker internals
making it possible to sanely lock the linker.
Giant down in it.
- Push Giant down in kern_kldunload() and reorganize it slightly to avoid
using gotos. Also, expose this function to the rest of the kernel.
- Use a 'struct kld_file_stat' on the stack to read data under the lock
and then do one copyout() w/o holding the lock at the end to push the
data out to userland.
linker_file_unload() instead of in the middle of a bunch of code for
the case of dropping the last reference to improve readability and sanity.
While I'm here, remove pointless goto's that were just jumping to a
return statement.
sockets:
1) A sender sends SCM_CREDS message to a reciever, struct cmsgcred;
2) A reciever sets LOCAL_CREDS socket option and gets sender
credentials in control message, struct sockcred.
Both methods use the same control message type SCM_CREDS with the
same control message level SOL_SOCKET, so they are indistinguishable
for the receiver. A difference in struct cmsgcred and struct sockcred
layouts may lead to unwanted effects.
Now for sockets with LOCAL_CREDS option remove all previous linked
SCM_CREDS control messages and then add a control message with
struct sockcred so the process specifically asked for the peer
credentials by LOCAL_CREDS option always gets struct sockcred.
PR: kern/90800
Submitted by: Andrey Simonenko
Regres. tests: tools/regression/sockets/unix_cmsg/
MFC after: 1 month
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.
Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
timeslice and interaction detecting algorithm are based
on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.
with firmware_unregister(). Previously when the last driver reference
had been dropped we would clear the list entry under the assumption
that the firmware module was about to be unloaded, but this was not
true if the firmware image had been loaded manually with kldload.
This makes it possible to manually kldload firmware images as a
workaround for drivers such as ipw that attempt to load firmware
while resuming after a suspend.
Reviewed by: mlaier (an earlier version of the patch)
- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in
uipc_socket.c.
- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.
- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.
- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket
implementation.
After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.
MFC after: 1 month
non-intuitive for the ~ to be built into the mask. All the users now
explicitly ~ the mask. In addition, add MTX_UNOWNED to the mask even
though it technically isn't a flag. This should unbreak mtx_owner().
Quickly spotted by: kris
forget to unbusy file system before its destruction.
This fixes the following warning on mount failure:
Mount point <X> had 1 dangling refs
Tested by: wkoszek
a) were incorrectly written and therefore never compiled into
assertions, and
b) were incorrectly specified and when compiled resulted in a
failed assertion.
vmspace_exitfree() and vmspace_free() which could result in the same
vmspace being freed twice.
Factor out part of exit1() into new function vmspace_exit(). Attach
to vmspace0 to allow old vmspace to be freed earlier.
Add new function, vmspace_acquire_ref(), for obtaining a vmspace
reference for a vmspace belonging to another process. Avoid changing
vmspace refcount from 0 to 1 since that could also lead to the same
vmspace being freed twice.
Change vmtotal() and swapout_procs() to use vmspace_acquire_ref().
Reviewed by: alc
lookup, rename, strategy, islocked
The missing % sign meant that the lines were processed as plain
comments and the corresponding assertions were never generated.
This used to make syscons switch to vty0 when we entered DDB but this
was lost in the KDB shuffle. We may want to bring it back down the road
but it should be done by calling cn_init_t/cn_term_t instead, possibly
with a flag argument saying "Debugger!"
sendfile(). This causes sendfile() to use the file descriptor
reference to the socket instead of bumping the socket reference
count, which avoids an additional refcount operation, as well as a
potential expensive socket refcount drop, which can lead to
contention on the accept mutex. This change also has the side
effect of further reducing the number of cases where an in-progress
I/O operation can occur on a socket after close, as using the file
descriptor refcount prevents the socket from closing while in use.
MFC after: 3 months
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.
This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.
Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by: tegge@
MFC after: 7 days
stopped before adjusting their priority and setting them on the run
q so they cannot race for resources (pointed out by njl).
While here add a console printf on thread create fails; otherwise
noone may notice (e.g. return value is always 0 and caller has no
way to verify).
Reviewed by: jhb, scottl
MFC after: 2 weeks
mount(2) system call:
* Add cmount hook to fdescfs and pseudofs (and, by extension, procfs and
linprocfs). This (mostly) restores the ability to mount these
filesystems using the old mount(2) system call (see below for the
rest of the fix).
* Remove not-NULL check for the data argument from the mount(2) entry
point. Per the mount(2) man page, it is up to the individual
filesystem being mounted to verify data. Or, in the case of procfs,
etc. the filesystem is free to ignore the data parameter if it does
not use it. Enforcing data to be not-NULL in the mount(2) system call
entry point prevented passing NULL to filesystems which ignored the
data pointer value. Apparently, passing NULL was common practice
in such cases, as even our own mount_std(8) used to do it in the
pre-nmount(2) world.
All userland programs in the tree were converted to nmount(2) long ago,
but I've found at least one external program which broke due to this
(presumably unintentional) mount(2) API change. One could argue that
external programs should also be converted to nmount(2), but then there
isn't much point in keeping the mount(2) interface for backward
compatibility if it isn't backward compatible.
When porting FreeBSD to a new platform, one of the more useful things to do is
get mi_startup() to let you know which SYSINIT it's up to. Most people tend to
whack a printf in the SYSINIT loop to print the address of the function it's
about to call. Going one better, jhb made a version that uses DDB to look up
the name of the function and print that instead. This version is essentially
his with the addition of some ifdeffery to make it optional and to allow it to
work (although using only the function address, not the symbol) if you forgot
to enable DDB.
All the cool bits by: jhb
Approved by: scottl, rink, cognet, imp
vn_start_write() is always called earlier in the code path and calling
the function recursively may lead to a deadlock.
Confirmed by: tegge
MFC after: 2 weeks
vn_finished_write() should also be called only then.
BTW. I fixed two functions here: vn_rdwr() and vn_write(). The latter seems
to be unused.
MFC after: 3 weeks
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.
than trying to optimize it into a single lock. This adds more calls to
lock giant with non smpsafe filesystems but is the only way to reliably
hold the correct lock.
- Remove an invalid assert in the mountedhere case in lookup and fix the
code to properly deal with the scenario. We can actually have a lookup
that returns dp == dvp with mountedhere set with certain unmount races.
Tested by: kris
Reported by: kris/mohans
problems in ddb:
- "show threadchain [thread]" will start with the specified thread (or the
current kdb thread by default) and show it's state. If it is blocked on
a lock, it will find the owner of the lock and show its state, etc.
- "show allchains" will find all of the threads that are blocked on a
lock (but do not have any threads blocked on a lock they hold) and show
the resulting thread chain.
- "show lockchain <lock>" takes a pointer to a lock_object (such as a
mutex or rwlock). If there is a turnstile for that lock, then it will
display all the threads blocked on the lock. In addition, for each
thread blocked on the lock, it will display any contested locks they
hold, and recurse on those locks to show any threads blocked on those
locks, etc.
file lock, in the style of fgetsock().
Modify accept1() to use getsock() instead of fgetsock(), relying on the
file descriptor reference rather than an acquired socket reference to
prevent the listen socket from being destroyed during accept(). This
avoids additional reference count operations, which should improve
performance, and also avoids accept1() operating on a socket whose file
descriptor has been torn down, which may have resulted in protocol
shutdown starting.
MFC after: 3 months
function along with the remainder of the reference checking code. Move
comment from body to header with remainder of comments. Inclusion of a
socket in a completed connection queue counts as a true reference, and
should not be handled as an under-documented edge case.
MFC after: 3 months
locked. In general the adaptive spinning is similar to the same code
for mutexes with some extra trickiness in rw_wunlock_hard(). Specifically,
even though both wait bits might be set and we might have a turnstile with
at least one waiting thread, there might not be any threads blocked on the
queue we are not waking up (they might all be spinning), and we should
only preserve the waiting flag for the queue we aren't waking up if there
are in fact threads blocked on that queue. Secondly, there might not be
any threads blocked on the queue we have chosen to waken threads from
(there might only be threads blocked on the other queue and the threads
for this queue are all spinning) in which case we disown the turnstile
instead of doing a braodcast and unpend.
use it in places that only care about the write owner instead of
rw_owner() as a baby step towards limited read-lock owner.
- Tidy the code that sets the WAITER flag bits to not duplicate a test
around the atomic operation and the KTR trace in both of the lock
functions.
with a given module_t. I use this in some the MOD_LOAD event handler for
some test kernel modules to ask the kernel linker to look up the linker
sets in my test modules. (I use linker sets to generate the list of
possible events that I then signal to execute via a sysctl. On non-amd64,
ld(8) would resolve the entire linker set, but on amd64 I have to ask the
kernel linker to do it for me, and having the kernel linker do it works on
all archs.)
if the specified priority is zero. This avoids a race where the calling
thread could read a snapshot of it's current priority, then a different
thread could change the first thread's priority, then the original thread
would call sched_prio() inside msleep() undoing the change made by the
second thread. I used a priority of zero as no thread that calls msleep()
or tsleep() should be specifying a priority of zero anyway.
The various places that passed 'curthread->td_priority' or some variant
as the priority now pass 0.
compiler doesn't decide to cache td_state. Cachine the state would cause
the spinning thread to not notice when the owning thread stopped executing
(if it was preempted for example) which could result in livelock.
than keeping it locked until we exit the function to optimize the case
where the lock would be dropped and later reacquired. The optimization
was broken when kevent's were moved from UFS to VFS and the knote list
lock for a vnode kevent became the lockmgr vnode lock. If one tried
to use a kqueue that contained events for a kqueue fd followed by a vnode,
then the kq global lock would end up being held when the vnode lock was
acquired which could result in sleeping with a mutex held (and subsequent
panics) if the vnode lock was contested.
Reviewed by: jmg
Tested by: ps (on 6.x)
MFC after: 3 days
not need to clear it now, this should fix panic when msleep is recursivly
called. Patch is slightly adjusted after review.
Reviewed by: jhb
Tested by: Csaba Henk, csaba-ml at creo.hu
MFC after: 3 days
doesn't appear to be protecting anything. Most of consumers funsetownlst(9)
do not appear to be picking up Giant anywhere. This was originally a part
of my Giant exit(2) clean up revision 1.272 but I thought it was a good idea
to leave it out until we were able to analyze it better.
Tested by: kris
MFC after: 3 weeks
as being undocumented in Stevens, and was broken in 1997 during network
stack infrastructure work. It is the one remaining (and incorrect)
direct protocol reference to raw_usrreq.pru_attach; this is incorrect
because the raw socket code assumes that raw_uattach is called only after
the protocol has allocated a PCB.
MFC after: 3 months
recycling for an unrelated filesystem. I really don't like potentially
acquiring giant in the context of a giantless filesystem but there
are reasonable objections to removing the recycling from this path.
Sponsored by: Isilon Systems, Inc.
PCB in which the context of stopped CPUs is stored. To access this
PCB from KDB, we introduce a new define, called KDB_STOPPEDPCB. The
definition, when present, lives in <machine/kdb.h> and abstracts
where MD code saves the context. Define KDB_STOPPEDPCB on i386,
amd64, alpha and sparc64 in accordance to previous code.
intr_disable() and intr_restore() resp. Previously, critical
regions would have interrupts disabled, but that was changed.
Consequently, the debugger could run with interrupts enabled.
This could cause problems for the low-level console code where
received characters would trigger an interrupt that causes
the interrupt handler to read the character instead of the
cngetc() function.
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.
MFC after: 3 months
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.
This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.
MFC after: 3 months
the file descriptor reference, rather than paying additional lock
operations to acquire a socket reference from the file descriptor.
This will also help to ensure that file descriptor based socket
requests are not delivered to a socket after close. Most consumers
have already been converted to this model.
MFC after: 3 months
be present at this point. We will eventually remove this assert because
the socket layer should never look at so_pcb, but for now it's a useful
debugging tool.
MFC after: 3 months
socket calls relating to the creation and destruction of sockets. This
will eventually form the foundation of socket(9), but is currently in too
much flux to do so.
MFC after: 3 months
called.
- vfs_getvfs has to return a reference to prevent the returned mountpoint
from changing identities.
- Release references acquired via vfs_getvfs.
Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.
mount memory from being reclaimed. This resolves a number of race
conditions described in vfs_default.c and introduced with the
VFS_LOCK_GIANT macros.
- Let the mtx and lock remain valid after the mount structure has been
freed by using init and fini calls. Technically fini will never be
called but is included for completeness.
- Consistently use lockmgr directly rather than lockmgr to lock and
vfs_unbusy to unlock.
Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.
- Move the vn_lock of the dvp until after we've unbusied the filesystem
to avoid a LOR with the mount point lock.
- In the v_mountedhere while loop we acquire a new instance of giant each
time through without releasing the first. This would cause us to leak
Giant.
Sponsored by: Isilon Systems, Inc.
requires Giant. It is set in bgetvp and cleared in brelvp.
- Create QUEUE_DIRTY_GIANT for dirty buffers that require giant.
- In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and
only if we think there are buffers in that queue.
Sponsored by: Isilon Systems, Inc.
failing, print a message when we fail for some reason as most callers do
not check the return value (e.g. 'cuz they're called from SYSINIT)
Reviewed by: scottl
MFC after: 1 week
controllers typically have multiple channels and support a number
of serial communications protocols. The scc(4) driver is itself
an umbrella driver that delegates the control over each channel
and mode to a subordinate driver (like uart(4)).
The scc(4) driver supports the Siemens SAB 82532 and the Zilog
Z8530 and replaces puc(4) for these devices.
a lock's priority to a sleeping thread. When we panic, dump a stack
trace of the thread that is asleep if DDB is compiled into the kernel
just before calling panic(). This is much more informative and useful
for debugging than the current behavior of getting a page fault and not
having an easy way of determining which thread caused the original problem.
MFC after: 1 week
a race where data could come in before we clear the INFLUX flag, and get
skipped over by knote (and hence never be activated, though it should of
been)...
Found by: glebius & co.
Reviewed by: glebius
MFC after: 3 days
generating a coredump as the result of a signal.
- Fix a bug where we could leak a Giant lock if vn_start_write() failed
in coredump().
Reported by: jmg (2)
and use that instead of testing fdidx against -1 to determine if it should
release Giant if Giant was locked due to the requested file residing on a
non-MPSAFE VFS.
Discussed with: jeff
arguments. The first one is never used (all callers pass in 0); the
second is sometimes used to pass in a struct timespec * which is used as
a timeout and never modified. Constify that argument so callers can pass
a const struct timespec * without jumping through hoops.
acquiring Giant in kern_sendfile().
Guard against the forced reclamation of a vnode in kern_sendfile().
Discussed with: jeff
Reviewed by: tegge
MFC after: 3 weeks
REGRESSION is enabled, allows user space to dictate that sonewconn()
should skip it's "skip the hard work" check to see if the listen
queue is full, and instead proceed with allocation of a socket and
trimming of the overflowed queue. This makes it easier to test the
queue overflow logic.
MFC after: 1 month
Kernel changes:
Inform hwpmc of executable objects brought into the system by
kldload() and mmap(), and of their removal by kldunload() and
munmap(). A helper function linker_hwpmc_list_objects() has been
added to "sys/kern/kern_linker.c" and is used by hwpmc to retrieve
the list of currently loaded kernel modules.
The unused `MAPPINGCHANGE' event has been deprecated in favour
of separate `MAP_IN' and `MAP_OUT' events; this change reduces
space wastage in the log.
Bump the hwpmc's ABI version to "2.0.00". Teach hwpmc(4) to
handle the map change callbacks.
Change the default per-cpu sample buffer size to hold
32 samples (up from 16).
Increment __FreeBSD_version.
libpmc(3) changes:
Update libpmc(3) to deal with the new events in the log file; bring
the pmclog(3) manual page in sync with the code.
pmcstat(8) changes:
Introduce new options to pmcstat(8): "-r" (root fs path), "-M"
(mapfile name), "-q"/"-v" (verbosity control). Option "-k" now
takes a kernel directory as its argument but will also work with
the older invocation syntax.
Rework string handling in pmcstat(8) to use an opaque type for
interned strings. Clean up ELF parsing code and add support for
tracking dynamic object mappings reported by a v2.0.00 hwpmc(4).
Report statistics at the end of a log conversion run depending
on the requested verbosity level.
Reviewed by: jhb, dds (kernel parts of an earlier patch)
Tested by: gallatin (earlier patch)
VFS_LOCK_GIANT/VFS_UNLOCK_GIANT calls. This completely removes Giant
acquisition in the syscall path for ffs.
Bug fix to kern_fhstatfs from: Todd Miller <Todd.Miller@sparta.com>
Sponsored by: Isilon Systems, Inc.
"fdinit() fails to initialize newfdp->fd_fd.fd_lastfile to -1. This breaks
fdcopy() which will incorrectly set newfdp->fd_freefile to 1 if no files are
open and the last file descriptor marked as unused for fdp was 0. This later
causes descriptor 0 to be unavailable in newfdp when the optimization is
enabled.
When the last file descriptor previously marked as used is nonzero and marked
as unused, fdunused() incorrectly sets fdp->fd_lastfile to fd - 1 due to
fd_last_used() returning (size - 1). This hides the problem that breaks the
optimization."
This allows us to keep the optimization, while un-breaking it.
This is a RELENG_6 candidate.
PR: kern/87208
MFC after: 1 week
Submitted by: tegge
the target directory or file. This case should fail in the filesystem
anyway and perhaps kern_rename() should catch it.
Sponsored by: Isilon Systems, Inc.
really breaking things. Simple "close(0); dup(fd)" does not return descriptor
"0" in some cases. Further, this change also breaks some MAC interactions with
mac_execve_will_transition(). Under certain circumstances, fdcheckstd() can
be called in execve(2) causing an assertion that checks to make sure that
stdin, stdout and stderr reside at indexes 0, 1 and 2 in the process fd table
to fail, resulting in a kernel panic when INVARIANTS is on.
This should also kill the "dup(2) regression on 6.x" show stopper item on the
6.1-RELEASE TODO list.
This is a RELENG_6 candidate.
PR: kern/87208
Silence from: des
MFC after: 1 week
defined for an in-use socket. This allows us to eliminate countless tests
of whether so_pcb is non-NULL, eliminating dozens of error cases. For
now, retain the call to sotryfree() in the uipc_abort() path, but this
will eventually move to soabort().
These new assumptions should be largely correct, and will become more so
as the socket/pcb reference model is fixed. Removing the notion that
so_pcb can be non-NULL is a critical step towards further fine-graining
of the UNIX domain socket locking, as the so_pcb reference no longer
needs to be protected using locks, instead it is a property of the socket
life cycle.
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.
specified, the rightmost option takes effect." Fix code to obey
this. This makes e.g. "mount -r /usr" or "mount -ar" actually
mount file systems read-only.
Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.
Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).
Perform deferred inactive handling after the file system is resumed.
triggers.
This should eliminate all the trivial messages which result from minor
increases in cpu_tick frequency.
Machines which don't du cpu clock fiddling shouldn't issue "backwards"
messages now.
Laptops and other machines where the initial estimate of cputicks may be
waaaay off will still issue warnings.
replacement for vn_write_suspend_wait() to better account for secondary write
processing.
Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.
Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.
whether or not to allocate a full mbuf cluster rather than just a plain
mbuf when adding on additional mbufs in m_getm(). In practice, there wasn't
any resulting mem trashing since m_getm() doesn't ever allocate an mbuf with
a packet header, and MINCLSIZE is the available payload in an mbuf with a
header rather than the available payload in a plain mbuf.
Discussed with: andre (lightly)
Releasing items from the mt_zone can not be done by a simple
uma_zfree() call since mt_zone is allocated with the UMA_ZONE_MALLOC
flag. Use uma_zfree_arg instead and supply the slab.
This bug caused panics in low memory situations on unloading kernel
modules containing MALLOC_DEFINE(..) statements.
Submitted by: ups
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.
Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris
so other threads can not see it if we unlock the proc
lock (this can happen in knlist_delete). Don't do wakeup,
it is not necessary.
2. Decrease kaio_buffer_count in biohelper rather than
doing it in aio_bio_done_notify.
3. In aio_bio_done_notify, don't send notification if KAIO_RUNDOWN
was set, because the process is already in single thread mode.
4. Use assignment to initialize aiothreadflags.
5. AIOCBLIST_RUNDOWN is not useful, axe the code using it.
6. use LIO_NOP instead of zero.
callout_drain() logic. We no longer need a separate non-spin mutex to
do sleep/wakeup with, instead we can now just use the one spin mutex to
manage all the callout functionality.
the last reference is dropped. I forgot that vnodes can stick around
for a very long time until processes discover that they are dead. This
means that a vnode reference is not sufficient to keep the mount
referenced and even more code will be required to ref mount points.
Discovered by: kris
- Reorder the events in exit(2) slightly so that we trigger the S_EXIT
stop event earlier. After we have signalled that, we set P_WEXIT and
then wait for any processes with a hold on the vmspace via PHOLD to
release it. PHOLD now KASSERT()'s that P_WEXIT is clear when it is
invoked, and PRELE now does a wakeup if P_WEXIT is set and p_lock drops
to zero.
- Change proc_rwmem() to require that the processing read from has its
vmspace held via PHOLD by the caller and get rid of all the junk to
screw around with the vmspace reference count as we no longer need it.
- In ptrace() and pseudofs(), treat a process with P_WEXIT set as if it
doesn't exist.
- Only do one PHOLD in kern_ptrace() now, and do it earlier so it covers
FIX_SSTEP() (since on alpha at least this can end up calling proc_rwmem()
to clear an earlier single-step simualted via a breakpoint). We only
do one to avoid races. Also, by making the EINVAL error for unknown
requests be part of the default: case in the switch, the various
switch cases can now just break out to return which removes a _lot_ of
duplicated PRELE and proc unlocks, etc. Also, it fixes at least one bug
where a LWP ptrace command could return EINVAL with the proc lock still
held.
- Changed the locking for ptrace_single_step(), ptrace_set_pc(), and
ptrace_clear_single_step() to always be called with the proc lock
held (it was a mixed bag previously). Alpha and arm have to drop
the lock while the mess around with breakpoints, but other archs
avoid extra lock release/acquires in ptrace(). I did have to fix a
couple of other consumers in kern_kse and a few other places to
hold the proc lock and PHOLD.
Tested by: ps (1 mostly, but some bits of 2-4 as well)
MFC after: 1 week
modules prior to looking up the directory which we will cover to avoid
this problem in mount.
- We must hold the coveredvp locked before we can busy the mountpoint to
prevent a lock order reversal with the vfs_busy() in lookup which holds
the directory lock prior to doing a vfs_busy(). The directory lock is
required to safely clear the v_mountedhere field on the directory.
MFC After: 1 week
prevent the mount point from going away while we're waiting on the lock.
The ref does not need to persist once we have the lock because the
lock prevents the mount point from being unmounted.
MFC After: 1 week
the VFS_STATFS call to prevent the mount from disappearing while we're
stating.
- Convert these routines to use MPSAFE namei semantics.
MFC After: 1 week