This information will be very useful for people tuning applications
that depend on IPC mechanisms.
The following OIDs were documented:
Message queues:
kern.ipc.msgmax
kern.ipc.msgmni
kern.ipc.msgmnb
kern.ipc.msgtql
kern.ipc.msgssz
kern.ipc.msgseg
Semaphores:
kern.ipc.semmap
kern.ipc.semmni
kern.ipc.semmns
kern.ipc.semmnu
kern.ipc.semmsl
kern.ipc.semopm
kern.ipc.semume
kern.ipc.semusz
kern.ipc.semvmx
kern.ipc.semaem
Shared memory:
kern.ipc.shmmax
kern.ipc.shmmin
kern.ipc.shmmni
kern.ipc.shmseg
kern.ipc.shmall
kern.ipc.shm_use_phys
kern.ipc.shm_allow_removed
kern.ipc.shmsegs
These new descriptions can be viewed using sysctl -d.
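For example (illustrative output; the exact wording comes from the committed
descriptions):

    $ sysctl -d kern.ipc.shmmax
    kern.ipc.shmmax: Maximum shared memory segment size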
PR: kern/65219
Submitted by: Dan Nelson <dnelson at allantgroup dot com> (modified)
No objections: developers@
Descriptions reviewed by: gnn
MFC after: 1 week
suid application. The problem is that Linux applications using old Linux
threads (pre-NPTL) use signal 32 (linux SIGRTMIN) for communication between
thread-processes. If such a Linux application is installed suid or sgid
and security.bsd.conservative_signals=1 (default), then permission will be
denied to send such a signal and the application will freeze.
I believe the same will be true for native applications that use libthr,
since libthr uses SIGTHR for implementing condition variables.
PR: 72922
Submitted by: Andriy Gapon <avg@icyb.net.ua>
MFC after: 2 weeks
list, set `curr_callout' to NULL. This ensures that we won't attempt
to cancel the current callout if the original callout structure
gets recycled while we wait to acquire Giant.
This is reported to fix an intermittent syscons problem that was
introduced by revision 1.96.
do not need to perform an extra memory fetch in the Packet (Mbuf+Cluster)
constructor to initialize the reference counter anymore. The reference
counts are located in a separate memory region (in the slab header,
because this zone is UMA_ZONE_REFCNT), so the memory fetch resulted very
often in a cache miss. Additionally, and perhaps more significantly,
optimize the free mbuf+cluster (packet) case, which is very common, to
no longer require an atomic operation on free (to verify the reference
counter) if the reference on the cluster has never been increased (also
very common). Reduces an atomic on mbuf free on average.
Original patch submitted by: Gerrit Nagelhout <gnagelhout@sandvine.com>
behaviour of chflags within a jail. If set to 0 (the default), then a
jailed root user is treated as an unprivileged user; if set to 1, then
a jailed root user is treated the same as an unjailed root user.
This is necessary to allow "make installworld" to work inside a jail,
since it attempts to manipulate the system immutable flag on certain
files.
Discussed with: csjp, rwatson
MFC after: 2 weeks
Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.
Remove all the background write related stuff from the normal bufwrite.
This drags the softdep_move_dependencies() back into FFS.
Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not creating the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks, doing non-lock-safe things like
copyout() with the data. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatibility system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.
callout is first initialised, using a new function callout_init_mtx().
The callout system will acquire this mutex before calling the callout
function and release it on return.
In addition, the callout system uses the mutex to avoid most of the
complications and race conditions inherent in asynchronous timer
facilities, so mutex-protected callouts have much simpler semantics.
As long as the mutex is held when invoking callout_stop() or
callout_reset(), then these functions will guarantee that the callout
will be stopped, even if softclock() had already begun to process
the callout.
Existing Giant-locked callouts will automatically pick up the new
race-free semantics. This should close a number of race conditions
in the USB code and probably other areas of the kernel too.
There should be no change in behaviour for "MP-safe" callouts; these
still need to use the techniques mentioned in timeout(9) to avoid
race conditions.
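For illustration, a minimal sketch of the new usage (the softc, function and
lock names here are hypothetical, not part of this change):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/callout.h>

    struct foo_softc {
        struct mtx      sc_mtx;
        struct callout  sc_timer;
    };

    static void
    foo_tick(void *arg)
    {
        struct foo_softc *sc = arg;

        /* sc_mtx was acquired by the callout system before this call. */
        callout_reset(&sc->sc_timer, hz, foo_tick, sc);
    }

    static void
    foo_start(struct foo_softc *sc)
    {
        mtx_init(&sc->sc_mtx, "foo timer", NULL, MTX_DEF);
        callout_init_mtx(&sc->sc_timer, &sc->sc_mtx, 0);
        mtx_lock(&sc->sc_mtx);
        callout_reset(&sc->sc_timer, hz, foo_tick, sc);
        mtx_unlock(&sc->sc_mtx);
    }

    static void
    foo_stop(struct foo_softc *sc)
    {
        mtx_lock(&sc->sc_mtx);
        callout_stop(&sc->sc_timer);    /* guaranteed not to run afterwards */
        mtx_unlock(&sc->sc_mtx);
    }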
frequency as a percentage of the base rate and do not change the base
rate directly. The cpufreq framework combines these with absolute drivers
to produce synthesized levels made of one or more settings.
select the CPU frequency level (say for cooling). The driver interface
allows hardware drivers to announce themselves as capable of adjusting
an individual frequency setting.
- Add buffer size limitations (overflow will not be possible anymore).
- Add 'visible' option, which will allow for passphrase reading in the
future.
- Remove special treatment of '@' and '#', those two are only confusing.
Discussed with: rwatson
MFC after: 2 weeks
tond and not fromnd. This could lead us to leak Giant, or unlock it
twice, depending on the filesystems involved. Renames within a single
filesystem would not have caused any problems.
Sponsored by: Isilon Systems, Inc.
all reserved, as the license makes clear), and strike the third clause
(now this is a 2-clause liberal BSDL as are the rest of files I hold
copyright over).
copies arguments into the kernel space and one that operates
completely in the kernel space;
o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.
Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks
Add minor2unit() in addition to dev2unit() and unit2minor().
If it weren't such a hassle we would redefine minor numbers in
the kernel without the gap for the major number, but it's not worth
the bother (yet).
a process return to userspace if it had pending GEOM events.
We need to have the same check in the exit pass to catch the case
where a GEOM related filedescriptor is not explicitly closed by
the process.
Bumped into by: people using dd(1) to build releases, nanobsd etc.
from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrapper
of that syscall.
MFC after: 2 weeks
pops data from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrappers
of those syscalls.
MFC after: 2 weeks
missed that when the vnode bypass was introduced.
Deal with zero length transfers before we even get to fo_ops->fo_read().
Found by: Slawa Olhovchenkov <slwzxy.spb.ru@zxy.spb.ru>
PR: 75758
the name Sande^H^H^H^H^Hvnode_create_vobject().
Make the new function take a size argument which removes the need for
a VOP_STAT() or a very pessimistic guess for disks.
Call that new function from vop_stdcreatevobject().
Make vnode_pager_alloc() private now that its only user came home.
short to unsigned short.
- Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > USHRT_MAX.
Before this change, setting somaxconn to something above 32767 and calling
listen(fd, -1) led to a socket which did not accept connections at all.
Reviewed by: rwatson
Reported by: Igor Sysoev
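For reference, a range-checked handler has roughly this shape (a sketch only;
the variable type, limits and description string here are illustrative, not
the committed code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>
    #include <sys/socket.h>
    #include <sys/limits.h>
    #include <sys/errno.h>

    SYSCTL_DECL(_kern_ipc);

    static u_short somaxconn = SOMAXCONN;

    static int
    sysctl_somaxconn(SYSCTL_HANDLER_ARGS)
    {
        int error, val;

        val = somaxconn;
        error = sysctl_handle_int(oidp, &val, 0, req);
        if (error || !req->newptr)
            return (error);
        if (val < 1 || val > USHRT_MAX)
            return (EINVAL);
        somaxconn = val;
        return (0);
    }

    SYSCTL_PROC(_kern_ipc, KIPC_SOMAXCONN, somaxconn,
        CTLTYPE_INT | CTLFLAG_RW, 0, sizeof(int), sysctl_somaxconn, "I",
        "Maximum pending socket connection queue size");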
- Remove some KASSERTs which are invalid if the appropriate lock is
not held.
- Slightly restructure bremfree() so that it is more sane.
- Change the flush code in bdwrite() to avoid acquiring a mutex
whenever possible.
- Change the flush code in bdwrite() to avoid holding the bufobj mutex
while calling buf_countdeps(). This introduces a lock-order
relationship with the softdep lock that can not otherwise be resolved.
- Don't set B_DONE until bufdone() is complete, otherwise another
processor may believe the buf is done before it is.
- Only acquire Giant if the caller has set b_iodone. Don't grab giant
around normal bufdone() calls.
Sponsored By: Isilon Systems, Inc.
to off.
- Protect access to mnt_kern_flag with the mountpoint mutex.
- Remove some KASSERTs which are not legal checks without the appropriate
locks held.
- Use VCANRECYCLE() rather than rolling several slightly different
checks together.
- Return from vtryrecycle() with a recycled vnode rather than a locked
vnode. This simplifies some locking.
- Remove several GIANT_REQUIRED lines.
- Add a few KASSERTs to help with INACT debugging.
Sponsored By: Isilon Systems, Inc.
- Protect access to mnt_kern_flag with the mountpoint mutex.
- Use the appropriate nd flags to deal with giant in vn_open_cred().
We currently determine whether the caller is mpsafe by checking
for a valid fdidx. Any caller coming from user-space is now
mpsafe and supplies a valid fd. No kernel callers have been
converted to mpsafe, so this check is sufficient for now.
- Use VFS_LOCK_GIANT instead of manual giant acquisition where
appropriate.
Sponsored By: Isilon Systems, Inc.
require it.
- Track the status of Giant with the nd flag HASGIANT.
- Release giant on return from namei() when callers are not marked MPSAFE, as
they already own giant.
Sponsored By: Isilon Systems, Inc.
vnode lock is much simpler than I originally thought it would be.
Now, the cache lock is always acquired before the vnode lock.
- Provide some gotos in __getcwd() to simplify the unlocking a bit.
- Move Giant acquisition down into __getcwd().
Sponsored By: Isilon Systems, Inc.
if the lockmgr interlock is dropped after the caller's interlock
is dropped.
- Change some lockmgr KTRs to be slightly more helpful.
Sponsored By: Isilon Systems, Inc.
witness_proc_has_locks(), as they are unused, which results in a compiler
error. This problem was introduced with the implementation of "show
alllocks".
Spotted by: Artem Kuchin <matrix at itlegion dot ru>
designed to help detect tamper-after-free scenarios, a problem more
and more common and likely with multithreaded kernels where race
conditions are more prevalent.
Currently MemGuard can only take over malloc()/realloc()/free() for a
particular malloc type (or types), and the code brought in with this
change manually instruments it to take over M_SUBPROC allocations
as an example. If you are planning to use it, for now you must:
1) Put "options DEBUG_MEMGUARD" in your kernel config.
2) Edit src/sys/kern/kern_malloc.c manually, look for
"XXX CHANGEME" and replace the M_SUBPROC comparison with
the appropriate malloc type (this might require additional
but small/simple code modification if, say, the malloc type
is declared out of scope).
3) Build and install your kernel. Tune the vm.memguard_divisor
boot-time tunable, which is used to scale how much of kmem_map
you want to allot for MemGuard's use. The default is 10,
so kmem_size/10.
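For step 2, the instrumentation amounts to something like the following
(an illustrative wrapper only; in the real tree the equivalent checks are
spliced directly into malloc() and free() at the "XXX CHANGEME" markers, and
the memguard header path and prototypes should be checked against the source):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <vm/memguard.h>

    MALLOC_DEFINE(M_FOO, "foo", "hypothetical type policed by MemGuard");

    static void *
    foo_alloc(unsigned long size, int flags)
    {
    #ifdef DEBUG_MEMGUARD
        return (memguard_alloc(size, flags));   /* route through MemGuard */
    #else
        return (malloc(size, M_FOO, flags));
    #endif
    }

    static void
    foo_free(void *addr)
    {
    #ifdef DEBUG_MEMGUARD
        memguard_free(addr);
    #else
        free(addr, M_FOO);
    #endif
    }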
ToDo:
1) Bring in a memguard(9) man page.
2) Better instrumentation (e.g., boot-time) of MemGuard taking
over malloc types.
3) Teach UMA about MemGuard to allow MemGuard to override zone
allocations too.
4) Improve MemGuard if necessary.
This work is partly based on some old patches from Ian Dowse.
and always has been, but the system call itself returns
errno in a register so the problem is really a function of
libc, not the system call.
Discussed with: Matthew Dillon <dillon@apollo.backplane.com>
they both happen before pipe backing allocation occurs. Previously,
a pipe memory shortage would cause a panic due to a KNOTE call
on an uninitialized si_note.
Reported by: Peter Holm
MFC after: 1 week
unhappiness lately.
As far as I can tell, no files that have made it safely to disk
have been endangered, but stuff in transit has been in peril.
Pointy hat: phk
and KASSERT coverage.
After this check there is only one "nasty" cast in this code but there
is a KASSERT to protect against the wrong argument structure behind
that cast.
Un-inlining the meat of VOP_FOO() saves 35kB of text segment on a typical
kernel with no change in performance.
We also now run the checking and tracing on VOP's which have been layered
by nullfs, umapfs, deadfs or unionfs.
Add new (non-inline) VOP_FOO_AP() functions which take a "struct
foo_args" argument and do everything the VOP_FOO() macros
used to do with checks and debugging code.
Add a KASSERT to VOP_FOO_AP() checking that the argument type is
correct.
Slim down VOP_FOO() inline functions to just stuff arguments
into the struct foo_args and call VOP_FOO_AP().
Put function pointer to VOP_FOO_AP() into vop_foo_desc structure
and make VCALL() use it instead of the current offsetoff() hack.
Retire vcall() which implemented the offsetoff() hack.
Make deadfs and unionfs use VOP_FOO_AP() calls instead of
VCALL(), we know which specific call we want already.
Remove unneeded arguments to VCALL() in nullfs and umapfs bypass
functions.
Remove unused vdesc_offset and VOFFSET().
Generally improve style/readability of the generated code.
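The generated code now has roughly this shape, shown here for a hypothetical
VOP_READ (a sketch of the pattern, not the literal vnode_if.h output):

    struct vop_read_args {
        struct vop_generic_args a_gen;  /* carries a_desc */
        struct vnode *a_vp;
        struct uio *a_uio;
        int a_ioflag;
        struct ucred *a_cred;
    };

    /* Non-inline worker: argument asserts, debug hooks and the dispatch. */
    int VOP_READ_AP(struct vop_read_args *ap);

    /* Slimmed-down inline: just package the arguments and call the worker. */
    static __inline int
    VOP_READ(struct vnode *vp, struct uio *uio, int ioflag, struct ucred *cred)
    {
        struct vop_read_args a;

        a.a_gen.a_desc = &vop_read_desc;
        a.a_vp = vp;
        a.a_uio = uio;
        a.a_ioflag = ioflag;
        a.a_cred = cred;
        return (VOP_READ_AP(&a));
    }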
up its pending error state, which may be set in some rare conditions, resulting
in the connect() syscall returning that bogus error and making the application
believe that the attempt to change the association has failed, while in fact it
has not. There is a sockets/reconnect regression test which exercises this bug.
MFC after: 2 weeks
errno can potentially be tampered with by a nested signal handler.
Now all error codes are returned as negative values; positive values are
reserved for future expansion.
TAILQ_FOREACH_SAFE().
Lose the error pointer argument and return any errors the normal way.
Return EAGAIN for the case where more work needs to be done.
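TAILQ_FOREACH_SAFE() allows the current element to be unlinked (and freed)
while walking the list, e.g. (an illustrative fragment, not the changed code):

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/queue.h>

    struct foo {
        TAILQ_ENTRY(foo) f_link;
        int f_done;
    };
    TAILQ_HEAD(foolist, foo);

    static void
    foo_reap(struct foolist *head)
    {
        struct foo *fp, *fp_temp;

        TAILQ_FOREACH_SAFE(fp, head, f_link, fp_temp) {
            if (fp->f_done) {
                /* Safe: the next pointer was saved in fp_temp. */
                TAILQ_REMOVE(head, fp, f_link);
                free(fp, M_TEMP);
            }
        }
    }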
I'm not sure why a credential was added to these in the first place; it is
not used anywhere and it doesn't make much sense:
The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.
Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.
If the filesystem implementation needs a particular credential
to carry out the syncing, it would logically have to be the
cached mount credential, or a credential cached along with
any delayed write data.
Discussed with: rwatson
before deciding to do more expensive locking to account for process
exit. This acceptable minor race avoids two mutex operations in
the highly common case of accounting not being enabled.
MFC after: 2 weeks
turn it back on. Specifically, the actual changes are now less intrusive
in that the _get_spin_lock() and _rel_spin_lock() macros now have their
contents changed for UP vs SMP kernels which centralizes the changes.
Also, UP kernels do not use _mtx_lock_spin() and no longer include it. The
UP versions of the spin lock functions do not use any atomic operations,
but simple compares and stores which allow mtx_owned() to still work for
spin locks while removing the overhead of atomic operations.
Tested on: i386, alpha
unloaded, cleanup, or return EBUSY if that's inconvenient.' The
default module handler for newbus will now call this when we get a
MOD_QUIESCE event, but in the future may call this at other times.
This shouldn't change any actual behavior until drivers start to use it.
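A module event handler that honours MOD_QUIESCE might look like this (a
sketch with hypothetical names; whether to return EBUSY is the driver's call):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/module.h>
    #include <sys/errno.h>

    static int foo_busy;            /* hypothetical driver state */

    static int
    foo_modevent(module_t mod, int type, void *data)
    {
        switch (type) {
        case MOD_LOAD:
        case MOD_UNLOAD:
            return (0);
        case MOD_QUIESCE:
            /* Refuse to be unloaded while we still have work. */
            return (foo_busy ? EBUSY : 0);
        default:
            return (EOPNOTSUPP);
        }
    }

    static moduledata_t foo_mod = { "foo", foo_modevent, NULL };
    DECLARE_MODULE(foo, foo_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);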
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides whether the thread
can go back to normal mode (if its normal priority is high enough to
satisfy the pending lock requests) or if it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation (a minimal prototype sketch of these interfaces
follows this list).
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another thread's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
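A minimal prototype sketch of the interfaces described above (the exact
signatures are assumptions based on this description, not copied from the
change):

    /* sched(9): lend a priority to a thread / let it give the loan back. */
    void    sched_lend_prio(struct thread *td, u_char prio);
    void    sched_unlend_prio(struct thread *td, u_char prio);

    /* Called by a scheduler after changing the priority of a blocked thread. */
    void    turnstile_adjust(struct thread *td, u_char oldprio);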
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group changes has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
is the case for most other sysctls in the System V IPC message queue
implementation.
PR: 75541
Submitted by: Sergiy Vyshnevetskiy <serg at vostok dot net>
MFC after: 2 weeks
more general than the previous one. It also lets me implement cancellation
points in the thread library. Also, in theory, umtx_lock and umtx_unlock can
be implemented by using umtx_wait and umtx_wake; all atomic operations
can be done in userland without the kernel's casuptr() function.
of lock types in the kernel. This results in an increase of witness
data usage from ~145k to ~280k on i386 for kernels with
'options WITNESS'.
- Remove the unused witness malloc bucket.
Submitted by: Michal Mertl mime at traveller dot cz (1)
- Remove the sched_add wrapper that used sched_add_internal() as a backend.
Its only purpose was to interpret one flag and turn it into an int. Do
the right thing and interpret the flag in sched_add() instead.
- Pass the flag argument to sched_add() to kseq_runq_add() so that we can
get the SRQ_PREEMPT optimization too.
- Add a KEF_INTERNAL flag. If KEF_INTERNAL is set we don't adjust the SLOT
counts, otherwise the slot counts are adjusted as soon as we enter
sched_add() or sched_rem() rather than when the thread is actually placed
on the run queue. This greatly simplifies the handling of slots.
- Remove the explicit prevention of migration for ithreads on non-x86
platforms. This was never shown to have any real benefit.
- Remove the unused class argument to KSE_CAN_MIGRATE().
- Add ktr points for thread migration events.
- Fix a long standing bug on platforms which don't initialize the cpu
topology. The ksg_maxid variable was never correctly set on these
platforms which caused the long term load balancer to never inspect
more than the first group or processor.
- Fix another bug which prevented the long term load balancer from working
properly. If stathz != hz we can't expect sched_clock() to be called
on the exact tick count that we're anticipating.
- Rearrange sched_switch() a bit to reduce indentation levels.
and threads currently holding sleep mutexes (and spin mutexes for
curthread). This can be quite useful in looking for a lock condition
summary for a system, as it avoids manually iterating through threads
and processes to find all the interesting locks.
NB: "alllocks" is up there with "lockedvnods" for a bad argument for
show.
MFC after: 2 weeks
lower the priority of the returning thread to a user priority before
calling into thread_userret() which would call wakeup() which in turn would
cause the returning thread to eventually context switch rather than
completing its slice. Allowing this thread to complete its slice first
yields a 15% performance improvement in super-smack on my dual opteron with
4BSD.
substitute for a global mutex protecting the socket count and
generation number.
The observation that soreceive_rcvoob() can't return an mbuf
chain is a property, not a bug, so remove the XXXRW.
In sorflush, s/existing/previous/ for code when describing prior
behavior.
For SO_LINGER socket option retrieval, remove an XXXRW about why
we hold the mutex: this is correct and not dubious.
MFC after: 2 weeks
unnecessary use of a global variable and simplify the return case.
While here, use ()'s around return values.
In sodealloc(), remove a comment about why we bump the gencnt and
decrement the socket count separately. It doesn't add
substantially to the reading, and clutters the function.
MFC after: 2 weeks
occur between a reader and a writer that results in a panic upon close,
e.g.,
"panic: sbflush_locked: cc 4 || mb 0xffffff0052afa400 || mbcnt 0"
Reviewed by: rwatson@
MFC after: 2 weeks
call mmap() to create a shared space, and then initialize the umtx on it;
after that, threads in different processes can use the umtx just as
threads in the same process do.
2. Introduce a new syscall, _umtx_op, to support timed lock and condition
variable semantics. Also, the original umtx_lock and umtx_unlock inline
functions are now reimplemented by using _umtx_op; _umtx_op can
use an arbitrary id, not just a thread id.
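A rough userland sketch of the process-shared case described in point 1
(the umtx helper calls shown, including umtx_init() and the id argument,
are assumptions about the wrappers mentioned above, not verified signatures):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/umtx.h>

    static struct umtx *
    make_shared_umtx(void)
    {
        struct umtx *mtx;

        /* Anonymous shared memory is visible to every process mapping it. */
        mtx = mmap(NULL, sizeof(*mtx), PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_SHARED, -1, 0);
        if (mtx == MAP_FAILED)
            return (NULL);
        umtx_init(mtx);         /* assumed initializer */
        return (mtx);
    }

    static void
    critical(struct umtx *mtx, u_long id)
    {
        umtx_lock(mtx, id);     /* assumed wrapper signature */
        /* ... touch the shared state ... */
        umtx_unlock(mtx, id);
    }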
nice of 0. Doing so can cause an infinite loop because they should be
running, but a nice -20 process could prevent them from doing so.
- Add a new flag KEF_PRIOELEV to flag a thread that has had its priority
elevated due to priority propagation. If a thread has had its priority
elevated, we assume that it must go on the current queue and it must
get a slice.
- In sched_userret() if our priority was elevated and we shouldn't have
a timeslice, yield here until we should.
Found/Tested by: glebius
which holds on to just the data structure and the mutex. (The
existing refcount (fd_refcnt) holds onto the open files in the
descriptor.)
The fd_holdcnt is protected by fdesc_mtx, fd_refcnt by FILEDESC_LOCK.
Add fdhold(struct proc *) which gets a hold on the filedescriptors of
the specified proc.
Add fddrop(struct filedesc *) which drops the fd_holdcnt and if zero
destroys the mutex and frees the memory.
Initialize the fd_holdcnt to one in fdinit(). Normal operations on
the filedesc structure will not change it.
In fdfree() use fddrop() to dispose of the mutex and structure. Hold
the FILEDESC_LOCK() until we have cleaned out the contents and carefully
set the fields to null values during cleanup.
Use fdhold()/fddrop() in mountcheckdirs() and sysctl_kern_file().
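In outline, the two helpers look something like this (a sketch mirroring the
description above; the return type, field and malloc-type names are
assumptions):

    static struct filedesc *
    fdhold(struct proc *p)
    {
        struct filedesc *fdp;

        mtx_lock(&fdesc_mtx);
        fdp = p->p_fd;
        if (fdp != NULL)
            fdp->fd_holdcnt++;
        mtx_unlock(&fdesc_mtx);
        return (fdp);
    }

    static void
    fddrop(struct filedesc *fdp)
    {
        int i;

        mtx_lock(&fdesc_mtx);
        i = --fdp->fd_holdcnt;
        mtx_unlock(&fdesc_mtx);
        if (i > 0)
            return;
        /* Last hold gone: destroy the mutex and free the structure. */
        mtx_destroy(&fdp->fd_mtx);
        FREE(fdp, M_FILEDESC);
    }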
for ensuring that a process' filedesc is not shared with anybody.
Use it in the two places which previously had private implementations.
This collects all fd_refcnt handling in kern_descrip.c
nice value above 0, set it to 0 so that it may proceed with haste.
This is especially important on ULE, where adjusting the priority
does not guarantee that a thread will be granted a greater time slice.
specifically, vm_pgmoveco():
1. If vm_pgmoveco() sleeps on a busy page, it must redo the look up
because the page may have been freed.
2. If the receive buffer is copy-on-write due to, for example, a fork,
then although the first vm object in the shadow chain may not contain
a page there may still be one from a backing object that is mapped.
Thus, a pmap_remove() is required for the new page rather than the
backing object's page to be seen by the application.
Also, add some comments to vm_pgmoveco() and update some assertions.
Tested by: ken@
completely. For some reason (that I am still curious about) we no longer
manage to finish the initialization before the timeouts run for the first
time, leading to panics when using an uninitialized mutex etc.
The root of this problem is that we currently first link a domain to the
domains list and only later initialize the domain's protocols. This should
be reworked in the future, but with the current API it is not possible in
all situations. We settle for this lazy fix for now.
Tested by: gnn, ru, myself
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:
cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.
ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
Remove vfs_omount() method, all filesystems are now converted.
Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.
Change rootmounting to use DEVFS trampoline:
vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.
Remove now unnecessary getdiskbyname().
kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.
Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.
Remove the rootdev variable, it doesn't give meaning in a global context and
was not trustworthy anyway. Correct information is provided by
statfs(/).
vfs_flagopt() for binary/boolean options.
vfs_getopts() for string options.
vfs_filteropt() to check for unknown options.
vfs_scanopt() for scanf()-like processing of options.
Also add a function for setting the statfs f_mntfromname field.
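A converted filesystem's VFS_MOUNT routine might use these roughly as follows
(a sketch; the filesystem, option names and exact helper signatures here are
illustrative assumptions):

    #include <sys/param.h>
    #include <sys/mount.h>
    #include <sys/proc.h>

    static const char *foofs_opts[] = { "from", "ro", "blocksize", NULL };

    static int
    foofs_mount(struct mount *mp, struct thread *td)
    {
        char *from;
        int error, bsize;

        /* Reject any option we do not understand. */
        error = vfs_filteropt(mp->mnt_optnew, foofs_opts);
        if (error != 0)
            return (error);

        /* String option; *error is set if it is missing. */
        from = vfs_getopts(mp->mnt_optnew, "from", &error);
        if (from == NULL)
            return (error);

        /* scanf()-style option; returns the number of fields converted. */
        if (vfs_scanopt(mp->mnt_optnew, "blocksize", "%d", &bsize) != 1)
            bsize = DEV_BSIZE;

        /* ... do the real work of mounting here ... */

        vfs_mountedfrom(mp, from);
        return (0);
    }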