debug.cpufreq.lowest tunable and sysctl. Some systems seem to have problems
with the lowest frequencies so setting this prevents them from being
available or used.
owned by a process when it forks, and creates a matching set of references
for the child process, as prescribed by POSIX.
In order to avoid races with other threads in the parent process during
fork(), it is necessary to allocate a temporary reference list while
holding the sem_lock, then transfer those references to the new process
once the sem_lock is released. The implementation is inefficient but
appears functional; in order to improve the efficiency, it will be
necessary to modify the existing structures and logic, which generally
rely on O(n) operations over the global set of semaphores.
PAGE_SIZE.
Unlike originator of the PR suggests retain MAXSHELLCMDLEN definition
(he has been proposing to replace it with PAGE_SIZE everywhere), not only
this reduced the diff significantly, but prevents code obfuscation and also
allows to increase/decrease this parameter easily if needed.
PR: kern/64196
Submitted by: Magnus Bäckström <b@etek.chalmers.se>
doesn't change functionality, but makes code more logical.
Obtained from: DrafonFlyBSD
o Use VOP_GETATTR() to obtain actual size of file and parse no more than that.
Previously, we parsed MAXSHELLCMDLEN characters regardless of the actual file
size. This makes the following working:
$ printf '#!/bin/echo' > /tmp/test.sh
$ chmod 755 /tmp/test.sh
$ /tmp/test.sh
Previously, attempts to execve() that shell script has been failing with bogus
ENAMETOOLONG.
PR: kern/64196
Submitted by: Magnus B.ckstr.m <b@etek.chalmers.se>
script. Otherwise it's possible to panic kernel by constructing a shell
script with first line not ending in '\n'.
Also, treat '\0' as line terminating character, which may me useful in
some situations.
Submitted by: gad
when trim'ing space off the back of a chain; this is indirect
solution to a potential null ptr deref
Noticed by: Coverity Prevent analysis tool (null ptr deref)
Reviewed by: dg, rwatson
vn_extattr_rm. This is meant to catch conditions where IO_NODELOCKED
has been specified without the vnode being locked.
Discussed with: rwatson
MFC after: 1 week
object on to the zone allocator. It should be noted that uma_zalloc(9)
uses bzero to zero out the object so there probably wont be any
real performance benefit. If UMA grows the ability to supply
zeroed zones more efficiently in the future, we will not have to
modify all the existing consumers.
Discussed with: rwatson,julian
MFC after: 1 week
has been set. Assert that this is the case so that we catch filesystems
who are using naked VOP_LOCKs in illegal cases.
Sponsored by: Isilon Systems, Inc.
considered to be as good as an exclusive lock, although there is still a
possibility of someone acquiring a VOP LOCK while xlock is held.
Sponsored by: Isilon Systems, Inc.
List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is bargain.
In addition it makes it harder for people to try to do stypid things like
"finding the vnode from cdev".
Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().
rename idestroy_dev() to destroy_devl() for consistency.
Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().
Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();
Let devfs_reclaim() call dev_rel() instead of vgonel().
Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.
Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
only call the protocol's pru_rcvd() if the protocol has the flag
PR_WANTRCVD set. This brings that instance of pru_rcvd() into line with
the rest, which do check the flag.
MFC after: 3 days
of the global UNIX domain socket mutex: no protection is needed that
early in the setup of the UNIX domain socket and socket structures.
MFC after: 3 days
patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
so->so_options when solisten() will succeed, rather than setting it
conditionally based on there not being queued sockets in the completed
socket queue. Otherwise, if the protocol exposes new sockets via the
completed queue before solisten() completes, the listen() system call
will succeed, but the socket and protocol state will be out of sync.
For TCP, this didn't happen in practice, as the TCP code will panic if
a new connection comes in after the tcpcb has been transitioned to a
listening state but the socket doesn't have SO_ACCEPTCONN set.
This is historical behavior resulting from bitrot since 4.3BSD, in which
that line of code was associated with the conditional NULL'ing of the
connection queue pointers (one-time initialization to be performed
during the transition to a listening socket), which are now initialized
separately.
Discussed with: fenner, gnn
MFC after: 3 days
driver. This used to be handled by cpufreq_drv_settings() but it's
useful to get the type/flags separately from getting the settings.
(For example, you don't have to pass an array of cf_setting just to find
the driver type.)
Use this new method in our in-tree drivers to detect reliably if acpi_perf
is present and owns the hardware. This simplifies logic in drivers as well
as fixing a bug introduced in my last commit where too many drivers attached.
soref() to also covering the update of so_state. While no other user
threads can update the socket state here as it's not yet hooked up to
the file descriptor array yet, the protocol could also frob the
socket state here, leading to a lost update to the so_state field.
No reported instances of this bug (as yet).
MFC after: 3 days
connection status before inserting the new socket into the listen
socket's accept queue, or there might be a race in which another thread
wakes up when the accept lock is released, and sees the socket before its
state is set correctly. The wakeup still occurs after the accept lock is
released. There have been no diagnoses of this bug in real-world systems
(as yet).
MFC after: 3 days
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:
sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h
the rate for the 100% state once. Afterwards, use that value for deriving
states. This should fix the problem where the calibrated frequency was
different once a switch was done, giving a different set of levels each
time. Also, properly search for the right cpufreqX device when detaching.
override the current freq level temporarily and restore it when the
higher priority condition is past. Note that only the first overridden
value is saved. Callers pass NULL to CPUFREQ_SET to restore the saved
level. Priorities are not yet used so this commit should have no effect.
are not added to the list(s) of available settings. However, other drivers
can call the CPUFREQ_DRV_SETTINGS() method on those devices directly to
get info about available settings.
Update the acpi_perf(4) driver to use this flag in the presence of
"functional fixed hardware." Thus, future drivers like Powernow can
query acpi_perf for platform info but perform frequency transitions
themselves.
on dev.cpu.0 will affect all of the CPUs together. In the future,
independent control will be supported but this is good enough for now.
Check that the timecounter isn't TSC before switching (from Colin Percival.)
former is callable from user space and the latter from the kernel one. Make
kernel version take additional argument which tells if the respective call
should check for additional restrictions for sending signals to suid/sugid
applications or not.
Make all emulation layers using non-checked version, since signal numbers in
emulation layers can have different meaning that in native mode and such
protection can cause misbehaviour.
As a result remove LIBTHR from the signals allowed to be delivered to a
suid/sugid application.
Requested (sorta) by: rwatson
MFC after: 2 weeks
This information will be very useful for people who are tuning applications
which have a dependence on IPC mechanisms.
The following OIDs were documented:
Message queues:
kern.ipc.msgmax
kern.ipc.msgmni
kern.ipc.msgmnb
kern.ipc.msgtlq
kern.ipc.msgssz
kern.ipc.msgseg
Semaphores:
kern.ipc.semmap
kern.ipc.semmni
kern.ipc.semmns
kern.ipc.semmnu
kern.ipc.semmsl
kern.ipc.semopm
kern.ipc.semume
kern.ipc.semusz
kern.ipc.semvmx
kern.ipc.semaem
Shared memory:
kern.ipc.shmmax
kern.ipc.shmmin
kern.ipc.shmmni
kern.ipc.shmseg
kern.ipc.shmall
kern.ipc.shm_use_phys
kern.ipc.shm_allow_removed
kern.ipc.shmsegs
These new descriptions can be viewed using sysctl -d
PR: kern/65219
Submitted by: Dan Nelson <dnelson at allantgroup dot com> (modified)
No objections: developers@
Descriptions reviewed by: gnn
MFC after: 1 week
suid application. The problem is that Linux applications using old Linux
threads (pre-NPTL) use signal 32 (linux SIGRTMIN) for communication between
thread-processes. If such an linux application is installed suid or sgid
and security.bsd.conservative_signals=1 (default), then permission will be
denied to send such a signal and the application will freeze.
I believe the same will be true for native applications that use libthr,
since libthr uses SIGTHR for implementing conditional variables.
PR: 72922
Submitted by: Andriy Gapon <avg@icyb.net.ua>
MFC after: 2 weeks
list, set `curr_callout' to NULL. This ensures that we won't attempt
to cancel the current callout if the original callout structure
gets recycled while we wait to acquire Giant.
This is reported to fix an intermittent syscons problem that was
introduced by revision 1.96.
do not need to perform an extra memory fetch in the Packet (Mbuf+Cluster)
constructor to initialize the reference counter anymore. The reference
counts are located in a separate memory region (in the slab header,
because this zone is UMA_ZONE_REFCNT), so the memory fetch resulted very
often in a cache miss. Additionally, and perhaps more significantly,
optimize the free mbuf+cluster (packet) case, which is very common, to
no longer require an atomic operation on free (to verify the reference
counter) if the reference on the cluster has never been increased (also
very common). Reduces an atomic on mbuf free on average.
Original patch submitted by: Gerrit Nagelhout <gnagelhout@sandvine.com>
behaviour of chflags within a jail. If set to 0 (the default), then a
jailed root user is treated as an unprivileged user; if set to 1, then
a jailed root user is treated the same as an unjailed root user.
This is necessary to allow "make installworld" to work inside a jail,
since it attempts to manipulate the system immutable flag on certain
files.
Discussed with: csjp, rwatson
MFC after: 2 weeks
Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.
Remove all the background write related stuff from the normal bufwrite.
This drags the softdep_move_dependencies() back into FFS.
Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks and doing non-lock-safe things like
copyout() with the data anyways. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatability system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.
callout is first initialised, using a new function callout_init_mtx().
The callout system will acquire this mutex before calling the callout
function and release it on return.
In addition, the callout system uses the mutex to avoid most of the
complications and race conditions inherent in asynchronous timer
facilities, so mutex-protected callouts have much simpler semantics.
As long as the mutex is held when invoking callout_stop() or
callout_reset(), then these functions will guarantee that the callout
will be stopped, even if softclock() had already begun to process
the callout.
Existing Giant-locked callouts will automatically pick up the new
race-free semantics. This should close a number of race conditions
in the USB code and probably other areas of the kernel too.
There should be no change in behaviour for "MP-safe" callouts; these
still need to use the techniques mentioned in timeout(9) to avoid
race conditions.
frequency as a percentage of the base rate and do not change the base
rate directly. The cpufreq framework combines these with absolute drivers
to produce synthesized levels made of one or more settings.
select the CPU frequency level (say for cooling). The driver interface
allows hardware drivers to announce themselves as capable of adjusting
an individual frequency setting.
- Add buffer size limitations (overflow will not be possible anymore).
- Add 'visible' option, which will allow for passphrase reading in the
future.
- Remove special treatment of '@' and '#', those two are only confusing.
Discussed with: rwatson
MFC after: 2 weeks
tond and not fromnd. This could lead us to leak Giant, or unlock it
twice, depending on the filesystems involved. renames within a single
filesystem would not have caused any problems.
Sponsored by: Isilon Systems, Inc.
all reserved, as the lisence makes clear), and strike the third clause
(now this is a 2-clause liberal BSDL as are the rest of files I hold
copyright over).
copies arguments into the kernel space and one that operates
completely in the kernel space;
o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.
Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks
Add minor2unit() in addition to dev2unit() and unit2minor().
If it wasn't such a hazzle we should redefine minor numbers in
the kernel without the gap for the major number, but it's not worth
the bother (yet).
a process return to userspace if it had pending GEOM events.
We need to have the same check in the exit pass to catch the case
where a GEOM related filedescriptor is not explicitly closed by
the process.
Bumped into by: people using dd(1) to build releases, nanobsd etc.
from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrapper
of that syscall.
MFC after: 2 weeks
pops data from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrappers
of those syscalls.
MFC after: 2 weeks
missed that when the vnode bypass was introduced.
Deal with zero length transfers before we even get to fo_ops->fo_read().
Found by: Slawa Olhovchenkov <slwzxy.spb.ru@zxy.spb.ru>
PR: 75758
the name Sande^H^H^H^H^Hvnode_create_vobject().
Make the new function take a size argument which removes the need for
a VOP_STAT() or a very pessimistic guess for disks.
Call that new function from vop_stdcreatevobject().
Make vnode_pager_alloc() private now that its only user came home.
short to unsigned short.
- Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX.
Before this change setting somaxconn to smth above 32767 and calling
listen(fd, -1) lead to a socket, which doesn't accept connections at all.
Reviewed by: rwatson
Reported by: Igor Sysoev
- Remove some KASSERTs which are invalid if the appropriate lock is
not held.
- Slightly restructure bremfree() so that it is more sane.
- Change the flush code in bdwrite() to avoid acquiring a mutex
whenever possible.
- Change the flush code in bdwrite() to avoid holding the bufobj mutex
while calling buf_countdeps(). This introduces a lock-order
relationship with the softdep lock that can not otherwise be resolved.
- Don't set B_DONE until bufdone() is complete, otherwise another
processor may believe the buf is done before it is.
- Only acquire Giant if the caller has set b_iodone. Don't grab giant
around normal bufdone() calls.
Sponsored By: Isilon Systems, Inc.
to off.
- Protect access to mnt_kern_flag with the mointpoint mutex.
- Remove some KASSERTs which are not legal checks without the appropriate
locks held.
- Use VCANRECYCLE() rather than rolling several slightly different
checks together.
- Return from vtryrecycle() with a recycled vnode rather than a locked
vnode. This simplifies some locking.
- Remove several GIANT_REQUIRED lines.
- Add a few KASSERTs to help with INACT debugging.
Sponsored By: Isilon Systems, Inc.
- Protect access to mnt_kern_flag with the mountpoint mutex.
- Use the appropriate nd flags to deal with giant in vn_open_cred().
We currently determine whether the caller is mpsafe by checking
for a valid fdidx. Any caller coming from user-space is now
mpsafe and supplies a valid fd. No kenrel callers have been
converted to mpsafe, so this check is sufficient for now.
- Use VFS_LOCK_GIANT instead of manual giant acquisition where
appropriate.
Sponsored By: Isilon Systems, Inc.
require it.
- Track the status of Giant with the nd flag HASGIANT.
- Release giant on return of namei() callers are not marked MPSAFE as
they already own giant.
Sponsored By: Isilon Systems, Inc.
vnode lock is much simpler than I originally thought it would be.
Now, the cache lock is always acquired before the vnode lock.
- Provide some gotos in __getcwd() to simplify the unlocking a bit.
- Move Giant acquisition down into __getcwd().
Sponsored By: Isilon Systems, Inc.
if the lockmgr interlock is dropped after the caller's interlock
is dropped.
- Change some lockmgr KTRs to be slightly more helpful.
Sponsored By: Isilon Systems, Inc.
witness_proc_has_locks(), as they are unused, which results in a compiler
error. This problem was introduced with the implementation of "show
alllocks".
Spotted by: Artem Kuchin <matrix at itlegion dot ru>
designed to help detect tamper-after-free scenarios, a problem more
and more common and likely with multithreaded kernels where race
conditions are more prevalent.
Currently MemGuard can only take over malloc()/realloc()/free() for
particular (a) malloc type(s) and the code brought in with this
change manually instruments it to take over M_SUBPROC allocations
as an example. If you are planning to use it, for now you must:
1) Put "options DEBUG_MEMGUARD" in your kernel config.
2) Edit src/sys/kern/kern_malloc.c manually, look for
"XXX CHANGEME" and replace the M_SUBPROC comparison with
the appropriate malloc type (this might require additional
but small/simple code modification if, say, the malloc type
is declared out of scope).
3) Build and install your kernel. Tune vm.memguard_divisor
boot-time tunable which is used to scale how much of kmem_map
you want to allott for MemGuard's use. The default is 10,
so kmem_size/10.
ToDo:
1) Bring in a memguard(9) man page.
2) Better instrumentation (e.g., boot-time) of MemGuard taking
over malloc types.
3) Teach UMA about MemGuard to allow MemGuard to override zone
allocations too.
4) Improve MemGuard if necessary.
This work is partly based on some old patches from Ian Dowse.
and always has been, but the system call itself returns
errno in a register so the problem is really a function of
libc, not the system call.
Discussed with : Matthew Dillion <dillon@apollo.backplane.com>
they both happen before pipe backing allocation occurs. Previously,
a pipe memory shortage would cause a panic due to a KNOTE call
on an uninitialized si_note.
Reported by: Peter Holm
MFC after: 1 week
unhappiness lately.
As far as I can tell, no files that have made it safely to disk
have been endangered, but stuff in transit has been in peril.
Pointy hat: phk
and KASSERT coverage.
After this check there is only one "nasty" cast in this code but there
is a KASSERT to protect against the wrong argument structure behind
that cast.
Un-inlining the meat of VOP_FOO() saves 35kB of text segment on a typical
kernel with no change in performance.
We also now run the checking and tracing on VOP's which have been layered
by nullfs, umapfs, deadfs or unionfs.
Add new (non-inline) VOP_FOO_AP() functions which take a "struct
foo_args" argument and does everything the VOP_FOO() macros
used to do with checks and debugging code.
Add KASSERT to VOP_FOO_AP() check for argument type being
correct.
Slim down VOP_FOO() inline functions to just stuff arguments
into the struct foo_args and call VOP_FOO_AP().
Put function pointer to VOP_FOO_AP() into vop_foo_desc structure
and make VCALL() use it instead of the current offsetoff() hack.
Retire vcall() which implemented the offsetoff()
Make deadfs and unionfs use VOP_FOO_AP() calls instead of
VCALL(), we know which specific call we want already.
Remove unneeded arguments to VCALL() in nullfs and umapfs bypass
functions.
Remove unused vdesc_offset and VOFFSET().
Generally improve style/readability of the generated code.
up its pending error state, which may be set in some rare conditions resulting
in connect() syscall returning that bogus error and making application believe
that attempt to change association has failed, while it has not in fact.
There is sockets/reconnect regression test which excersises this bug.
MFC after: 2 weeks
errno can be tampered potentially by nested signal handle.
Now all error codes are returned in negative value, positive value are
reserved for future expansion.
TAILQ_FOREACH_SAFE().
Loose the error pointer argument and return any errors the normal way.
Return EAGAIN for the case where more work needs to be done.
I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:
The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.
Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.
If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.
Discussed with: rwatson
before deciding to do more expensive locking to account for process
exit. This acceptable minor race avoids two mutex operations in
that highly common case of accounting not being enabled.
MFC after: 2 weeks
turn it back on. Specifically, the actual changes are now less intrusive
in that the _get_spin_lock() and _rel_spin_lock() macros now have their
contents changed for UP vs SMP kernels which centralizes the changes.
Also, UP kernels do not use _mtx_lock_spin() and no longer include it. The
UP versions of the spin lock functions do not use any atomic operations,
but simple compares and stores which allow mtx_owned() to still work for
spin locks while removing the overhead of atomic operations.
Tested on: i386, alpha
unloaded, cleanup, or return ebusy of that's inconvenient.' The
default module hanlder for newbus will now call this when we get a
MOD_QUIESCE event, but in the future may call this at other times.
This shouldn't change any actual behavior until drivers start to use it.
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.
Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.
Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
is the case for most other sysctls in the System V IPC message queue
implementation.
PR: 75541
Submitted by: Sergiy Vyshnevetskiy <serg at vostok dot net>
MFC after: 2 weeks
more general than previous. It also lets me implement cancelable point
in thread library. Also in theory, umtx_lock and umtx_unlock can
be implemented by using umtx_wait and umtx_wake, all atomic operations
can be done in userland without kernel's casuptr() function.
of lock types in the kernel. This results in an increase of witness
data usage from ~145k to ~280k on i386 for kernels with
'options WITNESS'.
- Remove the unused witness malloc bucket.
Submitted by: Michal Mertl mime at traveller dot cz (1)
- Remove the sched_add wrapper that used sched_add_internal() as a backend.
Its only purpose was to interpret one flag and turn it into an int. Do
the right thing and interpret the flag in sched_add() instead.
- Pass the flag argument to sched_add() to kseq_runq_add() so that we can
get the SRQ_PREEMPT optimization too.
- Add a KEF_INTERNAL flag. If KEF_INTERNAL is set we don't adjust the SLOT
counts, otherwise the slot counts are adjusted as soon as we enter
sched_add() or sched_rem() rather than when the thread is actually placed
on the run queue. This greatly simplifies the handling of slots.
- Remove the explicit prevention of migration for ithreads on non-x86
platforms. This was never shown to have any real benefit.
- Remove the unused class argument to KSE_CAN_MIGRATE().
- Add ktr points for thread migration events.
- Fix a long standing bug on platforms which don't initialize the cpu
topology. The ksg_maxid variable was never correctly set on these
platforms which caused the long term load balancer to never inspect
more than the first group or processor.
- Fix another bug which prevented the long term load balancer from working
properly. If stathz != hz we can't expect sched_clock() to be called
on the exact tick count that we're anticipating.
- Rearrange sched_switch() a bit to reduce indentation levels.
and threads currently holding sleep mutexes (and spin mutexes for
curthread). This can be quite useful in looking for a lock condition
summary for a system, as it avoids manually iterating through threads
and processes to find all the interesting locks.
NB: "alllocks" is up there with "lockedvnods" for a bad argument for
show.
MFC after: 2 weeks
lower the priority of the returning thread to a user priority before
calling into thread_userret() which would call wakeup() which in turn would
cause the returning thread to eventually context switch rather than
completing its slice. Allowing this thread to complete its slice first
yields a 15% performance improvement in super-smack on my dual opteron with
4BSD.
substitute for a global mutex protecting the socket count and
generation number.
The observation that soreceive_rcvoob() can't return an mbuf
chain is a property, not a bug, so remove the XXXRW.
In sorflush, s/existing/previous/ for code when describing prior
behavior.
For SO_LINGER socket option retrieval, remove an XXXRW about why
we hold the mutex: this is correct and not dubious.
MFC after: 2 weeks
unnecessary use of a global variable and simplify the return case.
While here, use ()'s around return values.
In sodealloc(), remove a comment about why we bump the gencnt and
decrement the socket count separately. It doesn't add
substantially to the reading, and clutters the function.
MFC after: 2 weeks
occur between a reader and a writer that results in a panic upon close,
e.g.,
"panic: sbflush_locked: cc 4 || mb 0xffffff0052afa400 || mbcnt 0"
Reviewed by: rwatson@
MFC after: 2 weeks
call mmap() to create a shared space, and then initialize umtx on it,
after that, each thread in different processes can use the umtx same
as threads in same process.
2. introduce a new syscall _umtx_op to support timed lock and condition
variable semantics. also, orignal umtx_lock and umtx_unlock inline
functions now are reimplemented by using _umtx_op, the _umtx_op can
use arbitrary id not just a thread id.
nice of 0. Doing so can cause an infinite loop because they should be
running, but a nice -20 process could prevent them from doing so.
- Add a new flag KEF_PRIOELEV to flag a thread that has had its priority
elevated due to priority propagation. If a thread has had its priority
elevated, we assume that it must go on the current queue and it must
get a slice.
- In sched_userret() if our priority was elevated and we shouldn't have
a timeslice, yield here until we should.
Found/Tested by: glebius
which holds on to just the data structure and the mutex. (The
existing refcount (fd_refcnt) holds onto the open files in the
descriptor.)
The fd_holdcnt is protected by fdesc_mtx, fd_refcnt by FILEDESC_LOCK.
Add fdhold(struct proc *) which gets a hold on the filedescriptors of
the specified proc..
Add fddrop(struct filedesc *) which drops the fd_holdcnt and if zero
destroys the mutex and frees the memory.
Initialize the fd_holdcnt to one in fdinit(). Normal operations on
the filedesc structure will not change it.
In fdfree() use fddrop() to dispose of the mutex and structure. Hold
the FILEDESC_LOCK() until we have cleaned out the contents and carefully
set the fields to null values during cleanup.
Use fdhold()/fddrop() in mountcheckdirs() and sysctl_kern_file().
for ensuring that a process' filedesc is not shared with anybody.
Use it in the two places which previously had private implmentations.
This collects all fd_refcnt handling in kern_descrip.c
nice value above 0, set it to 0 so that it may proceed with haste.
This is especially important on ULE, where adjusting the priority
does not guarantee that a thread will be granted a greater time slice.
specifically, vm_pgmoveco():
1. If vm_pgmoveco() sleeps on a busy page, it must redo the look up
because the page may have been freed.
2. If the receive buffer is copy-on-write due to, for example, a fork,
then although the first vm object in the shadow chain may not contain
a page there may still be one from a backing object that is mapped.
Thus, a pmap_remove() is required for the new page rather than the
backing object's page to been seen by the application.
Also, add some comments to vm_pgmoveco() and update some assertions.
Tested by: ken@