Non-NULL timeouts were copied in improperly and could produce failures
due to incompatible data structures.
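A hedged illustration of the kind of conversion involved (the struct layout
and helper name below are hypothetical, not the committed code; the real
kernel path uses its own copyin/compat machinery):

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical 32-bit userland timeout layout. */
    struct timespec32 {
        int32_t tv_sec;
        int32_t tv_nsec;
    };

    /* Convert field by field instead of copying the incompatible
     * layout in directly. */
    static void
    timespec32_to_timespec(const struct timespec32 *ts32, struct timespec *ts)
    {
        ts->tv_sec = ts32->tv_sec;
        ts->tv_nsec = ts32->tv_nsec;
    }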
Reviewed by: kib
MFC after: 3 days
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14587
Normally, after grabbing the lock, it has to be verified that we got the
right one to begin with. However, if we are recursing, the lock cannot have
changed, so the check can be avoided. In particular this avoids an extra
lock read in the non-recursing case when the check finds the lock has changed.
While here, avoid an irq trip if this happens.
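A minimal sketch of the pattern (hypothetical names and userspace stand-ins,
not the kernel code): the re-check after acquisition is only needed when the
lock pointer may have been switched, which cannot happen while recursing.

    struct lockobj { int owned; };
    struct obj { struct lockobj *lock; };

    static void acquire(struct lockobj *l) { l->owned = 1; }  /* stand-in */
    static void release(struct lockobj *l) { l->owned = 0; }  /* stand-in */

    static void
    lock_obj(struct obj *o, int recursing)
    {
        for (;;) {
            struct lockobj *l = o->lock;

            acquire(l);
            /* When recursing, skip the re-read entirely. */
            if (recursing || o->lock == l)
                return;
            release(l);   /* the lock was switched underneath us; retry */
        }
    }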
Tested by: pho (previous version)
The code already pays the cost of reading the lock to obtain the waiters
flag. Checking whether there is more than one reader is not a problem and
avoids dirtying the line.
This also fixes a small corner case: if waiters were to show up between
reading the flag and upgrading the lock, the operation would fail even
though it should not. No correctness change here though.
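A hedged sketch of the single-reader check with C11 atomics (the lock-word
encoding and names here are made up, not the kernel's rwlock): since the lock
word is read anyway, the upgrade CAS is only attempted when we are the sole
reader, so a doomed CAS does not dirty the cache line.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define RW_WRITER   0x1u
    #define RW_READER   0x10u   /* exactly one reader, no waiter bits */

    struct rwsketch {
        _Atomic unsigned word;
    };

    static bool
    try_upgrade(struct rwsketch *rw)
    {
        unsigned v = atomic_load_explicit(&rw->word, memory_order_relaxed);

        /* More than one reader: the CAS below cannot succeed, so
         * return without writing to the line. */
        if (v != RW_READER)
            return (false);
        return (atomic_compare_exchange_strong_explicit(&rw->word, &v,
            RW_WRITER, memory_order_acquire, memory_order_relaxed));
    }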
If there were exactly rowner_retries/asx_retries (by default: 10) transitions
between read and write state and the waiters still did not get the lock, the
next owner -> reader transition would result in the code correctly falling
back to turnstile/sleepq where it would incorrectly think it was waiting
for a writer and decide to leave the turnstile/sleepq to loop back. From this
point it would keep taking turnstile/sleepq round trips until the lock got
released.
The bug sometimes manifested itself in stalls during -j 128 package builds.
Refactor the code to fix the bug, while here remove some of the gratuitous
differences between rw and sx locks.
The main routine takes 8 args, 3 of which are almost the same for most uses.
This in particular pushes it above the limit of 6 arguments passable through
registers on amd64, making it impossible to tail call.
This is a prerequisite for further cleanups.
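A hedged illustration (hypothetical names, not the actual lock routines):
folding the mostly-constant arguments into a struct keeps the forwarding call
within the six integer argument registers on amd64, so the compiler can emit
a tail call instead of setting up a new frame.

    struct lock_ctx {           /* hypothetical argument bundle */
        void *lock;
        const char *file;
        int line;
    };

    static void
    hard_path(struct lock_ctx *ctx, unsigned long v, int flags)
    {
        (void)ctx; (void)v; (void)flags;   /* placeholder body */
    }

    void
    lock_wrapper(struct lock_ctx *ctx, unsigned long v, int flags)
    {
        /* Three register arguments forwarded unchanged: eligible for a
         * jmp-style tail call. */
        hard_path(ctx, v, flags);
    }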
Tested by: pho
This deliberately breaks the API in preparation for future syscall
revisions which will remove these nonstandard members.
In an exp-run a single port (devel/qemu-user-static) was found to
use them, which it did because it emulates system calls. This has
been fixed in the ports tree.
PR: 224443 (exp-run)
Reviewed by: kib, jhb (previous version)
Exp-run by: antoine
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14490
The first call is used to gauge how much space is needed. Just computing
the size instead of generating the output makes it possible to avoid taking
the proctree lock.
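The pattern, as a self-contained userspace sketch (hypothetical helper, not
the kernel code): a NULL buffer means "just measure", so the sizing pass can
run without the lock that the output pass would otherwise need.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static size_t
    emit_strings(char *buf, size_t bufsz, const char *const *strv)
    {
        size_t len = 0;

        for (; *strv != NULL; strv++) {
            size_t n = strlen(*strv) + 1;

            if (buf != NULL && len + n <= bufsz)
                memcpy(buf + len, *strv, n);
            len += n;
        }
        return (len);
    }

    int
    main(void)
    {
        const char *strv[] = { "cc", "-O2", "main.c", NULL };
        size_t need = emit_strings(NULL, 0, strv);  /* sizing call */
        char *buf = malloc(need);

        emit_strings(buf, need, strv);              /* output call */
        printf("%zu bytes needed\n", need);
        free(buf);
        return (0);
    }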
Use it to regulate page daemon output.
This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.
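Purely as an illustration of regulating output to track demand, here is a toy
proportional-integral feedback sketch; this is an assumption about the shape
of the mechanism, not the committed algorithm, and all names are hypothetical.

    #include <stdio.h>

    struct regulator {
        double setpoint;    /* desired free pages */
        double integral;
        double kp, ki;      /* gains */
    };

    static double
    regulator_step(struct regulator *r, double measured)
    {
        double error = r->setpoint - measured;

        r->integral += error;
        return (r->kp * error + r->ki * r->integral);
    }

    int
    main(void)
    {
        struct regulator r = { .setpoint = 1000, .kp = 0.5, .ki = 0.1 };
        double free_pages = 400;

        for (int i = 0; i < 5; i++) {
            double target = regulator_step(&r, free_pages);

            printf("scan target: %.0f pages\n", target);
            free_pages += target * 0.5;   /* pretend some are reclaimed */
        }
        return (0);
    }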
Reviewed by: bsdimp
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14402
search for a thread to steal inside a critical section. Since this
allows the search to be preempted, restart the search if preemption
happens since the search results found earlier may no longer be
valid.
Decrease the latency of starting a thread that may be assigned to
this CPU during the search by polling for incoming threads during
the search and switching to that thread instead of continuing the
search.
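A rough sketch of the control flow (hypothetical fields, no real scheduler
structures): the search polls our own queue and bails out to run incoming
work, and restarts from scratch if a preemption made the partial results
stale.

    struct cpuq {           /* hypothetical per-CPU queue state */
        int load;           /* runnable threads queued here */
        int switchcnt;      /* bumped on every context switch */
    };

    static int
    steal_search(struct cpuq *self, struct cpuq *peers, int npeers)
    {
        int seen;

    restart:
        seen = self->switchcnt;
        for (int i = 0; i < npeers; i++) {
            if (self->load > 0)
                return (-1);    /* work arrived here; go run it */
            if (self->switchcnt != seen)
                goto restart;   /* we were preempted; results stale */
            if (peers[i].load > 1)
                return (i);     /* candidate CPU worth locking */
        }
        return (-2);            /* nothing to steal */
    }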
Test for stale search results and restart the search before going
through the expense of calling tdq_lock_pair(). Retry some tests
after grabbing the locks since things may have changed while waiting
to get both locks.
Eliminate special case handling for stealing from an SMT peer that
uses 1 as the steal threshold. This can only succeed if a thread
has been assigned but our SMT peer has not yet started executing
it. This is quite rare and when it happens the other SMT thread
is generally waiting for the same tdq lock that we hold. Basically
both SMT threads are racing to grab the same spin lock.
Add the kern.sched.always_steal knob from a ULE patch by jeff@.
Incorporate another idea from Jeff's ULE patch. If sched_switch()
detects that the CPU is about to go idle, try to steal a thread
before switching to the idle thread. Since the search for a thread
to steal has to be done inside a critical section in this context,
limit the impact on latency by adding the knob kern.sched.trysteal_limit
to limit the topological distance of the search and don't restart
the search if we detect stale results. If this search can't find
a stealable thread, the idle loop can do a more complete search.
Also poll for threads being assigned to this CPU during the search
and switch to them instead of continuing the search. This change
is responsible for the majority of the improvement in parallel
buildworld times.
In sched_balance_group() change the minimum threshold for stealing
a thread from 1 to 2. Poaching a newly assigned thread from a CPU
that is waking up but hasn't yet switched to that thread from idle is
likely very rare and is likely to have the same lock race as is
seen when stealing threads in the idle loop. Also use tdq_notify()
to kick the destination CPU instead of always sending an IPI.
Update a stale comment; the number of transferable threads is not
calculated.
Reviewed by: kib (earlier version)
Comments by: avg, jeff, mav
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12130
Remove the unused syscall_(de)register() functions in favor of the
better documented and easier to use syscall_helper_(un)register(9)
functions.
The default and freebsd32 versions differed in which array of struct
sysent they used and in a few missing updates to the 32-bit code as
features were added to the main code.
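A hedged sketch of the replacement interface in use; the headers, macros and
flag below are recalled from syscall_helper_register(9) and should be treated
as assumptions rather than a verified example, and the module is hypothetical.

    #include <sys/types.h>
    #include <sys/sysent.h>     /* assumed location of the helper API */

    static struct syscall_helper_data mymod_syscalls[] = {
        SYSCALL_INIT_HELPER(mymod_syscall),   /* hypothetical syscall */
        SYSCALL_INIT_LAST
    };

    static int
    mymod_modinit(void)
    {
        /* One table registers both the native and compat entries. */
        return (syscall_helper_register(mymod_syscalls, SY_THR_STATIC_KLD));
    }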
Reviewed by: cem
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14337
There is a proctree -> allproc ordering established.
Most of the time it is either xlock -> xlock or slock -> slock.
On fork however there is a slock -> xlock pair which results in
pathological wait times due to threads keeping proctree held for
reading and all waiting on allproc. Switch this to xlock -> xlock.
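A userspace illustration of the ordering change (pthread rwlocks standing in
for the proctree and allproc sx locks; the function name is hypothetical):

    #include <pthread.h>

    static pthread_rwlock_t proctree_lk = PTHREAD_RWLOCK_INITIALIZER;
    static pthread_rwlock_t allproc_lk = PTHREAD_RWLOCK_INITIALIZER;

    static void
    fork_lock_section(void)
    {
        pthread_rwlock_wrlock(&proctree_lk);  /* was rdlock: slock -> xlock */
        pthread_rwlock_wrlock(&allproc_lk);   /* proctree -> allproc order kept */
        /* ... link the new process ... */
        pthread_rwlock_unlock(&allproc_lk);
        pthread_rwlock_unlock(&proctree_lk);
    }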
A longer term fix would be to get rid of proctree in this place to begin with.
Right now it is necessary to walk the session/process group lists to
determine which id is free. The walk can be avoided e.g. with bitmaps.
The exit path used to have one place which dealt with allproc and
then with proctree. Move the allproc acquire into the section protected
by proctree. This reduces contention against threads waiting on proctree
in the fork codepath - the fork proctree holder does not have to wait
for allproc as often.
Finally, move tidhash manipulation outside of the area protected by
either of these locks. The removal from the hash was already unprotected.
There is no legitimate reason to look up thread ids for a process still
under construction.
This results in about 50% wait time reduction during -j 128 package build.
Provide multiple clean queues partitioned into 'domains'. Each domain manages
its own bufspace and has its own bufspace daemon. Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.
Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.
Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.
Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.
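A minimal sketch of the reservation idea with C11 atomics (hypothetical names
and limit, not the committed code): a single fetch-add claims the space and is
backed out on overshoot, instead of a compare-and-swap loop that restarts
under contention.

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic long bufspace;               /* bytes currently reserved */
    static const long bufspace_hi = 1L << 20;   /* hypothetical limit */

    static bool
    bufspace_reserve_sketch(long size)
    {
        long prev = atomic_fetch_add(&bufspace, size);

        if (prev + size > bufspace_hi) {
            atomic_fetch_sub(&bufspace, size);  /* undo; caller must wait */
            return (false);
        }
        return (true);
    }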
Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14274
The race manifested itself mostly in terms of crashes with "spin lock
held too long".
Relevant parts of respective code paths:
exit:                                   reap:
PROC_LOCK(p);
PROC_SLOCK(p);
p->p_state == PRS_ZOMBIE
PROC_UNLOCK(p);
                                        PROC_LOCK(p);
/* exit work */
                                        if (p->p_state == PRS_ZOMBIE) /* true */
                                                proc_reap()
                                                        free proc
/* more exit work */
PROC_SUNLOCK(p);
Thus a still exiting process is reaped.
Prior to the change the zombie check was followed by a slock/sunlock trip
which prevented the problem.
Even the code prior to this commit has a bug: the proc is still accessed for
statistics collection purposes. However, the severity is rather low and
the bug may be fixed in a future commit.
Reported by: many
Tested by: allanjude
The primitive can be used to wait for the lock to be released. Intended
usage is for locks in structures which are about to be freed.
The benefit is avoiding the interrupt enable/disable trip and the atomic op
needed to grab the lock, plus a shorter wait if the lock is held (since there
is no worry that someone will contend on the lock, re-reads can be more
aggressive).
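A minimal C11 sketch of such a primitive (illustrative only, not the kernel
implementation): spin until the lock word reads as unowned, never writing to
it and never touching the interrupt state.

    #include <stdatomic.h>

    static void
    spin_wait_unlocked(_Atomic unsigned long *lockword, unsigned long unowned)
    {
        /* Pure reads: no atomic read-modify-write, nothing to contend on,
         * so the re-read loop can be as aggressive as we like. */
        while (atomic_load_explicit(lockword, memory_order_acquire) != unowned)
            ;
    }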
Briefly discussed with: kib
The suspension counter needs synchronisation through slock, but we do not
need the lock just to check whether inspecting the counter is necessary to
begin with. In the common case it is not, so avoid taking the lock if possible.
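The shape of the check, as a hedged sketch with pthread stand-ins (field and
lock names hypothetical; the unlocked read is only a shape illustration, the
kernel's equivalent read is safe): read the counter unlocked first and take
slock only when there is something to synchronise against.

    #include <pthread.h>

    struct procsk {             /* hypothetical */
        int suspcount;
        pthread_mutex_t slock;
    };

    static int
    has_suspended_threads(struct procsk *p)
    {
        int rv;

        if (p->suspcount == 0)  /* common case: no lock needed */
            return (0);
        pthread_mutex_lock(&p->slock);
        rv = (p->suspcount != 0);
        pthread_mutex_unlock(&p->slock);
        return (rv);
    }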
Reviewed by: kib
Tested by: pho
The description of kern.ipc.shmsegs was wrong since 2005. I updated the
others (which were more correct) to match.
PR: 225933
Reviewed by: cem
MFC after: 3 days
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14391
With the option compiled into the kernel, both sx and rw shared ops would
always go to the slow path, which added avoidable overhead even when the
facility is disabled.
Furthermore, the increased time spent doing uncontested shared lock acquires
would be bogusly added to the total wait time, somewhat skewing the results.
Restore the old behaviour of going to the slow path only when profiling is
enabled.
This change is a no-op for kernels without LOCK_PROFILING (which is the
default).
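A hedged outline of the restored behaviour (hypothetical names and C11
atomics, not the kernel's sx/rw code): the inline fast path is attempted
unless profiling is actually enabled, and only failures fall through to the
slow path.

    #include <stdatomic.h>

    struct sxsketch {
        _Atomic unsigned state;     /* 0 == unlocked */
    };

    static int lock_prof_enable;    /* runtime knob, off by default */

    static void slock_hard(struct sxsketch *sx);

    static inline void
    slock(struct sxsketch *sx)
    {
        unsigned v = 0;

        if (!lock_prof_enable &&
            atomic_compare_exchange_strong(&sx->state, &v, 1))
            return;         /* uncontested fast path */
        slock_hard(sx);     /* contention, or profiling enabled */
    }

    static void
    slock_hard(struct sxsketch *sx)
    {
        unsigned v;

        do {
            v = 0;          /* spin until the lock is free again */
        } while (!atomic_compare_exchange_weak(&sx->state, &v, 1));
    }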
Add const to new kern_ functions and push down as required.
Reviewed by: rwatson
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14174
sbt is the time in the future at which tsleep_sbt() is expected to complete.
sbtt is the current time. Depending on the precision set with the sysctl
kern.timecounter.alloweddeviation, the start time may be incremented by
tc_tick_sbt. The same increment is needed for the current time sbtt before
calculating the difference. The impact of missing this increment is that rmtp
may increase by one tc_tick_sbt on every early [EINTR] return. If the same
struct is passed in for rqtp as for rmtp, this can result in rqtp effectively
incrementing by tc_tick_sbt and sleeping longer than originally intended.
This problem was introduced in r247797.
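A hedged sketch of the corrected arithmetic (plain int64 standing in for
sbintime_t; the function and flag are hypothetical, the variable names follow
the description above): whatever tc_tick_sbt adjustment was applied to the
deadline sbt must also be applied to the current time sbtt before taking the
difference that feeds rmtp.

    #include <stdint.h>

    typedef int64_t sbt_t;      /* stand-in for sbintime_t */

    static sbt_t
    remaining_sleep(sbt_t sbt, sbt_t sbtt, sbt_t tc_tick_sbt, int sbt_adjusted)
    {
        if (sbt_adjusted)
            sbtt += tc_tick_sbt;    /* mirror the adjustment made to sbt */
        return (sbt > sbtt ? sbt - sbtt : 0);
    }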
Reviewed by: kib, markj, vangyzen (all on an older version of the test)
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D14362
This works similarly to the existing gzip compression support, but
zstd is typically faster and gives better compression ratios.
Support for this functionality must be configured by adding ZSTDIO to
one's kernel configuration file. dumpon(8)'s new -Z option is used to
configure zstd compression for kernel dumps. savecore(8) now recognizes
and saves zstd-compressed kernel dumps with a .zst extension.
Submitted by: cem (original version)
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13101,
https://reviews.freebsd.org/D13633