Lockstat requires checking if it is enabled and if so, calling a 6 argument
function. Further, determining whether to call it on unlock requires
pre-reading the lock value.
This is problematic in at least 3 ways:
- more branches in the hot path than necessary
- additional cacheline ping pong under contention
- bigger code
Instead, check first if lockstat handling is necessary and if so, just fall
back to regular locking routines. For this purpose a new macro is introduced
(LOCKSTAT_PROFILE_ENABLED).
LOCK_PROFILING uninlines all primitives. Fold in the current inline lock
variant into the _mtx_lock_flags to retain the support. With this change
the inline variants are not used when LOCK_PROFILING is defined and thus
can ignore its existence.
This results in:
text data bss dec hex filename
22259667 1303208 4994976 28557851 1b3c21b kernel.orig
21797315 1303208 4994976 28095499 1acb40b kernel.patched
i.e. about 3% reduction in text size.
A remaining action is to remove spurious arguments for internal kernel
consumers.
Shared locking routines explicitly read the value and test it. If the
change attempt fails, they fall back to a regular function which would
retry in a loop.
The problem is that with many concurrent readers the risk of failure is pretty
high and even the value returned by fcmpset is very likely going to be stale
by the time the loop in the fallback routine is reached.
Uninline said primitives. It gives a throughput increase when doing concurrent
slocks/sunlocks with 80 hardware threads from ~50 mln/s to ~56 mln/s.
Interestingly, rwlock primitives are already not inlined.
The found value is passed to locking routines in order to reduce cacheline
accesses.
mtx_unlock grows an explicit check for regular unlock. On ll/sc architectures
the routine can fail even if the lock could have been handled by the inline
primitive.
Discussed with: jhb
Tested by: pho (previous version)
witness_warn() either breaks into the debugger or panics the system, so its
output should go to the console regardless of the witness(4) output channel
configuration.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.
Reported by: kan, lidl
for EVFILT_READ at the point of the check not when the event is registers.
This fixes a problem with asio when accepting a connection.
Reviewed by: kib@, Scott Mitchell
of their sys_*() counterparts. The svr4 is left unchanged.
Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9379
The checks have quadratic complexity over a number of advisory locks
active for a file and that could be a lot. What's the worse is that the
checks are done while holding ls_lock. That could lead to a long a very
long backlog and performance degradation even if all requested locks are
compatible (e.g. all shared locks).
The checks used to be under INVARIANTS.
Discussed with: kib
MFC after: 2 weeks
Sponsored by: Panzura
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.
Reviewed by: ed@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9367
Add additionally safety and overflow checks to clock_ts_to_ct and the
BCD routines while we're here.
Perform a safety check in sys_clock_settime() first to avoid easy local
root panic, without having to propagate an error value back through
dozens of APIs currently lacking error returns.
PR: 211960, 214300
Submitted by: Justin McOmie <justin.mcomie at gmail.com>, kib@
Reported by: Tim Newsham <tim.newsham at nccgroup.trust>
Reviewed by: kib@
Sponsored by: Dell EMC Isilon, FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9279
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reported by: Matt Macy <mmacy@nextbsd.org>
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reviewed by: Matt Macy <mmacy@nextbsd.org>
critical_exit().
Based on the discussion with: jhb
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: D9276
MFC after: 1 week
Stop testing for LK_RETRY and error multiple times. Also postpone the
VI_DOOMED until after LK_RETRY was seen as it reads from the vnode.
No functional changes.
configtimer().
During normal operation "state->nextcallopt" will always be less than
or equal to "state->nextcall" and checking only "state->nextcallopt"
before calling "callout_process()" is sufficient. However when
"configtimer()" is called a race might happen requiring both of these
binary times to be checked.
Short description of race:
1) A configtimer() call will reset both "state->nextcall" and
"state->nextcallopt" to the same binary time.
2) If a "callout_reset()" call happens between "configtimer()" and the
next "callout_process()" call, "state->nextcallopt" will get updated
and "state->nextcall" will remain at the current time. Refer to logic
inside cpu_new_callout().
3) getnextcpuevent() only respects "state->nextcall" and returns this
value over and over again, even if it is in the past, until "now >=
state->nextcallopt" becomes true. Then these two time variables are
corrected by a "callout_process()" call and the situation goes back to
normal.
The problem manifests itself in different ways. The common factor is
the timer process(es) consume all CPU on one or more CPU cores for a
long time, blocking other kernel processes from getting execution
time. This can be seen by very high interrupt counts as displayed by
"vmstat -i | grep timer" right after boot.
When EARLY_AP_STARTUP was enabled in r310177 the likelyhood of hitting
this bug apparently increased.
Example output from "vmstat -i" before patch:
cpu0:timer 7591 69
cpu9:timer 39031773 358089
cpu4:timer 9359 85
cpu3:timer 9100 83
cpu2:timer 9620 88
Example output from "vmstat -i" after patch:
cpu0:timer 4242 34
cpu6:timer 5531 44
cpu3:timer 6450 52
cpu1:timer 4545 36
cpu9:timer 7153 58
Before the patch cpu9 in the example above, was spinning in a loop in
order to reach 39 million interrupts just a few seconds after
bootup. After the patch the timer interrupt counts are more or less
consistent.
Discussed with: mav @
Reported by: several people
MFC after: 1 week
Sponsored by: Mellanox Technologies
It's possible to get EFAULT when writing a segment backed by a file
if the segment extends beyond the file.
The core dump could still be useful if we skip the rest of the segment
and proceed to other segements.
The skipped segment (or a portion of it) will be zero-filled.
While there, use 'const' to signify that core_write() only reads the
buffer and use __DECONST before calling vn_rdwr_inchunks() because it
can be used for both reading and writing.
Before the change:
kernel: Failed to write core file for process mmap_trunc_core (error 14)
kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6
After the change:
kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core
kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped)
Reviewed by: julian, kib
Obtained from: Panzura (older version of the change)
MFC after: 5 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9233