Rack includes the following features:
- A different SACK processing scheme (the old sack structures are not used).
- RACK (Recent acknowledgment) where counting dup-acks is no longer done
instead time is used to knwo when to retransmit. (see the I-D)
- TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
to try not to take a retransmit time-out. (see the I-D)
- Burst mitigation using TCPHTPS
- PRR (partial rate reduction) see the RFC.
Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.
Note that any connection that does not support SACK will be kicked
back to the "default" base FreeBSD stack (currently known as "default").
To build this into your kernel you will need to enable in your
kernel:
makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15525
controller that tries to handle early invocations of the controller,
in other words, invocations before the expected end of the interval.
However, there were some calculation errors in this early invocation
case. Notably, if an early invocation occurred while the error was
negative, the derivative term was off by a large amount. One visible
effect of this error was that processes were being killed by the
virtual memory system's OOM killer when in fact there was plentiful
free memory.
Correct a couple minor errors in the sysctl descriptions, and apply
some style fixes.
Reviewed by: jeff, markj
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).
This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.
On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.
It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
By logging all threads and processes 'pmc filter'
can now filter on process or thread name, relieving
the user of the burden of determining which tid or
pid was which when the sample was taken.
% pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log
% pmc filter -x -T idle pmc.log pmc-noidle.log
Previously, libc.so would initialize its notion of the break address
using _end, a special symbol emitted by the static linker following
the bss section. Compatibility issues between lld and ld.bfd could
cause the wrong definition of _end (libc.so's definition rather than
that of the executable) to be used, breaking the brk()/sbrk()
interface.
Avoid this problem and future interoperability issues by simply not
relying on _end. Instead, modify the break() system call to return
the kernel's view of the current break address, and have libc
initialize its state using an extra syscall upon the first use of the
interface. As a side effect, this appears to fix brk()/sbrk() usage
in executables run with rtld direct exec, since the kernel and libc.so
no longer maintain separate views of the process' break address.
PR: 228574
Reviewed by: kib (previous version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D15663
vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases. Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic. vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value. That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.
Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.
Eliminate a redundant error check from kern_madvise(). Add a comment
explaining where the check is performed.
Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value. Since MADV_FREE is passed as the
behavior, the return value will always be zero.
Reviewed by: kib, markj
MFC after: 7 days
This adds the -U options to pmcstat which will attribute in-kernel samples
back to the user stack that invoked the system call. It is not the default,
because when looking at kernel profiles it is generally more desirable to
merge all instances of a given system call together.
Although heavily revised, this change is directly derived from D7350 by
Jonathan T. Looney.
Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
Plenty of allocation sites pass M_ZERO and sizes which are small and known
at compilation time. Handling them internally in malloc loses this information
and results in avoidable calls to memset.
Instead, let the compiler take the advantage of it whenever possible.
Discussed with: jeff
Filesystem or pager completion callbacks are generally non-functional
after a panic and may trigger deadlocks if invoked in this context
(e.g., by attempting to destroying a buffer mapping). To avoid this
situation, short-circuit I/O completion in biodone().
Reviewed by: imp
Discussed with: mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D15592
Rather than using fixed-length strings, pack them into a string table
to return. Also expand the buffer from ~300 charaters to 3k. This should
be enough, even for USB.
This fixes a problem where USB pnp info is truncated on return to
userland.
Differential Revision: https://reviews.freebsd.org/D15629
Previously we'd report that a file has "no valid symbol table" if it in
fact had two or more. Change the message to report that there must be
exactly one.
The *name parameter passed to iflib_irq_alloc_generic and
iflib_softirq_alloc_generic is never modified. Many places in code pass
string literals and thus should not be modified.
Mark the *name parameter as a const char * instead, so that we enforce
that the name is not modified before passing to bus_describe_intr()
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: kmacy
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15343
The assertion would never fire without truly spectacular future
programming errors.
Reported by: Coverity
CID: 1391367, 1391368
Sponsored by: DARPA, AFRL
This change adds a new optional console method cn_resume and a kernel
console interface cnresume. Consoles that may need to re-initialize
their hardware after suspend (e.g., because firmware does not care to do
it) will implement cn_resume. Note that it is called in rather early
environment not unlike early boot, so the same restrictions apply.
Platform specific code, for platforms that support hardware suspend,
should call cnresume early after resume, before any console output is
expected.
This change fixes a problem with a system of mine failing to resume when
a serial console is used. I found that the serial port was in a strange
configuration and an attempt to write to it likely resulted in an
infinite loop.
To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER
macro has been extended to support optional methods.
Reviewed by: imp, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15552
The vadvise syscall (aka ovadvise) is undocumented and has always been
implmented as returning EINVAL. Put the syscall under COMPAT11 and
provide a userspace implementation.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15557
to themselves. The updated code assumed that that could not happen
and would try to lock the unp mutex twice.
There may be a lingering issue here but this fixes it for the
reporter.
PR: 228458
Reported by: marieheleneka at gmail.com
It was introduced to the tree in r169320 and r169321 in May 2007.
It never got much use and never became a kernel default. The code
duplicates the default path quite a bit, with slight modifications. Just
yank out the cruft. Whatever goals were being aimed for can probably be met
within the existing framework, without a flag day option.
Mostly mechanical change: 'unifdef -m -UINTR_FILTER'.
Reviewed by: mmacy
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15546
Instead, construct an auxargs array and copy it out all at once.
Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happend
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.
Return errors on copyout() and suword() failures and handle them in the
caller.
Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).
Reviewed by: kib
Comments from: emaste, jhb
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15485
Yet another preemption request hitting between the counter being 0
and the check being reached will result in the flag no longer being
set.
Note the situation was already present prior to r334062 and is harmless.
Reported by: pho
Reviewed by: kib
When large SPDs are used, we face two problems:
- too many CPU cycles are spent during the linear searches in the SPD
for each packet
- too much contention on multi socket systems, since we use a single
shared lock.
Main changes:
- added the sysctl tree 'net.key.spdcache' to control the SPD cache
(disabled by default).
- cache the sp indexes that are used to perform SP lookups.
- use a range of dedicated mutexes to protect the cache lines.
Submitted by: Emeric Poupon <emeric.poupon@stormshield.eu>
Reviewed by: ae
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D15050
If a thread waiting on sx dropped Giant it would not be properly
reacquired on exit from the routine, later resulting in panics
indicating Giant is not held (when it should be).
The bug was not present in the original patch sent to pho, I wittingly
added it just prior to the commit and only smoke-tested it.
Reported by: pho
A constant stream of readers could completely starve writers and this is not
a hypothetical scenario.
The 'poll2_threads' test from the will-it-scale suite reliably starves writers
even with concurrency < 10 threads.
The problem was run into and diagnosed by dillon@backplane.com
There was next to no change in lock contention profile during -j 128 pkg build,
despite an sx lock being at the top.
Tested by: pho
Writers waiting on readers to finish can set the RW_LOCK_WRITE_SPINNER
bit. This prevents most new readers from coming on. However, the last
reader to unlock also clears the bit which means new readers can sneak
in and the cycle starts over.
Change the code to keep the bit after last unlock.
Note that starvation potential is still there: no matter how many write
spinners are there, there is one bit. After the writer unlocks, the lock
is free to get raided by readers again. It is good enough for the time
being.
The real fix would include counting writers.
This runs into a caveat: the writer which set the bit may now be preempted.
In order to get rid of the problem all attempts to set the bit are preceeded
with critical_enter.
The bit gets cleared when the thread which set it goes to sleep. This way
an invariant holds that if the bit is set, someone is actively spinning and
will grab the lock soon. In particular this means that readers which find
the lock in this transient state can safely spin until the lock finds itself
an owner (i.e. they don't need to block nor speculate how long to spin
speculatively).
Tested by: pho
I have a system that is very unstable after resuming from suspend-to-RAM
but only if HPET is used as the event timer. The theory is that SMM
code / firmware could be enabling HPET for its own uses and unexpected
interrupts cause a trouble for it. Originally I wanted to solve the
problem in hpet_suspend() method, but that was insufficient as the event
timer could get reprogrammed again.
So, it's better, for my case and in general, to stop the event timer(s)
before entering the hardware suspend.
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D15413
They're only useful when multiple threads may share an epoch record,
and that can't happen with non-preemptible sections.
Reviewed by: mmacy
Differential Revision: https://reviews.freebsd.org/D15507
The interlock pointer is non-NULL by definition and the compiler see through
that and eliminates the NULL checks. Just remove them from the code as they
play no role.
No difference in generated assembly.
two copyrights. The line originated with the Berkeely Regents, who
we have not approached about removing it (it's honestly too trivial
to be worth that fight). Restore it to rwatson's line as well. He
can decide if he wants it or not on his own. Matt clearly doesn't
want it, per project preference and his own statements on IRC.
Noticed by: rgrimes@
Archs using in-tree gcc were broken with `warning: redundant
redeclaration of 'cap_no_rights' [-Wredundant-decls]`.
Sponsored by: The FreeBSD Foundation
The code is pretty much guaranteed not to be able to unlock.
This is a minor nit. The code still performs way too many reads.
The altered exclusive-locked condition is supposed to be always
true as well, to be cleaned up at a later date.
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.
Reported by: markj
adds:
- epoch_enter_critical() - can be called inside a different epoch,
starts a section that will acquire any MTX_DEF mutexes or do
anything that might sleep.
- epoch_exit_critical() - corresponding exit call
- epoch_wait_critical() - wait variant that is guaranteed that any
threads in a section are running.
- epoch_global_critical - an epoch_wait_critical safe epoch instance
Requested by: markj
Approved by: sbruno
This simplifies the use of the path variable by making it NUL
terminated. This is a prerequisite for further cleanups.
Reviewed by: imp
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15467
This change moves to using a reference count across lock drop / reacquire
to guarantee liveness.
Currently sends on unix sockets contend heavily on read locking the list lock.
unix1_processes in will-it-scale peaks at 6 processes and then declines.
With this change I get a substantial improvement in number of operations per second
with 96 processes:
x before
+ after
N Min Max Median Avg Stddev
x 11 1688420 1696389 1693578 1692766.3 2971.1702
+ 10 63417955 71030114 70662504 69576423 2374684.6
Difference at 95.0% confidence
6.78837e+07 +/- 1.49463e+06
4010.22% +/- 88.4246%
(Student's t, pooled s = 1.63437e+06)
And even for 2 processes shows a ~18% improvement.
"Small" iron changes (1, 2, and 4 processes):
x before1
+ after1.2
+------------------------------------------------------------------------+
| + |
| x + |
| x + |
| x + |
| x ++ |
| xx ++ |
|x x xx ++ |
| |__________________A_____M_____AM____||
+------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 1131648 1197750 1197138.5 1190369.3 20651.839
+ 10 1203840 1205056 1204919 1204827.9 353.27404
Difference at 95.0% confidence
14458.6 +/- 13723
1.21463% +/- 1.16683%
(Student's t, pooled s = 14605.2)
x before2
+ after2.2
+------------------------------------------------------------------------+
| +|
| +|
| +|
| +|
| +|
| +|
| x +|
| x +|
| x xx +|
|x xxxx +|
| |___AM_| A|
+------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 1972843 2045866 2038186.5 2030443.8 21367.694
+ 10 2400853 2402196 2401043.5 2401172.7 385.40024
Difference at 95.0% confidence
370729 +/- 14198.9
18.2585% +/- 0.826943%
(Student's t, pooled s = 15111.7)
x before4
+ after4.2
N Min Max Median Avg Stddev
x 10 3986994 3991728 3990137.5 3989985.2 1300.0164
+ 10 4799990 4806664 4806116.5 4805194 1990.6625
Difference at 95.0% confidence
815209 +/- 1579.64
20.4314% +/- 0.0421713%
(Student's t, pooled s = 1681.19)
Tested by: pho
Reported by: mjg
Approved by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15430
Add epoch section to struct thread. We can use this to
ennable epoch counter to advance even if a section is
perpetually occupied by a thread.
Approved by: sbruno
This implements per-thread counters for PMC sampling. The thread
descriptors are stored in a list attached to the process descriptor.
These thread descriptors can store any per-thread information necessary
for current or future features. For the moment, they just store the counters
for sampling.
The thread descriptors are created when the process descriptor is created.
Additionally, thread descriptors are created or freed when threads
are started or stopped. Because the thread exit function is called in a
critical section, we can't directly free the thread descriptors. Hence,
they are freed to a cache, which is also used as a source of allocations
when needed for new threads.
Approved by: sbruno
Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15335
... to process input, instead of inside each smaller operations such as
appending a character or moving the cursor forward.
In other words, before we were doing (oversimplified):
teken_input()
<for each input character>
vtterm_putchar()
VTBUF_LOCK()
VTBUF_UNLOCK()
vtterm_cursor_position()
VTBUF_LOCK()
VTBUF_UNLOCK()
Now, we are doing:
vtterm_pre_input()
VTBUF_LOCK()
teken_input()
<for each input character>
vtterm_putchar()
vtterm_cursor_position()
vtterm_post_input()
VTBUF_UNLOCK()
The situation was even worse when the vtterm_copy() and vtterm_fill()
callbacks were involved.
The new callbacks are:
* struct terminal_class->tc_pre_input()
* struct terminal_class->tc_post_input()
They are called in teken_input(), surrounding the while() loop.
The goal is to improve input processing speed of vt(4). As a benchmark,
here is the time taken to write a text file of 360 000 lines (26 MiB) on
`ttyv0`:
* vt(4), unmodified: 1500 ms
* vt(4), with this patch: 1200 ms
* syscons(4): 700 ms
This is on a Haswell laptop with a GENERIC-NODEBUG kernel.
At the same time, the locking is changed in the vt_flush() function
which is responsible to draw the text on screen. So instead of
(indirectly) using VTBUF_LOCK() just to read and reset the dirty area
of the internal buffer, the lock is held for about the entire function,
including the drawing part.
The change is mostly visible while content is scrolling fast: before,
lines could appear garbled while scrolling because the internal buffer
was accessed without locks (once the scrolling was finished, the output
was correct). Now, the scrolling appears correct.
In the end, the locking model is closer to what syscons(4) does.
Differential Revision: https://reviews.freebsd.org/D15302
- fix load/unload race by allocating the per-domain list structure at boot
- fix long extant vm map LOR by replacing pmc_sx sx_slock with global_epoch
to protect the liveness of elements of the pmc_ss_owners list
Reported by: pho
Approved by: sbruno
The INVARIANTS checks in epoch_wait() were intended to
prevent the block handler from returning with locks held.
What it in fact did was preventing anything except Giant
from being held across it. Check that the number of locks
held has not changed instead.
Approved by: sbruno@
- GC the _nopreempt routines
- to really benefit we'd need a separate routine
- they're not currently in use
- they complicate the API for no benefit at this time
- check that we're actually in a epoch section at exit
- handle epoch_call() early in boot
- Fix copyright declaration language
Approved by: sbruno@
- Create getblkx(9) variant of getblk(9) which can return error.
- Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP
was performed before the buffer is created, and EJUSTRETURN returned
in case the requested block does not exist.
- Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and
allocating the pages for it), copying from zero_region instead.
The end result is less page allocations and buffer recycling when a
hole is read, which is important for some benchmarks.
Requested and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D14917
It appears that domain information is set correctly independent
of whether or not NUMA is defined. However, there is no memory
backing secondary domains leading to allocation failure.
Reported by: pho@, np@
Approved by: sbruno@
On non-trivial SMP systems the contention on the pmc_owner mutex leads
to a substantial number of samples captured being from the pmc process
itself. This change a) makes buffers larger to avoid contention on the
global list b) makes the working sample buffer per cpu.
Run pmcstat in the background (default event rate of 64k):
pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 &
Before:
make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total
After:
make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total
For more realistic overhead measurement set the sample rate for ~2khz
on a 2.1Ghz processor:
pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 &
Collecting 10 samples of `make -j96 buildkernel` from each:
x before
+ after
real time:
N Min Max Median Avg Stddev
x 10 76.4 127.62 84.845 88.577 15.100031
+ 10 59.71 60.79 60.135 60.179 0.29957192
Difference at 95.0% confidence
-28.398 +/- 10.0344
-32.0602% +/- 7.69825%
(Student's t, pooled s = 10.6794)
system time:
N Min Max Median Avg Stddev
x 10 2277.96 6948.53 2949.47 3341.492 1385.2677
+ 10 1038.7 1081.06 1070.555 1064.017 15.85404
Difference at 95.0% confidence
-2277.47 +/- 920.425
-68.1574% +/- 8.77623%
(Student's t, pooled s = 979.596)
x no pmc
+ pmc running
real time:
HEAD:
N Min Max Median Avg Stddev
x 10 58.38 59.15 58.86 58.847 0.22504567
+ 10 76.4 127.62 84.845 88.577 15.100031
Difference at 95.0% confidence
29.73 +/- 10.0335
50.5208% +/- 17.0525%
(Student's t, pooled s = 10.6785)
patched:
N Min Max Median Avg Stddev
x 10 58.38 59.15 58.86 58.847 0.22504567
+ 10 59.71 60.79 60.135 60.179 0.29957192
Difference at 95.0% confidence
1.332 +/- 0.248939
2.2635% +/- 0.426506%
(Student's t, pooled s = 0.264942)
system time:
HEAD:
N Min Max Median Avg Stddev
x 10 1010.15 1073.31 1025.465 1031.524 18.135705
+ 10 2277.96 6948.53 2949.47 3341.492 1385.2677
Difference at 95.0% confidence
2309.97 +/- 920.443
223.937% +/- 89.3039%
(Student's t, pooled s = 979.616)
patched:
N Min Max Median Avg Stddev
x 10 1010.15 1073.31 1025.465 1031.524 18.135705
+ 10 1038.7 1081.06 1070.555 1064.017 15.85404
Difference at 95.0% confidence
32.493 +/- 16.0042
3.15% +/- 1.5794%
(Student's t, pooled s = 17.0331)
Reviewed by: jeff@
Approved by: sbruno@
Differential Revision: https://reviews.freebsd.org/D15155
- initialize the pcpu STAILQ in the NUMA case
- don't enqueue the callback task if there isn't sufficient work to be done
Reported by: pho@
Approved by: sbruno@
- Lend priority to preempted threads in epoch_wait to handle the case
in which we've had priority lent to us. Previously we borrowed the
priority of the lowest priority preempted thread. (pointed out by mjg@)
- Don't attempt allocate memory per-domain on powerpc, we don't currently
handle empty sockets (as is the case on jhibbits Talos' board).
- Handle deferred callbacks as pcpu lists and poll the lists periodically.
Currently the interval is 1/hz.
- Drop the thread lock when adaptive spinning. Holding the lock starves
other threads and can even lead to lockups.
- Keep a generation count pcpu so that we don't keep spining if a thread
has left and re-entered an epoch section.
- Actually removed the callback from the callback list so that we don't
double free. Sigh ...
Approved by: sbruno@
Read locking is over used in the kernel to guarantee liveness. This API makes
it easy to provide livenes guarantees without atomics.
Includes epoch_test kernel module to stress test the API.
Documentation will follow initial use case.
Test case and improvements to preemption handling in response to discussion
with mjg@
Reviewed by: imp@, shurd@
Approved by: sbruno@
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see
correct flags, most important is SMAP.
Tested by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D15367
KTRPOINT() checks both if we are tracing _and_ if we are recursing within
ktrace. The second condition is only ever executed if ktrace is actually
enabled. This change moves the check out of the hot path in to the functions
themselves.
Discussed with mjg@
Reported by: mjg@
Approved by: sbruno@
The entire mechanism is rarely used and is quite not performant due to
atomci ops on the syscall table. It also has added overhead for completely
unrelated syscalls.
Reduce it by avoiding the func calls if possible (which consistutes vast
majority of cases).
Provides about 3% syscall rate speed up for getuid on Broadwell.
If you add a child to a device that has quiet children, we'll
automatically set the quiet flag on the children, and its
children.
This is indended for things like CPU that have a large amount of
repetition in booting that adds nothing.
Boost the priority of user-space threads when they set
their affinity to a core to adjust its frequency. This avoids a situation
where a CPU bound kernel thread with the same affinity is running on a
down-clocked core, and will "block" powerd from up-clocking the core
until the kernel thread yields. This can lead to poor perfomance,
and to things potentially getting stuck on Giant.
Reviewed by: kib (imp reviewed earlier version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15246
The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.
Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.
Reviewed by: cem (earlier version), julian, sbruno
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15251
For most buses all children are equal, so the order does not matter.
Other buses, such as acpi, carefully order their child devices to
express implicit dependencies between them. For such buses it is safer
to bring down devices in the reverse order.
I believe that this is the reason why hpet_suspend had to be disabled.
Some drivers depend on a working event timer until they are suspended.
But previously we would suspend hpet very early.
I tested this change by makinbg hpet_suspend actually stop HPET timers
and tested that too.
Note that this change is not a complete solution as it does not take
into account bus passes.
A better approach would be to track the actual attach order of the
devices and to use the reverse of that.
Reviewed by: imp, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15291
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.
Reviewed by: kib
Differential Revision: D14681
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.
Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280
x dup_before
+ dup_after
+------------------------------------------------------------+
| x + |
|x x x x ++ ++|
| |____AM___| |AM||
+------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08 341205.71
+ 5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08 96232.829
Difference at 95.0% confidence
3.08952e+06 +/- 365604
2.03266% +/- 0.245071%
(Student's t, pooled s = 250681)
Reported by: mjg@
MFC after: 1 week
Required MD bits are only provided for x86.
Reviewed by: jhb (previous version, as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13838
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.
Submitted by: mmacy <mmacy@mattmacy.io>
Reviewed by: shurd, sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14969
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.
MFC after: 1 week
Previously it was possible to connect a socket (which had the
CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in
capabilties mode. This combination should be treated the same as a call
to connect (i.e. forbidden in capabilities mode). Similarly for bindat.
Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the
documentation and add tests.
PR: 222632
Submitted by: Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by: Domagoj Stolfa
MFC after: 1 week
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D15221
While here predict false.
Note the code is wrong (regardless of this change). Dereference of the
pointer can race with module unload. A fix would set the probe to a
nop stub instead of NULL.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.
Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893
To totally silence and ignore secondary kassert violations after a primary
panic, set debug.kassert.do_log=0 and debug.kassert.suppress_in_panic=1.
Additional assertion warnings shouldn't block core dump and may alert the
developer to another erroneous condition. Secondary stack traces may be
printed, identically to the unsuppressed case where panic() is reentered --
controlled via debug.trace_all_panics.
Sponsored by: Dell EMC Isilon
This has been an (incorrect) copy-paste duplicate of debug.kassert.warn_only
since it was originally committed in r243980.
Sponsored by: Dell EMC Isilon
To diagnose and fix secondary panics, it is useful to have a stack trace.
When panic tracing is enabled, optionally trace secondary panics as well.
The option is configured with the tunable/sysctl debug.trace_all_panics.
(The original concern that inspired only tracing the primary panic was
likely that the secondary trace may scroll the original panic message or trace
off the screen. This is less of a concern for serial consoles with logging.
Not everything has a serial console, though, so the behavior is optional.)
Discussed with: jhb
Sponsored by: Dell EMC Isilon
assertions to not suppressing post-panic assertions.
There are some post-panic assertions that are valuable and we shouldn't
default to disabling them. However, when a user trips over them, the
user can still adjust the tunable/sysctl to suppress them temporarily to
get conduct troubleshooting (e.g. get a core dump).
Reported by: cem, markj
r313683 introduced new lockmgr APIs that missed the panic-time neutering
present in the rest of our locks. Correct that by adding the usual check.
Additionally, move the __lockmgr_args neutering above the assertions at the
top of the function. Drop the interlock unlock because we shouldn't have
an unneutered interlock either. No point trying to unlock it.
PR: 227749
Reported by: jtl
Sponsored by: Dell EMC Isilon
Each malloc/free was testing dtrace_malloc_enabled and forcing
extra reads from the malloc type struct to see if perhaps a
dtmalloc probe was on.
Treat it like lockstat and sdt: have a global bolean.
malloc was showing at the top of profile during while running microbenchmarks.
#define DTMALLOC_PROBE_MAX 2
struct malloc_type_internal {
uint32_t mti_probes[DTMALLOC_PROBE_MAX];
u_char mti_zone;
struct malloc_type_stats mti_stats[MAXCPU];
};
Reading mti_zone it wastes a cacheline to hold mti_probes + mti_zone
(which we know is 0) + part of malloc stats of the first cpu which on top
induces false-sharing.
In particular will-it-scale lock1_processes -t 128 -s 10:
before: average:45879692
after: average:51655596
Note the counters can be padded but the right fix is to move them to
counter(9), leaving the struct read-only after creation (modulo dtrace
probes).
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).
Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
This combined with previous changes significantly depessimizes the behaviour
under contentnion.
In particular the lock1_processes test (locking/unlocking separate files)
from the will-it-scale suite was executed with 128 concurrency on a
4-socket Broadwell with 128 hardware threads.
Operations/second (lock+unlock) go from ~750000 to ~45000000 (6000%)
For reference single-process is ~1680000 (i.e. on stock kernel the resulting
perf is less than *half* of the single-threaded run),
Note this still does not really scale all that well as the locks were just
bolted on top of the current implementation. Significant room for improvement
is still here. In particular the top performance fluctuates depending on the
extent of false sharing in given run (which extends beyond the file).
Added chain+lock pairs were not padded w.r.t. cacheline size.
One big ticket item is the hash used for spreading threads: it used to be the
process pid (which basically serialized all threaded ops). Temporarily the
vnode addr was slapped in instead.
Tested by: pho
kproc_suspend_check. In r329612 bufspacedaemon was turned into a thread
of the bufdaemon process causing both to call kproc_suspend_check with the
same proc argument and that function contains the following while loop:
while (SIGISMEMBER(p->p_siglist, SIGSTOP)) {
wakeup(&p->p_siglist);
msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0);
}
So one thread wakes up the other and the other wakes up the first again,
locking up UP machines on shutdown.
Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they
run after the syncer has shutdown, because the syncer can cause a
situation where bufdaemon help is needed to proceed.
PR: 227404
Reviewed by: kib
Tested by: cy, rmacklem
1. check if P_ADVLOCK is already set and if so, don't lock to set it
(stolen from DragonFly)
2. when trying for fast path unlock, check that we are doing unlock
first instead of taking the interlock for no reason (e.g. if we want
to *lock*). whilere make it more likely that falling fast path will
not take the interlock either by checking for state
Note the code is severely pessimized both single- and multithreaded.
these assumptions may not hold true once we've panic'd. Therefore, the
checks hold less value after a panic. Additionally, if one of the checks
fails while we are already panic'd, this creates a double-panic which can
interfere with debugging the original panic.
Therefore, this commit allows an administrator to suppress a response to
KASSERT checks after a panic by setting a tunable/sysctl. The
tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being
enabled.
Reviewed by: kib
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D12920
-> PROC_PDEATHSIG_STATUS for consistency with other procctl(2)
operations names.
Requested by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 13 days
The lock is required to ensure that the switch to the new credentials
and the transfer of the process's accounting data from the old
credentials to the new ones is done atomically. Otherwise, some updates
may be applied to the new credentials and then additionally transferred
from the old credentials if the updates happen after proc_set_cred() and
before racct_proc_ucred_changed().
The problem is especially pronounced for RACCT_RSS because
- there is a strict accounting for this resource (it's reclaimable)
- it's updated asynchronously by the vm daemon
- it's updated by setting an absolute value instead of applying a delta
I had to remove a call to rctl_proc_ucred_changed() from
racct_proc_ucred_changed() and make all callers of latter call the
former as well. The reason is that rctl_proc_ucred_changed, as it is
implemented now, cannot be called while holding the proc lock, so the
lock is dropped after calling racct_proc_ucred_changed. Additionally,
I've added calls to crhold / crfree around the rctl call, because
without the proc lock there is no gurantee that the new credentials,
owned by the process, will stay stable. That does not eliminate a
possibility that the credentials passed to the rctl will get stale.
Ideally, rctl_proc_ucred_changed should be able to work under the proc
lock.
Many thanks to kib for pointing out the above problems.
PR: 222027
Discussed with: kib
No comment: trasz
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D15048
Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().
Reviewed by: emaste
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15123
Allow processes to request the delivery of a signal upon death of
their parent process. Supposed consumer of the feature is PostgreSQL.
Submitted by: Thomas Munro
Reviewed by: jilles, mjg
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D15106
fget_cap() tries to do a cheaper snapshot of a file descriptor without
holding the file descriptor lock. This snapshot does not do a deep
copy of the ioctls capability array, but instead uses a different
return value to inform the caller to retry the copy with the lock
held. However, filecaps_copy() was returning 1 to indicate that a
retry was required, and fget_cap() was checking for 0 (actually
'!filecaps_copy()'). As a result, fget_cap() did not do a deep copy
of the ioctls array and just reused the original pointer. This cause
multiple file descriptor entries to think they owned the same pointer
and eventually resulted in duplicate frees.
The only code path that I'm aware of that triggers this is to create a
listen socket that has a restricted list of ioctls and then call
accept() which calls fget_cap() with a valid filecaps structure from
getsock_cap().
To fix, change the return value of filecaps_copy() to return true if
it succeeds in copying the caps and false if it fails because the lock
is required. I find this more intuitive than fixing the caller in
this case. While here, change the return type from 'int' to 'bool'.
Finally, make filecaps_copy() more robust in the failure case by not
copying any of the source filecaps structure over. This avoids the
possibility of leaking a pointer into a structure if a similar future
caller doesn't properly handle the return value from filecaps_copy()
at the expense of one more branch.
I also added a test case that panics before this change and now passes.
Reviewed by: kib
Discussed with: mjg (not a fan of the extra branch)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D15047
Always take the AST path rather than calling MD functions which are
often implemented as always failing. The is the case on amd64, arm,
i386, and powerpc. This optimization (inherited from 4.4 Lite) is a
pessimization on those architectures and is the sole use of these
functions. They will be removed in a seperate commit.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15101
This behavior is already documented by the man page, and suggested by POSIX.
Reviewed by: jhb
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15099
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.
By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap. The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.
There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.
copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kind of locks for copyout ptes and
accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporary
changes mapping to the userspace and copies the data to/from small
per-cpu buffer in the trampoline. If a page fault occurs during the
copy, it is short-circuit by exception.s to not even reach C code.
The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done. The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging. I expect that the
issues due to the virtual address space limits would only exaggerate
and the split gives more liveness to the platform.
Tested by: pho
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14633
Now that 10 years have passed since the original limit of 10000 was
committed, bump it a little bit.
Spinning waiting for writers is semi-informed in the sense that we always
know if the owner is running and base the decision to spin on that.
However, no such information is provided for read-locking. In particular
this means that it is possible for a write-spinner to completely waste cpu
time waiting for the lock to be released, while the reader holding it was
preempted and is now waiting for the spinner to go off cpu.
Nonetheless, in majority of cases it is an improvement to spin instead of
instantly giving up and going to sleep.
The current approach is pretty simple: snatch the number of current readers
and performs that many pauses before checking again. The total number of
pauses to execute is limited to 10k. If the lock is still not free by
that time, go to sleep.
Given the previously noted problem of not knowing whether spinning makes
any sense to begin with the new limit has to remain rather conservative.
But at the very least it should also be related to the machine. Waiting
for writers uses parameters selected based on the number of activated
hardware threads. The upper limit of pause instructions to be executed
in-between re-reads of the lock is typically 16384 or 32678. It was
selected as the limit of total spins. The lower bound is set to
already present 10000 as to not change it for smaller machines.
Bumping the limit reduces system time by few % during benchmarks like
buildworld, buildkernel and others. Tested on 2 and 4 socket machines
(Broadwell, Skylake).
Figuring out how to make a more informed decision while not pessimizing
the fast path is left as an exercise for the reader.
Sometimes the values contain geli passphrases being communicated from
loader(8) to the kernel, and some day the compiler may decide to start
eliding calls to memset() for a pointer which is not dereferenced again
before being passed to free().
This allows NIC drivers to sleep on polling config operations.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14982
Previous limits were chosen when locking primitives had spurious lock
accesses.
Flipping the starting point to 1 (or rather 2 as the first call shifts it)
provides a modest win when mild contention is seen while not hurting worse
cases. Tested on a bunch of one, two and four socket old and new systems
(Westmere, Skylake, Threadreaper and others) by doing concurrent page faults,
buildkernel/buildworld and other stuff (although not all systems got all the
tests).
Another thing is the upper limit. It is semi-arbitrarily chosen as it was
getting out of hand for slightly less small systems (e.g. a 128-thread one).
Note that backoff is fundamentally a speculative bandaid and this change just
makes it fit a little bit better. It remains completely oblivious to the
hardware topology or the contention pattern. This is being experimented with.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
Modify makesyscalls.sh to strip out SAL annotations.
No functional change.
This is based on work I started in CheriBSD and use to validate fat
pointers at the syscall boundary. Tal Garfinkel reviewed the changes,
added annotations to COMPAT* syscalls and is using them in a record and
playback framework. One can envision other uses such as a WITNESS-like
validator for copyin/out as speculated on in the review.
As this time we are only annotating sys/kern/syscalls.master as that is
sufficient for userspace work. If kernel use cases materialize, we can
annotate other syscalls.master as needed.
Submitted by: Tal Garfinkel <talg@cs.stanford.edu>
Sponsored by: DARPA, AFRL (in part)
Differential Revision: https://reviews.freebsd.org/D14285
Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.
Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user. This
also eliminates the need for the complicated interrupt binding code.
Add a sysctl API for viewing and manipulating domainsets. Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both. This probably belongs in a dedicated subr file.
Attempt to improve the include situation.
Reviewed by: kib
Discussed with: jhb (cpuset parts)
Tested by: pho (before review feedback)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14839
platforms. Original commit message as follows:
Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14838
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.
Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.
Reviewed by: mjg
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D14869
It is random collection of fixes for issues not yet corrected,
reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many
issues from that list were already corrected. Most of them are for
compat32, old compat32 or affect both primary host ABI and compat32.
The freebsd32_kldstat(), for instance, was already fixed by using
malloc(M_ZERO). Patch includes correction to report the supplied
version back, which is just pedantic.
Reviewed by: brooks, emaste (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14868
Include _uio.h instead of uio.h in several headers to reduce header
polution.
Fix a few places that relied on header polution to get the uio.h header.
I have not moved struct uio as many more things that use it rely on
header polution to get other definitions from uio.h.
Reviewed by: cem, kib, markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14811
If the operation is not an update, if neither r/w nor r/o mode is
explicitly requested, if the error code hints at the possibility of the
media being read-only, and if the fallback is allowed, then we can try
to automatically downgrade to the readonly mode.
This is especially useful for auto-mounting of removable media that
sometimes can happen to be write-protected.
The fallback to r/o is not enabled by default. It can be requested on a
per-mount basis with a new mount option, 'autoro'. Or it can be
globally allowed by setting vfs.default_autoro.
Reviewed by: cem, kib
MFC after: 3 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D13361
We may remove a sleepqueue from the hash table in
sleepq_resume_thread().
Reviewed by: kib
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14847
Assert that all such memory is unwired on return to usermode.
The count of the wired memory will be used to detect the copyout mode.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Most important for the future use, do not call
vm_fault_quick_hold_pages() with disabled pagefaults.
Reported and tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the task context.
shutdown_nice() is used from the fast interrupt handlers, mostly for
console drivers, where we cannot lock blockable locks. Schedule the
task in the fast queue to send the signal from the proper context.
Reviewed by: imp
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
summits at BSDCan and BSDCam in 2017.
The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.
It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.
You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.
This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.
There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.
Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085
The object lock was only needed when attempting to free B_DIRECT
buffer pages, and for testing for invalid pages (and freeing them
if so). Handle the latter by instead moving invalid pages near the head
of the inactive queue, where they will be reclaimed quickly.
Reviewed by: alc, kib, jeff
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14778
The general idea here is to provide userspace programs with well-defined
sources of entropy, in a fashion that doesn't require opening a new file
descriptor (ulimits) or accessing paths (/dev/urandom may be restricted
by chroot or capsicum).
getrandom(2) is the more general API, and comes from the Linux world.
Since our urandom and random devices are identical, the GRND_RANDOM flag
is ignored.
getentropy(3) is added as a compatibility shim for the OpenBSD API.
truss(1) support is included.
Tests for both system calls are provided. Coverage is believed to be at
least as comprehensive as LTP getrandom(2) test coverage. Additionally,
instructions for running the LTP tests directly against FreeBSD are provided
in the "Test Plan" section of the Differential revision linked below. (They
pass, of course.)
PR: 194204
Reported by: David CARLIER <david.carlier AT hardenedbsd.org>
Discussed with: cperciva, delphij, jhb, markj
Relnotes: maybe
Differential Revision: https://reviews.freebsd.org/D14500
flag and both the regular and "no" names, instead of two different string
arrays whose indices need to match the flag's bit position. This makes
them similar to the say "jailsys" options are represented.
Loop through either kind of option array with a structure pointer rather
then an integer index.
It is incomplete, has not been adopted in the other locking primitives,
and we have other means of measuring lock contention (lock_profiling,
lockstat, KTR_LOCK). Drop it to slightly de-clutter the mutex code and
free up a precious KTR class index.
Reviewed by: jhb, mjg
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14771
arg2 is an intmax_t, which on 32-bit architectures is 64 bits, wider than a
pointer. When &bdomain[i] is added to arg2 it widens from uintptr_t to
intmax_t, then gcc whines when it gets cast to a pointer. Casting through
uintptr_t silences this warning.
mtx_init does not do a copy of the name string it is passed. The
eventhandler code incorrectly passed the parameter string directly to
mtx_init instead of using the copy it makes. This was an existing
problem with the code that I dutifully copied over in my changes in r325621.
Reported by: Anton Rang <rang AT acm.org>
Reviewed by: rstone, markj
Approved by: rstone (mentor)
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14764
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.
Reviewed by: alc
Tested by: pho
MFC after: 2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
The slow path is always taken when lockstat is enabled. This induces
rdtsc (or other) calls to get the cycle count even when there was no
contention.
Still go to the slow path to not mess with the fast path, but avoid
the heavy lifting unless necessary.
This reduces sys and real time during -j 80 buildkernel:
before: 3651.84s user 1105.59s system 5394% cpu 1:28.18 total
after: 3685.99s user 975.74s system 5450% cpu 1:25.53 total
disabled: 3697.96s user 411.13s system 5261% cpu 1:18.10 total
So note this is still a significant hit.
LOCK_PROFILING results are not affected.
where we had not hit global dirty limits but a single queue was starved
for space by dirty buffers. A single buf_daemon is maintained for now.
Add a bd_speedup() when we are low on bufspace. This can happen due to SUJ
keeping many bufs locked until a cg block is written. Document this with
a comment.
Fix sysctls to work with per-domain variables. Add more ddb debugging.
Reported by: pho
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14705
It is believed that the conditions Coverity indicated were actually
impossible to hit. So this patch just adds a cleanup to only compute
v_mount once in brelse(), and in vfs_bio_getpages() always initializes error
to zero to appease the static analyzer.
No functional change intended.
Submitted by: Darrick Lew <darrick.freebsd AT gmail.com>
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14613
Migrate to modern types before creating MD Linuxolator bits for new
architectures.
Reviewed by: cem
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D14676
The old code used the thread's pcb via the uap->data pointer.
Reviewed by: ed
Approved by: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14674
Remove NO_FUEWORD so the 'e' variants are wrapped by the non-'e'
variants. This is more correct and leaves sparc64 as the outlier.
Reviewed by: jmallett, kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14603
The vfs.mountroot.timeout tunable and .timeout directive in a mount.conf(5)
file allow specifying a wait timeout for the device(s) hosting the root
filesystem to become usable. The current mechanism for waiting for devices
and detecting their availability can't be used for zfs-hosted filesystems.
See the comment #20 in the PR for some expanded detail on these points.
This change adds retry logic to the actual root filesystem mount. That is,
insted of relying on device availability using device name lookups, it uses
the kernel_mount() call itself to detect whether the filesystem can be
mounted, and loops until it succeeds or the configured timeout is exceeded.
These changes are based on the patch attached to the PR, but it's rewritten
enough that all mistakes belong to me.
PR: 208882
X-MFC after: sufficient testing, and hopefully in time for 11.1
Non-NULL timeouts where copied in improperly and could produce failures
due to incompatible data structures.
Reviewed by: kib
MFC after: 3 days
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14587
Normally after grabbing the lock it has to be verified we got the right one
to begin with. However, if we are recursing, it must not change thus the
check can be avoided. In particular this avoids a lock read for non-recursing
case which found out the lock was changed.
While here avoid an irq trip of this happens.
Tested by: pho (previous version)
The code already pays the cost of reading the lock to obtain the waiters
flag. Checking whether there is more than one reader is not a problem and
avoids dirtying the line.
This also fixes a small corner case: if waiters were to show up between
reading the flag and upgrading the lock, the operation would fail even
though it should not. No correctness change here though.
If there were exactly rowner_retries/asx_retries (by default: 10) transitions
between read and write state and the waiters still did not get the lock, the
next owner -> reader transition would result in the code correctly falling
back to turnstile/sleepq where it would incorrectly think it was waiting
for a writer and decide to leave turnstile/sleepq to loop back. From this
point it would take ts/sq trips until the lock gets released.
The bug sometimes manifested itself in stalls during -j 128 package builds.
Refactor the code to fix the bug, while here remove some of the gratituous
differences between rw and sx locks.
The main routine takes 8 args, 3 of which are almost the same for most uses.
This in particular pushes it above the limit of 6 arguments passable through
registers on amd64 making it impossible to tail call.
This is a prerequisite for further cleanups.
Tested by: pho
This deliberately breaks the API in preperation for future syscall
revisions which will remove these nonstandard members.
In an exp-run a single port (devel/qemu-user-static) was found to
use them which it did becuase it emulates system calls. This has
been fixed in the ports tree.
PR: 224443 (exp-run)
Reviewed by: kib, jhb (previous version)
Exp-run by: antoine
Sponsored by: DARPA, AFRP
Differential Revision: https://reviews.freebsd.org/D14490
The first call is used to gauge how much spaces is needed. Just computing
the size instead of generating the output allows to not take the proctree
lock.
use it to regulate page daemon output.
This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.
Reviewed by: bsdimp
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14402
search for a thread to steal inside a critical section. Since this
allows the search to be preempted, restart the search if preemption
happens since the search results found earlier may no longer be
valid.
Decrease the latency of starting a thread that may be assigned to
this CPU during the search by polling for incoming threads during
the search and switching to that thread instead of continuing the
search.
Test for stale search results and restart the search before going
through the expense of calling tdq_lock_pair(). Retry some tests
after grabbing the locks since things may have changed while waiting
to get both locks.
Eliminate special case handling for stealing from an SMT peer that
uses 1 as the steal threshold. This can only succeed if a thread
has been assigned but our SMT peer has not yet started executing
it. This is quite rare and when it happens the other SMT thread
is generally waiting for the same tdq lock that we hold. Basically
both SMT threads are racing to grab the same spin lock.
Add the kern.sched.always_steal knob from a ULE patch by jeff@.
Incorporate another idea from Jeff's ULE patch. If the sched_switch()
detects that the CPU is about to go idle, try to steal a thread
before switching to the idle thread. Since the search for a thread
to steal has to be done inside a critical section in this context,
limit the impact on latency by adding the knob kern.sched.trysteal_limit
to limit the topological distance of the search and don't restart
the search if we detect stale results. If this search can't find
an stealable thread, the idle loop can do a more complete search.
Also poll for threads being assigned to this CPU during the search
and switch to them instead of continuing the search. This change
is responsibile for the majority of the improvement in parallel
buildworld times.
In sched_balance_group() change the minimum threshold from stealing
a thread from 1 to 2. Poaching a newly assigned thread from a CPU
that is waking up hasn't yet switched to that thread from idle is
likely very rare and is likely to have the same lock race as is
seen when stealing threads in the idle loop. Also use tdq_notify()
to kick the destintation CPU instead of always sending an IPI.
Update a stale comment, the number of transferable threads is not
calculated.
Reviewed by: kib (earlier version)
Comments by: avg, jeff, mav
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12130