This change moves to using a reference count across lock drop / reacquire
to guarantee liveness.
Currently, sends on unix sockets contend heavily on read-locking the list lock.
unix1_processes in will-it-scale peaks at 6 processes and then declines.
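A minimal userspace model of the pattern, with illustrative names (struct
peer, peer_send) rather than the actual unpcb code:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    struct peer {
            atomic_int refcnt;              /* liveness guarantee; the
                                               owner creates it with
                                               refcnt == 1 */
            /* ... socket state ... */
    };

    static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

    static void
    peer_release(struct peer *p)
    {
            /* Dropping the last reference frees the object. */
            if (atomic_fetch_sub(&p->refcnt, 1) == 1)
                    free(p);
    }

    static void
    peer_send(struct peer *p)
    {
            pthread_rwlock_rdlock(&list_lock);
            atomic_fetch_add(&p->refcnt, 1);   /* pin the peer */
            pthread_rwlock_unlock(&list_lock); /* drop the lock early */
            /* ... the (possibly slow) send runs with no list lock
               held; the reference keeps the peer from being freed ... */
            peer_release(p);
    }

The lock is now held only long enough to take the reference, instead of
across the whole send, so senders no longer pound on the lock's cache line
for the duration of every operation.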
With this change I get a substantial improvement in the number of operations per second
with 96 processes:
x before
+ after
N Min Max Median Avg Stddev
x 11 1688420 1696389 1693578 1692766.3 2971.1702
+ 10 63417955 71030114 70662504 69576423 2374684.6
Difference at 95.0% confidence
6.78837e+07 +/- 1.49463e+06
4010.22% +/- 88.4246%
(Student's t, pooled s = 1.63437e+06)
Even 2 processes show a ~18% improvement.
"Small" iron changes (1, 2, and 4 processes):
x before1
+ after1.2
+------------------------------------------------------------------------+
| + |
| x + |
| x + |
| x + |
| x ++ |
| xx ++ |
|x x xx ++ |
| |__________________A_____M_____AM____||
+------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 1131648 1197750 1197138.5 1190369.3 20651.839
+ 10 1203840 1205056 1204919 1204827.9 353.27404
Difference at 95.0% confidence
14458.6 +/- 13723
1.21463% +/- 1.16683%
(Student's t, pooled s = 14605.2)
x before2
+ after2.2
+------------------------------------------------------------------------+
| +|
| +|
| +|
| +|
| +|
| +|
| x +|
| x +|
| x xx +|
|x xxxx +|
| |___AM_| A|
+------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 1972843 2045866 2038186.5 2030443.8 21367.694
+ 10 2400853 2402196 2401043.5 2401172.7 385.40024
Difference at 95.0% confidence
370729 +/- 14198.9
18.2585% +/- 0.826943%
(Student's t, pooled s = 15111.7)
x before4
+ after4.2
N Min Max Median Avg Stddev
x 10 3986994 3991728 3990137.5 3989985.2 1300.0164
+ 10 4799990 4806664 4806116.5 4805194 1990.6625
Difference at 95.0% confidence
815209 +/- 1579.64
20.4314% +/- 0.0421713%
(Student's t, pooled s = 1681.19)
Tested by: pho
Reported by: mjg
Approved by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15430
Add an epoch section to struct thread. We can use this to
enable the epoch counter to advance even if a section is
perpetually occupied by a thread.
Approved by: sbruno
This implements per-thread counters for PMC sampling. The thread
descriptors are stored in a list attached to the process descriptor.
These thread descriptors can store any per-thread information necessary
for current or future features. For the moment, they just store the counters
for sampling.
The thread descriptors are created when the process descriptor is created.
Additionally, thread descriptors are created or freed when threads
are started or stopped. Because the thread exit function is called in a
critical section, we can't directly free the thread descriptors. Hence,
they are freed to a cache, which is also used as a source of allocations
when needed for new threads.
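A userspace model of the cache idea (illustrative names, not the actual
hwpmc structures; the kernel version must also avoid blocking locks on
the exit path):

    #include <pthread.h>
    #include <stdlib.h>

    struct thread_desc {
            struct thread_desc *next;
            long sample_count;              /* per-thread PMC counters */
    };

    static struct thread_desc *td_cache;    /* freed-descriptor cache */
    static pthread_mutex_t td_cache_mtx = PTHREAD_MUTEX_INITIALIZER;

    /* Called at thread exit: may not free, so stash in the cache. */
    static void
    thread_desc_free(struct thread_desc *td)
    {
            pthread_mutex_lock(&td_cache_mtx);
            td->next = td_cache;
            td_cache = td;
            pthread_mutex_unlock(&td_cache_mtx);
    }

    /* Called at thread creation: reuse a cached descriptor if any. */
    static struct thread_desc *
    thread_desc_alloc(void)
    {
            struct thread_desc *td;

            pthread_mutex_lock(&td_cache_mtx);
            td = td_cache;
            if (td != NULL)
                    td_cache = td->next;
            pthread_mutex_unlock(&td_cache_mtx);
            return (td != NULL ? td : calloc(1, sizeof(*td)));
    }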
Approved by: sbruno
Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15335
... to process input, instead of inside each smaller operation, such as
appending a character or moving the cursor forward.
In other words, before we were doing (oversimplified):

    teken_input()
        <for each input character>
            vtterm_putchar()
                VTBUF_LOCK()
                VTBUF_UNLOCK()
            vtterm_cursor_position()
                VTBUF_LOCK()
                VTBUF_UNLOCK()

Now, we are doing:

    vtterm_pre_input()
        VTBUF_LOCK()
    teken_input()
        <for each input character>
            vtterm_putchar()
            vtterm_cursor_position()
    vtterm_post_input()
        VTBUF_UNLOCK()
The situation was even worse when the vtterm_copy() and vtterm_fill()
callbacks were involved.
The new callbacks are:
* struct terminal_class->tc_pre_input()
* struct terminal_class->tc_post_input()
They are called in teken_input(), surrounding the while() loop.
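A condensed sketch of the shape of the change (the types and the
teken_input() signature are simplified from the real teken/vt code):

    #include <stddef.h>

    struct terminal;

    struct terminal_class {
            void (*tc_pre_input)(struct terminal *);
            void (*tc_post_input)(struct terminal *);
            /* ... tc_putchar, tc_cursor, tc_copy, tc_fill, ... */
    };

    struct terminal {
            const struct terminal_class *tm_class;
            /* ... */
    };

    void teken_input_char(struct terminal *, char);   /* per-char work */

    void
    teken_input(struct terminal *tm, const char *buf, size_t len)
    {
            /* Lock once for the whole batch of input ... */
            if (tm->tm_class->tc_pre_input != NULL)
                    tm->tm_class->tc_pre_input(tm);   /* VTBUF_LOCK() */
            for (size_t i = 0; i < len; i++)
                    teken_input_char(tm, buf[i]);     /* no locking in
                                                         putchar/cursor */
            /* ... and unlock once when the batch is done. */
            if (tm->tm_class->tc_post_input != NULL)
                    tm->tm_class->tc_post_input(tm);  /* VTBUF_UNLOCK() */
    }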
The goal is to improve input processing speed of vt(4). As a benchmark,
here is the time taken to write a text file of 360 000 lines (26 MiB) on
`ttyv0`:
* vt(4), unmodified: 1500 ms
* vt(4), with this patch: 1200 ms
* syscons(4): 700 ms
This is on a Haswell laptop with a GENERIC-NODEBUG kernel.
At the same time, the locking is changed in the vt_flush() function,
which is responsible for drawing the text on screen. So instead of
(indirectly) using VTBUF_LOCK() just to read and reset the dirty area
of the internal buffer, the lock is now held for nearly the entire
function, including the drawing part.
The change is mostly visible while content is scrolling fast: before,
lines could appear garbled while scrolling because the internal buffer
was accessed without locks (once the scrolling was finished, the output
was correct). Now, the scrolling appears correct.
In the end, the locking model is closer to what syscons(4) does.
Differential Revision: https://reviews.freebsd.org/D15302
- fix load/unload race by allocating the per-domain list structure at boot
- fix long extant vm map LOR by replacing pmc_sx sx_slock with global_epoch
to protect the liveness of elements of the pmc_ss_owners list
Reported by: pho
Approved by: sbruno
The INVARIANTS checks in epoch_wait() were intended to
prevent the block handler from returning with locks held.
What they in fact did was prevent anything except Giant
from being held across it. Check instead that the number
of locks held has not changed.
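A sketch of the intent; curthread_lock_count() is a stand-in for however
the held-lock count is actually obtained (in the kernel it comes from the
lock tracking that INVARIANTS enables):

    #include <assert.h>

    struct epoch;
    int curthread_lock_count(void);     /* stand-in, see above */

    void
    epoch_wait(struct epoch *e)
    {
            int before = curthread_lock_count();

            /* ... block until all readers have left the epoch ... */

            /* The block handler must not leak or drop locks, but it
               is fine for Giant (or anything else) to be held across
               the wait as long as the count is unchanged. */
            assert(curthread_lock_count() == before);
    }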
Approved by: sbruno@
- GC the _nopreempt routines
  - to really benefit we'd need a separate routine
  - they're not currently in use
  - they complicate the API for no benefit at this time
- check that we're actually in an epoch section at exit
- handle epoch_call() early in boot
- Fix copyright declaration language
Approved by: sbruno@
- Create getblkx(9) variant of getblk(9) which can return error.
- Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP
  be performed before the buffer is created, and that EJUSTRETURN be
  returned if the requested block does not exist.
- Make ffs_read() use GB_NOSPARSE to avoid instantiating a buffer (and
  allocating pages for it), copying from zero_region instead.
The end result is fewer page allocations and less buffer recycling when
a hole is read, which is important for some benchmarks.
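A sketch of the intended use in a read path, in the style of ffs_read()
(surrounding declarations and the exact getblkx() argument list are
abbreviated):

    error = getblkx(vp, lbn, size, 0, 0, GB_NOSPARSE, &bp);
    if (error == EJUSTRETURN) {
            /* Hole: nothing is allocated on disk, so copy out zeroes
               instead of instantiating and zero-filling a buffer. */
            xfersize = MIN(xfersize, ZERO_REGION_SIZE);
            error = uiomove(__DECONST(void *, zero_region),
                xfersize, uio);
    } else if (error == 0) {
            /* Backed block: copy from the buffer as before. */
            error = uiomove(bp->b_data + blkoffset, xfersize, uio);
            brelse(bp);
    }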
Requested and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D14917
It appears that domain information is set correctly independent
of whether or not NUMA is defined. However, there is no memory
backing secondary domains, leading to allocation failures.
Reported by: pho@, np@
Approved by: sbruno@
On non-trivial SMP systems the contention on the pmc_owner mutex leads
to a substantial number of the samples captured being from the pmc
process itself. This change (a) makes the buffers larger to avoid
contention on the global list and (b) makes the working sample buffer
per-CPU.
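A model of the layout change, with illustrative names and sizes rather
than the actual hwpmc structures:

    struct sample_buffer {
            uint64_t samples[4096];         /* (a) larger buffer */
            int      fill;
    } __aligned(CACHE_LINE_SIZE);

    /* (b) One working buffer per CPU: the sampling interrupt writes
       only to its own CPU's buffer, so the fast path touches no
       global lock and no shared cache line. */
    static struct sample_buffer sample_buffers[MAXCPU];

    static void
    pmc_record_sample(uint64_t pc)
    {
            struct sample_buffer *sb = &sample_buffers[curcpu];

            if (sb->fill < (int)nitems(sb->samples))
                    sb->samples[sb->fill++] = pc;
    }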
Run pmcstat in the background (default event rate of 64k):
pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 &
Before:
make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total
After:
make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total
For a more realistic overhead measurement, set the sample rate to ~2kHz
on a 2.1GHz processor:
pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 &
Collecting 10 samples of `make -j96 buildkernel` from each:
x before
+ after
real time:
N Min Max Median Avg Stddev
x 10 76.4 127.62 84.845 88.577 15.100031
+ 10 59.71 60.79 60.135 60.179 0.29957192
Difference at 95.0% confidence
-28.398 +/- 10.0344
-32.0602% +/- 7.69825%
(Student's t, pooled s = 10.6794)
system time:
N Min Max Median Avg Stddev
x 10 2277.96 6948.53 2949.47 3341.492 1385.2677
+ 10 1038.7 1081.06 1070.555 1064.017 15.85404
Difference at 95.0% confidence
-2277.47 +/- 920.425
-68.1574% +/- 8.77623%
(Student's t, pooled s = 979.596)
x no pmc
+ pmc running
real time:
HEAD:
N Min Max Median Avg Stddev
x 10 58.38 59.15 58.86 58.847 0.22504567
+ 10 76.4 127.62 84.845 88.577 15.100031
Difference at 95.0% confidence
29.73 +/- 10.0335
50.5208% +/- 17.0525%
(Student's t, pooled s = 10.6785)
patched:
N Min Max Median Avg Stddev
x 10 58.38 59.15 58.86 58.847 0.22504567
+ 10 59.71 60.79 60.135 60.179 0.29957192
Difference at 95.0% confidence
1.332 +/- 0.248939
2.2635% +/- 0.426506%
(Student's t, pooled s = 0.264942)
system time:
HEAD:
N Min Max Median Avg Stddev
x 10 1010.15 1073.31 1025.465 1031.524 18.135705
+ 10 2277.96 6948.53 2949.47 3341.492 1385.2677
Difference at 95.0% confidence
2309.97 +/- 920.443
223.937% +/- 89.3039%
(Student's t, pooled s = 979.616)
patched:
N Min Max Median Avg Stddev
x 10 1010.15 1073.31 1025.465 1031.524 18.135705
+ 10 1038.7 1081.06 1070.555 1064.017 15.85404
Difference at 95.0% confidence
32.493 +/- 16.0042
3.15% +/- 1.5794%
(Student's t, pooled s = 17.0331)
Reviewed by: jeff@
Approved by: sbruno@
Differential Revision: https://reviews.freebsd.org/D15155
- initialize the pcpu STAILQ in the NUMA case
- don't enqueue the callback task if there isn't sufficient work to be done
Reported by: pho@
Approved by: sbruno@
- Lend priority to preempted threads in epoch_wait to handle the case
in which we've had priority lent to us. Previously we borrowed the
priority of the lowest priority preempted thread. (pointed out by mjg@)
- Don't attempt to allocate memory per-domain on powerpc; we don't
  currently handle empty sockets (as is the case on jhibbits' Talos board).
- Handle deferred callbacks as pcpu lists and poll the lists periodically.
Currently the interval is 1/hz.
- Drop the thread lock when adaptive spinning. Holding the lock starves
other threads and can even lead to lockups.
- Keep a per-CPU generation count so that we don't keep spinning if a
  thread has left and re-entered an epoch section (see the sketch after
  this list).
- Actually remove the callback from the callback list so that we don't
  double-free. Sigh ...
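A sketch of the generation-count check, using illustrative per-CPU
fields: the waiter samples the generation once, then stops spinning as
soon as either the CPU has left the section or the generation has
advanced, which means the original occupant is gone even if another
thread has since entered.

    #include <stdatomic.h>

    struct epoch_pcpu {
            atomic_long gen;        /* bumped on every section entry */
            atomic_int  in_section;
    };

    static void
    epoch_wait_cpu(struct epoch_pcpu *ep)
    {
            long gen = atomic_load(&ep->gen);

            while (atomic_load(&ep->in_section) &&
                atomic_load(&ep->gen) == gen)
                    ;       /* spin; priority lending elided */
    }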
Approved by: sbruno@
Read locking is overused in the kernel to guarantee liveness. This API
makes it easy to provide liveness guarantees without atomics.
Includes epoch_test kernel module to stress test the API.
Documentation will follow initial use case.
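The basic reader pattern the API is meant to provide (a sketch of
intended usage; global_epoch, the list, and use() are illustrative):

    epoch_enter(global_epoch);
    TAILQ_FOREACH(item, &some_list, link) {
            /* Readers traverse lock-free. Writers unlink items but
               defer freeing them (epoch_call()/epoch_wait()) until
               every section that might see the item has exited. */
            use(item);
    }
    epoch_exit(global_epoch);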
Test case and improvements to preemption handling in response to discussion
with mjg@
Reviewed by: imp@, shurd@
Approved by: sbruno@
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
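The idea, sketched; CAP_RIGHTS_INITIALIZER is a hypothetical spelling of
such a preinitialization macro, not necessarily the one added:

    /* Before: the rights word is built at runtime on every call. */
    cap_rights_t rights;
    error = getvnode(td, fd, cap_rights_init(&rights, CAP_FSTAT), &fp);

    /* After: the rights word is computed once, at compile time, and
       the per-call initialization disappears from the hot path. */
    static cap_rights_t fstat_rights = CAP_RIGHTS_INITIALIZER(CAP_FSTAT);
    error = getvnode(td, fd, &fstat_rights, &fp);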
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
Evaluate cpu_stdext_feature early so that the earlier-moved
link_elf_ireloc() sees the correct flags; the most important is SMAP.
Tested by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D15367
KTRPOINT() checks both whether we are tracing _and_ whether we are
recursing within ktrace. The second condition only ever matters if
ktrace is actually enabled. This change moves the recursion check out
of the hot path into the functions themselves.
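Schematically (simplified from the real sys/ktrace.h macros):

    /* Before: every KTRPOINT() site evaluated both conditions. */
    #define KTRPOINT(td, type)                                      \
            (((td)->td_proc->p_traceflag & (1 << (type))) != 0 &&   \
             ((td)->td_pflags & TDP_INKTRACE) == 0)

    /* After: the hot path tests only the trace flag; the ktr_*
       functions themselves bail out when called recursively, a path
       that is only reached once tracing is enabled. */
    #define KTRPOINT(td, type)                                      \
            (((td)->td_proc->p_traceflag & (1 << (type))) != 0)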
Discussed with mjg@
Reported by: mjg@
Approved by: sbruno@
The entire mechanism is rarely used and is not very performant due to
atomic ops on the syscall table. It also adds overhead to completely
unrelated syscalls.
Reduce the cost by avoiding the function calls where possible (which
constitutes the vast majority of cases).
Provides about a 3% syscall-rate speedup for getuid on Broadwell.
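Schematically, the common case becomes a single predicted-untaken
branch (names illustrative):

    if (__predict_false(systrace_enabled))
            systrace_enter(code, args);  /* the function call and the
                                            syscall-table atomics now
                                            run only when in use */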
If you add a child to a device that has quiet children, we'll
automatically set the quiet flag on that child, and on its
children.
This is intended for things like CPUs, where booting prints a large
amount of repetition that adds nothing.
Boost the priority of user-space threads when they set
their affinity to a core to adjust its frequency. This avoids a situation
where a CPU bound kernel thread with the same affinity is running on a
down-clocked core, and will "block" powerd from up-clocking the core
until the kernel thread yields. This can lead to poor performance,
and to things potentially getting stuck on Giant.
Reviewed by: kib (imp reviewed earlier version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15246
The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.
Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.
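A compact model of the pointer-swap scheme; netdump_mbuf_swap() and the
zone variable names here are illustrative:

    /* Regular operation: the zone pointers name the normal UMA zones,
       so there is no runtime overhead and no driver change. */
    uma_zone_t zone_mbuf  = real_zone_mbuf;
    uma_zone_t zone_clust = real_zone_clust;

    static void
    netdump_mbuf_swap(void)
    {
            /* Panic time: repoint the allocator at cache zones backed
               only by the preallocated linked lists, so m_get() and
               friends return preallocated items and never call into
               the page allocator. */
            zone_mbuf  = cache_zone_mbuf;
            zone_clust = cache_zone_clust;
    }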
Reviewed by: cem (earlier version), julian, sbruno
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15251
For most buses all children are equal, so the order does not matter.
Other buses, such as acpi, carefully order their child devices to
express implicit dependencies between them. For such buses it is safer
to bring down devices in the reverse order.
I believe that this is the reason why hpet_suspend had to be disabled.
Some drivers depend on a working event timer until they are suspended.
But previously we would suspend hpet very early.
I tested this change by making hpet_suspend actually stop the HPET
timers, and verified that that works as well.
Note that this change is not a complete solution as it does not take
into account bus passes.
A better approach would be to track the actual attach order of the
devices and to use the reverse of that.
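The core of the change, schematically (error handling abbreviated, and
not the exact committed code):

    static int
    bus_generic_suspend(device_t dev)
    {
            device_t *children;
            int error, i, nchildren;

            if (device_get_children(dev, &children, &nchildren) != 0)
                    return (0);
            /* Walk backwards: children attached later, which may
               depend on earlier siblings (e.g. on an event timer),
               are suspended first. */
            for (i = nchildren - 1; i >= 0; i--) {
                    error = DEVICE_SUSPEND(children[i]);
                    if (error != 0) {
                            /* Undo in forward order before failing. */
                            for (i++; i < nchildren; i++)
                                    DEVICE_RESUME(children[i]);
                            free(children, M_TEMP);
                            return (error);
                    }
            }
            free(children, M_TEMP);
            return (0);
    }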
Reviewed by: imp, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15291
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to come from a static list of known filesystems,
with predefined permission bits.
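What a filesystem now has to do, in full (foofs is a placeholder name):

    static struct vfsops foofs_vfsops = {
            /* ... mount, unmount, root, statfs, ... */
    };

    /* VFCF_JAIL is the only jail-specific change needed; the new
       prison_add_vfs generates the allow.mount.foofs jail parameter
       and the security.jail.mount_foofs_allowed sysctl from it. */
    VFS_SET(foofs_vfsops, foofs, VFCF_JAIL);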
Reviewed by: kib
Differential Revision: D14681
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Introduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.
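The shape of the new accessor, schematically:

    static inline uint8_t
    vm_page_queue(vm_page_t m)
    {
            vm_page_assert_locked(m);

            /* A logically dequeued page's "queue" field is stale and
               must not be trusted, even with the page lock held. */
            if ((m->aflags & PGA_DEQUEUE) != 0)
                    return (PQ_NONE);
            return (m->queue);
    }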
Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280
x dup_before
+ dup_after
+------------------------------------------------------------+
| x + |
|x x x x ++ ++|
| |____AM___| |AM||
+------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08 341205.71
+ 5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08 96232.829
Difference at 95.0% confidence
3.08952e+06 +/- 365604
2.03266% +/- 0.245071%
(Student's t, pooled s = 250681)
Reported by: mjg@
MFC after: 1 week
Required MD bits are only provided for x86.
Reviewed by: jhb (previous version, as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13838
Multicast incorrectly calls into drivers with a mutex held, causing
drivers to have to go through all manner of contortions to use a
non-sleepable lock.
Serialize multicast updates instead.
Submitted by: mmacy <mmacy@mattmacy.io>
Reviewed by: shurd, sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14969
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.
MFC after: 1 week
Previously it was possible to connect a socket (which had the
CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in
capabilities mode. This combination should be treated the same as a call
to connect (i.e. forbidden in capabilities mode). Similarly for bindat.
Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the
documentation and add tests.
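The check amounts to (schematic; bindat gets the same treatment):

    int
    sys_connectat(struct thread *td, struct connectat_args *uap)
    {
            /* AT_FDCWD names the current directory, which is exactly
               the ambient authority that capability mode forbids. */
            if (IN_CAPABILITY_MODE(td) && uap->fd == AT_FDCWD)
                    return (ECAPMODE);
            /* ... */
    }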
PR: 222632
Submitted by: Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by: Domagoj Stolfa
MFC after: 1 week
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D15221
While here, predict false.
Note that the code is wrong regardless of this change: dereferencing
the pointer can race with module unload. A fix would set the probe to
a nop stub instead of NULL.
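A sketch of that suggested fix (illustrative names):

    static void
    nop_probe(void *arg __unused)
    {
    }

    /* Unload path: swap in the nop stub rather than NULL, so a racing
       caller always has valid code to jump to. */
    probe_func = nop_probe;

    /* Call site: no NULL check needed, and no window in which the
       pointer is invalid. */
    probe_func(arg);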
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.
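A model of the batching; pq_apply() and the sizes are illustrative (the
kernel encodes the requested operation in the page's aflags and drains
the batch under a single queue-lock acquisition):

    #define PQ_BATCH_SIZE   8

    struct pq_batch {
            vm_page_t pages[PQ_BATCH_SIZE];
            int       count;
    };      /* one instance per CPU, touched only by its own CPU */

    static void
    pq_defer(struct vm_pagequeue *pq, struct pq_batch *pb, vm_page_t m,
        uint8_t op)
    {
            /* Record the request with only the page lock held. */
            vm_page_aflag_set(m, op);
            pb->pages[pb->count++] = m;
            if (pb->count < PQ_BATCH_SIZE)
                    return;
            /* One lock round-trip applies the whole batch. */
            vm_pagequeue_lock(pq);
            while (pb->count > 0)
                    pq_apply(pq, pb->pages[--pb->count]);
            vm_pagequeue_unlock(pq);
    }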
Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893