The first call is used to gauge how much spaces is needed. Just computing
the size instead of generating the output allows to not take the proctree
lock.
use it to regulate page daemon output.
This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.
Reviewed by: bsdimp
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14402
search for a thread to steal inside a critical section. Since this
allows the search to be preempted, restart the search if preemption
happens since the search results found earlier may no longer be
valid.
Decrease the latency of starting a thread that may be assigned to
this CPU during the search by polling for incoming threads during
the search and switching to that thread instead of continuing the
search.
Test for stale search results and restart the search before going
through the expense of calling tdq_lock_pair(). Retry some tests
after grabbing the locks since things may have changed while waiting
to get both locks.
Eliminate special case handling for stealing from an SMT peer that
uses 1 as the steal threshold. This can only succeed if a thread
has been assigned but our SMT peer has not yet started executing
it. This is quite rare and when it happens the other SMT thread
is generally waiting for the same tdq lock that we hold. Basically
both SMT threads are racing to grab the same spin lock.
Add the kern.sched.always_steal knob from a ULE patch by jeff@.
Incorporate another idea from Jeff's ULE patch. If the sched_switch()
detects that the CPU is about to go idle, try to steal a thread
before switching to the idle thread. Since the search for a thread
to steal has to be done inside a critical section in this context,
limit the impact on latency by adding the knob kern.sched.trysteal_limit
to limit the topological distance of the search and don't restart
the search if we detect stale results. If this search can't find
an stealable thread, the idle loop can do a more complete search.
Also poll for threads being assigned to this CPU during the search
and switch to them instead of continuing the search. This change
is responsibile for the majority of the improvement in parallel
buildworld times.
In sched_balance_group() change the minimum threshold from stealing
a thread from 1 to 2. Poaching a newly assigned thread from a CPU
that is waking up hasn't yet switched to that thread from idle is
likely very rare and is likely to have the same lock race as is
seen when stealing threads in the idle loop. Also use tdq_notify()
to kick the destintation CPU instead of always sending an IPI.
Update a stale comment, the number of transferable threads is not
calculated.
Reviewed by: kib (earlier version)
Comments by: avg, jeff, mav
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12130
Remove the unused syscall_(de)register() functions in favor of the
better documented and easier to use syscall_helper_(un)register(9)
functions.
The default and freebsd32 versions differed in which array of struct
sysents they used and a few missing updates to the 32-bit code as
features were added to the main code.
Reviewed by: cem
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14337
There is a proctree -> allproc ordering established.
Most of the time it is either xlock -> xlock or slock -> slock.
On fork however there is a slock -> xlock pair which results in
pathological wait times due to threads keeping proctree held for
reading and all waiting on allproc. Switch this to xlock -> xlock.
Longer term fix would get rid of proctree in this place to begin with.
Right now it is necessary to walk the session/process group lists to
determine which id is free. The walk can be avoided e.g. with bitmaps.
The exit path used to have one place which dealt with allproc and
then with proctree. Move the allproc acquire into the section protected
by proctree. This reduces contention against threads waiting on proctree
in the fork codepath - the fork proctree holder does not have to wait
for allproc as often.
Finally, move tidhash manipulation outside of the area protected by
either of these locks. The removal from the hash was already unprotected.
There is no legitimate reason to look up thread ids for a process still
under construction.
This results in about 50% wait time reduction during -j 128 package build.
Provide multiple clean queues partitioned into 'domains'. Each domain manages
its own bufspace and has its own bufspace daemon. Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.
Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.
Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.
Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.
Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14274
The race manifested itself mostly in terms of crashes with "spin lock
held too long".
Relevant parts of respective code paths:
exit: reap:
PROC_LOCK(p);
PROC_SLOCK(p);
p->p_state == PRS_ZOMBIE
PROC_UNLOCK(p);
PROC_LOCK(p);
/* exit work */
if (p->p_state == PRS_ZOMBIE) /* true */
proc_reap()
free proc
/* more exit work */
PROC_SUNLOCK(p);
Thus a still exiting process is reaped.
Prior to the change the zombie check was followed by slock/sunlock trip
which prevented the problem.
Even code prior to this commit has a bug: the proc is still accessed for
statistic collection purposes. However, the severity is rather small and
the bug may be fixed in a future commit.
Reported by: many
Tested by: allanjude
The primitive can be used to wait for the lock to be released. Intended
usage is for locks in structures which are about to be freed.
The benefit is the avoided interrupt enable/disable trip + atomic op to
grab the lock and shorter wait if the lock is held (since there is no
worry someone will contend on the lock, re-reads can be more aggressive).
Briefly discussed with: kib
The suspension counter needs synchronisation through slock, but we don't
need it to check if inspecting the counter is necessary to begin with.
In the common case it is not, thus avoid the lock if possible.
Reviewed by: kib
Tested by: pho
The description of kern.ipc.shmsegs was wrong since 2005. I updated the
others (which were more correct) to match.
PR: 225933
Reviewed by: cem
MFC after: 3 days
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14391
With the option used to compile the kernel both sx and rw shared ops would
always go to the slow path which added avoidable overhead even when the
facility is disabled.
Furthermore the increased time spent doing uncontested shared lock acquire
would be bogusly added to total wait time, somewhat skewing the results.
Restore old behaviour of going there only when profiling is enabled.
This change is a no-op for kernels without LOCK_PROFILING (which is the
default).
Add const to new kern_ functions and push down as required.
Reviewed by: rwatson
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14174
sbt is the time in the future that the tsleep_sbt() is expected to be completed
at. sbtt is the current time. Depending on the precision with sysctl
kern.timecounter.alloweddeviation the start time may be incremented by
tc_tick_sbt. The same increment is needed for the current time of sbtt before
calculating the difference. The impact of missing this increment is that rmtp
may increase by one tc_tick_sbt on every early [EINTR] return. If the same
struct is passed in for rqtp as rmtp this can result in rqtp effectively
incrementing by tc_tick_sbt and sleeping longer than originally intended.
This problem was introduced in r247797.
Reviewed by: kib, markj, vangyzen (all on an older version of the test)
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D14362
This works similarly to the existing gzip compression support, but
zstd is typically faster and gives better compression ratios.
Support for this functionality must be configured by adding ZSTDIO to
one's kernel configuration file. dumpon(8)'s new -Z option is used to
configure zstd compression for kernel dumps. savecore(8) now recognizes
and saves zstd-compressed kernel dumps with a .zst extension.
Submitted by: cem (original version)
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13101,
https://reviews.freebsd.org/D13633
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
read or write of all registered realtime clocks. In the read case, the
values read are simply discarded. For writes, there's no alternative but
to actually write the current system time to the device.
to make the order of operations clearer to avoid the race condition
that was fixed in r328914. In particular, this commit corrects a
similar race that existed in the soft updates callback.
Doing some sleuthing through the SVN repository, it appears that
bufdone_finish() was added to support XFS:
------------------------------------------------------------------------
r153192 | rodrigc | 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) | 13 lines
Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
- 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
- b_pin_count, count of pinned buffer
- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions
Patches provided by: kan
Reviewed by: phk
Silence on: arch@
------------------------------------------------------------------------
It does not appear to ever have been used for anything else. XFS was
disconnected in r241607:
------------------------------------------------------------------------
r241607 | attilio | 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) | 5 lines
Disconnect non-MPSAFE XFS from the build in preparation for dropping
GIANT from VFS.
This is not targeted for MFC.
------------------------------------------------------------------------
and removed entirely in r247631:
------------------------------------------------------------------------
r247631 | attilio | 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) | 5 lines
Garbage collect XFS bits which are now already completely disconnected
from the tree since few months.
This is not targeted for MFC.
------------------------------------------------------------------------
Since XFS support is gone, there is no reason to retain biodone_finish().
Suggested by: Warner Losh (imp)
Discussed with: cem, kib
Tested by: Peter Holm (pho)
size of UMA zone allocation is greater than page size. In this case zone
of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch
off of this zone from startup_alloc() until full launch of VM.
o Always supply number of VM zones to uma_startup_count(). On machines
with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
a page. In the latter case account VM zones for number of allocations
from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch off from
itself any zone that is already capable of running real alloc.
In worst case scenario we may leak a single page here. See comment
in uma_startup_count().
o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some
extra SYSINITs, e.g. vm_page_init() may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
pages calculation, we are guaranteed to use all of the boot_pages
in the early single threaded stage.
Reported & tested by: mav
While the bug itself was serious, as we could either pass a non-busied
page to vm_pager_get_pages() or leak a busy page, it could only be
triggered under a very rare condition where the page is already inserted
into the object, but it is not valid yet.
Reviewed by: kib
MFC after: 2 weeks
o Most of startup zones have struct uma_slab embedded into the slab,
so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE,
when calculating how many pages would certain kind of allocations
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating amount of pages for kegs. [1]
o The zones of zones and zones of kegs have arbitrary alignment of 32,
and this also needs to be accounted for. [2]
While here, spread more comments and improve diagnostic messages.
Reported by: pho [1], jtl [2]
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.
The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.
The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.
Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include 8 eight VM startup pages into uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for extra two allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT
allowed several other SYSINITs to sneak in before it, thus bumping
requirement for amount of boot pages.
for UMA startup.
o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce number of
boot pages required. What is more important number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
startup_alloc(), however that may change, so vm_page_startup() provides
its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
functions calculates sizes of zones zone and kegs zone, and calculates how
many pages UMA will need to bootstrap.
It counts not only of zone structures, but also of kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
but also as args.size.
Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054