Maintain a count of free slabs in the per-domain keg structure and use
that to clear the free slab list in constant time for most cases. This
helps minimize lock contention induced by reclamation, in preparation
for proactive trimming of excess free memory.
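For illustration, a minimal userland sketch of the bookkeeping (the
identifiers keg_domain, kd_nfree, and keg_domain_reclaim_all are
hypothetical, not the kernel's): with a count of free slabs per domain,
reclaiming the whole free list is a constant-time detach rather than a walk.

#include <sys/queue.h>

struct slab {
    LIST_ENTRY(slab) s_link;
};

struct keg_domain {
    LIST_HEAD(, slab) kd_free_slabs;    /* per-domain free slab list */
    unsigned kd_nfree;                  /* count of slabs on it */
};

/* Detach every free slab at once; the caller frees them unlocked. */
static unsigned
keg_domain_reclaim_all(struct keg_domain *kd, struct slab **out)
{
    unsigned n = kd->kd_nfree;

    *out = LIST_FIRST(&kd->kd_free_slabs);
    LIST_INIT(&kd->kd_free_slabs);
    kd->kd_nfree = 0;
    return (n);
}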
Reviewed by: jeff, rlibby
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23532
Currently, the vm.panic_on_oom sysctl is a boolean which controls the
behavior of the VM system when it encounters an out-of-memory situation.
If set to 0, the VM system kills the largest process. If set to any other
value, the VM system will initiate a panic.
This change makes the sysctl a count of events. If set to 0, the VM system
kills the largest process on every out-of-memory event. If set to any other
value, the VM system kills the largest process on each event until it has
seen the specified number of them, at which point it initiates a panic.
This change is helpful in capturing cores when the system is in a perpetual
cycle of out-of-memory events (as opposed to just hitting one or two
sporadic out-of-memory events).
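A hedged sketch of the new semantics (simplified, not the kernel code):

#include <stdio.h>

static int vm_panic_on_oom = 3;    /* the sysctl: 0 means never panic */
static int oom_events;

static void
vm_oom_event(void)
{
    oom_events++;
    if (vm_panic_on_oom != 0 && oom_events >= vm_panic_on_oom) {
        printf("panic: OOM event %d\n", oom_events);
        return;
    }
    printf("kill largest process, OOM event %d\n", oom_events);
}

int
main(void)
{
    for (int i = 0; i < 3; i++)
        vm_oom_event();    /* kills twice, then panics */
    return (0);
}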
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23601
UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but
with a more confusing name. Remove the flag, make UMA_ZONE_VM an
inherit flag, and replace all references.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23516
Add a switch to allow disabling multipage slabs, in order to facilitate
measuring memory usage and performance effects. The tunable
vm.debug.uma_multipage_slabs defaults to 1 and can be set to 0 to
disable. The name may change soon.
Reviewed by: markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23487
Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1
page size + epsilon). To achieve a minimum level of memory efficiency,
select a slab size with a potentially larger number of pages if it
yields a smaller fraction of waste.
This may mean using page_alloc instead of uma_small_alloc, which could
be more costly.
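The arithmetic, with hypothetical numbers (4 KB pages, an item of half a
page plus 16 bytes): a one-page slab fits one item and wastes ~50%, while
a two-page slab fits three items and wastes ~24%.

#include <stdio.h>

#define PAGE_SIZE 4096

int
main(void)
{
    size_t item = PAGE_SIZE / 2 + 16;    /* awkward size: 2064 bytes */

    for (unsigned pages = 1; pages <= 4; pages++) {
        size_t slab = pages * PAGE_SIZE;
        size_t ipers = slab / item;      /* items per slab */
        double waste = (double)(slab - ipers * item) / slab;

        printf("%u page(s): %zu items, %4.1f%% waste\n",
            pages, ipers, 100.0 * waste);
    }
    return (0);
}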
Discussed with: jeff, mckusick
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23239
After r357392, it is apparent that we do have some early-boot PCPU
zones. Make it so we can safely free pages from them if they are
actually used during early boot.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23496
potential bugs that access freed pages as well as providing a path
towards lockless page lookup.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23444
system. Small bucket sizes already pack well even if they are an odd
number of words. This prevents any potential new instances of the
problem fixed in r357463 as well as making the system easier to
understand.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23494
objects backing tmpfs vnodes data.
The clean scan is limited to only removing write permissions from the
mapped pages of the objects. This fixes the issue that tmpfs vnode
mtime is not updated by writes to the mmapped area after the initial
page-in.
Noted by: mjg
Reviewed by: markj
Discussed with: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23432
and this is more space efficient.
Stop queueing recently used buckets to the head of the list. If the bucket
moves to a different processor, the cache coherency traffic makes reuse
more expensive. We already try to encourage cache-hot behavior in the
per-cpu layer.
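The discipline change, as a sketch over sys/queue.h (the types are
stand-ins, not UMA's):

#include <sys/queue.h>

struct bucket {
    TAILQ_ENTRY(bucket) b_link;
};
TAILQ_HEAD(bucketlist, bucket);

/*
 * Before: TAILQ_INSERT_HEAD() handed the next allocator the most
 * recently freed bucket, which is only a win if it stays on the same
 * CPU.  After: queue to the tail and let buckets cool off.
 */
static void
zone_put_bucket(struct bucketlist *list, struct bucket *b)
{
    TAILQ_INSERT_TAIL(list, b, b_link);
}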
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23493
With r357314, sizeof(struct uma_bucket) grew to 16 bytes on 32-bit
platforms, so BUCKET_SIZE(4) is 0. This resulted in the creation of a
bucket zone for buckets with zero capacity. A more general fix is
planned, but for now this bandaid allows 32-bit platforms to boot again.
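The arithmetic behind the zero (the macro below mirrors the shape of the
real one in vm/uma_core.c but is not a verbatim copy, and the header
struct is a stand-in): capacity is whatever remains after the bucket
header is carved out of the allocation.

#include <stdio.h>

struct bucket_hdr {      /* stand-in: 16 bytes on ILP32 after r357314 */
    char pad[16];
};

#define BUCKET_SIZE(n)                                          \
    (((sizeof(void *) * (n)) - sizeof(struct bucket_hdr)) /     \
        sizeof(void *))

int
main(void)
{
    /* On ILP32: 4 slots * 4 bytes - 16-byte header = 0 capacity. */
    printf("BUCKET_SIZE(4) = %zu\n", BUCKET_SIZE(4));
    return (0);
}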
PR: 243837
Discussed with: jeff
Reported by: pho, Jenkins via lwhsu
Tested by: pho
Sponsored by: The FreeBSD Foundation
Update vm_page_scan_contig() and vm_page_reclaim_run() to stop using
vm_page_change_lock(). It has no use after r356157. Remove
vm_page_change_lock() now that it has no users.
Remove an unnecessary check for wirings in vm_page_scan_contig(), which
previously checked twice. The check is racy until
vm_page_reclaim_run() ensures that the page is unmapped, so one check is
sufficient.
Reviewed by: jeff, kib (previous versions)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23279
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. It has 3x the performance of epoch in a write-heavy
workload with less than half of the read-side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.
This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.
Both markj and cperciva tested on arm64 at large core counts to verify
fences on weakly ordered architectures. I will commit a stress-testing
tool in a follow-up.
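A greatly simplified, illustrative sketch of the sequence idea (this is
not the committed smr(9) code): writers stamp freed items with a global
write sequence, readers publish the sequence they entered at, and an item
may be reused once every active reader has passed its stamp.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NREADERS     4
#define SEQ_INACTIVE 0

static _Atomic uint64_t wr_seq = 1;           /* global write sequence */
static _Atomic uint64_t rd_seq[NREADERS];     /* 0 = reader inactive */

static void
smr_enter(int cpu)
{
    /* The seq_cst store doubles as the fence readers need. */
    atomic_store_explicit(&rd_seq[cpu],
        atomic_load_explicit(&wr_seq, memory_order_acquire),
        memory_order_seq_cst);
}

static void
smr_exit(int cpu)
{
    atomic_store_explicit(&rd_seq[cpu], SEQ_INACTIVE,
        memory_order_release);
}

/* At free time: bump the sequence and stamp the item with the result. */
static uint64_t
smr_advance(void)
{
    return (atomic_fetch_add(&wr_seq, 1) + 1);
}

/* May an item stamped 'goal' be reused yet? */
static bool
smr_poll(uint64_t goal)
{
    for (int i = 0; i < NREADERS; i++) {
        uint64_t s = atomic_load_explicit(&rd_seq[i],
            memory_order_acquire);
        if (s != SEQ_INACTIVE && s < goal)
            return (false);
    }
    return (true);
}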
Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586
Right now, OOM is initiated unconditionally on page allocation
failure, after the wait.
Reported by: Mark Millard <marklmi@yahoo.com>
Reviewed by: cy, markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D23409
Both vm_object_scan_all_shadowed() and vm_object_collapse_scan() might
observe an invalid page left in the default backing object by the
fault handler that retried. Check for the condition and refuse to collapse.
Reported and tested by: pho
Reviewed by: jeff
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D23331
an inline function vm_map_lookup_clip_start that invokes them both and
use it in places that invoke both. Drop a couple of local variables
made unnecessary by this function.
Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22987
A submap can only be created from an entry spanning the entire request
range. In particular, the request fails if vm_map_lookup_entry() returns
false or the returned entry does not extend through "end".
Since the only use of submaps in FreeBSD is for the static pipe and
execve argument KVA maps, this has no functional effect.
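Distilled to a range check (a paraphrase with toy types, not the kernel
code):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t vm_offset_t;

struct entry {
    vm_offset_t start, end;    /* half-open: [start, end) */
};

/* A submap request for [start, end) succeeds only on a spanning entry. */
static bool
entry_spans(const struct entry *e, bool lookup_ok,
    vm_offset_t start, vm_offset_t end)
{
    return (lookup_ok && e->start <= start && end <= e->end);
}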
Github PR: https://github.com/freebsd/freebsd/pull/420
Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> (original)
Reviewed by: dougm, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23299
Add a new VM return code, KERN_RESTART, which means: deallocate and
restart the fault.
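How a caller might consume the new code (illustrative only: the numeric
value, the stub functions, and the loop are hypothetical):

#include <stdio.h>

#define KERN_SUCCESS 0
#define KERN_RESTART 100    /* hypothetical value */

static int
fault_attempt(void)
{
    static int tries;

    return (tries++ == 0 ? KERN_RESTART : KERN_SUCCESS);
}

static void
fault_deallocate(void)
{
    /* Drop pages, locks, and references before retrying. */
}

int
main(void)
{
    int rv;

    for (;;) {
        rv = fault_attempt();
        if (rv != KERN_RESTART)
            break;
        fault_deallocate();    /* KERN_RESTART: deallocate, retry */
    }
    printf("fault returned %d\n", rv);
    return (0);
}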
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23303
This additionally fixes a potential bug/pessimization where we could fail to
reload the original fault_type on restart.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23301
UMA zone structures have two arrays at the end which are sized according
to the machine: an array of CPU count length, and an array of NUMA
domain count length. The CPU counting was wrong in the case where some
CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the
second array to be overlaid with the first.
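A userland illustration of the bug class (mp_ncpus and mp_maxid mirror
the kernel names, the struct is a stand-in): with sparse CPU IDs, a
per-CPU array must be sized by the highest CPU ID plus one, not by the
number of online CPUs, or whatever follows it is overlaid.

#include <stdio.h>

struct pcpu_cache { long count; };

int
main(void)
{
    int mp_ncpus = 3;    /* CPUs online */
    int mp_maxid = 3;    /* highest CPU ID: one CPU is disabled */

    /* Wrong: CPU ID 3 indexes past a 3-element array. */
    size_t wrong = mp_ncpus * sizeof(struct pcpu_cache);
    /* Right: size the array for every possible CPU ID. */
    size_t right = (mp_maxid + 1) * sizeof(struct pcpu_cache);

    printf("wrong=%zu right=%zu\n", wrong, right);
    return (0);
}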
Reported by: olivier
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23318
Previously UMA had some false negatives in the leak report at keg
destruction time, where it only reported leaks if there were free items
in the slab layer (rather than allocated items), which notably would not
be true for single-item slabs (large items). Now, report a leak if
there are any allocated pages, and calculate and report the number of
allocated items rather than free items.
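The arithmetic, with stand-in names and values: a keg of single-item
slabs can leak pages while holding zero free items, so count from pages
instead.

#include <stdio.h>

int
main(void)
{
    unsigned pages = 4;    /* pages still charged to the keg */
    unsigned ppera = 1;    /* pages per slab allocation */
    unsigned ipers = 1;    /* items per slab: single-item slabs */
    unsigned nfree = 0;    /* free items: the old, misleading signal */

    if (pages != 0)        /* new check: any pages at all */
        printf("leaked %u item(s)\n",
            (pages / ppera) * ipers - nfree);
    return (0);
}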
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23275
no longer need an object lock. This reduces the longest hold times and
eliminates some trylock code blocks.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23034
The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here, add some missing
paging-in-progress calls and an assert. The object handle is now protected
explicitly with pip.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033
paging.
Shadow objects are marked with a COLLAPSING flag while they are collapsing with
their backing object. This gives us an explicit test rather than overloading
paging-in-progress. While a split is ongoing, we mark the object with SPLIT.
These two operations modify the swap tree, so they must be serialized;
swap_pager_getpages() can now detect these conditions directly and page
more conservatively.
Callers of vm_object_collapse() will now reliably wait for a collapse to
finish, so that the backing chain is as short as possible before other
decisions are made that may inflate the object chain, such as split or
coalesce.
It is now safe to run fault concurrently with collapse. It is safe to
increase or decrease paging-in-progress without the lock, so long as
another valid reference is held across an increase.
This change makes collapse more reliable as a secondary benefit. The primary
benefit is making it safe to drop the object lock much earlier in fault or
never acquire it at all.
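The serialization, reduced to its shape (the flag names follow the
description above; everything else is illustrative, not the kernel
source):

#define OBJ_COLLAPSING 0x1    /* collapsing into the backing object */
#define OBJ_SPLIT      0x2    /* being split */

struct obj {
    unsigned flags;
};

/* swap_pager_getpages() pages conservatively while either is set. */
static int
obj_swap_busy(const struct obj *o)
{
    return ((o->flags & (OBJ_COLLAPSING | OBJ_SPLIT)) != 0);
}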
This was tested with a new shadow chain test script that uncovered
long-standing bugs; it will be integrated with stress2.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22908
Some systems, such as higher-end Threadripper, may have NUMA domains
with no physical memory. Don't allocate from these domains.
This fixes a "panic: vm_wait in early boot" on my 2990WX desktop.
Reviewed by: jeff
Sponsored by: Netflix
page that was previously mapped read-only, it exists in pmap until pmap_enter()
returns. However, we held no reference to the original page after the copy
was complete. This allowed vm_object_scan_all_shadowed() to collapse an
object that still had pages mapped. To resolve this, add another page pointer
to the faultstate so we can keep the page xbusy until we're done with
pmap_enter(). Handle busy pages in scan_all_shadowed. This is already done
in vm_object_collapse_scan().
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23155
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102
r355004 removed the return statement from this loop with the intention of
also calling uma_reclaim_wakeup(). But in the case of vm.lowmem_period=0
this causes an infinite loop.
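The distilled shape of the bug (not the kernel source; one plausible fix
shown): with the early return gone, a zero period makes the sleep a no-op
and the loop spins.

#include <unistd.h>

static unsigned lowmem_period = 0;    /* vm.lowmem_period=0 */

static void
lowmem_loop(void)
{
    for (;;) {
        /* ... run lowmem handlers, uma_reclaim_wakeup() ... */
        if (lowmem_period == 0)
            return;        /* without this, sleep(0) spins */
        sleep(lowmem_period);
    }
}

int
main(void)
{
    lowmem_loop();
    return (0);
}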
Reviewed by: markj
Sponsored by: iXsystems, Inc.
By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).
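The arithmetic behind 512 per slab, assuming 4 KB pages (the bitmap
placement is simplified; real slab headers also take space): 4096 / 8 =
512 items, whose free bitmap needs 512 bits = 64 bytes.

#include <stdio.h>

int
main(void)
{
    unsigned page = 4096, item = 8;
    unsigned ipers = page / item;    /* 512 items per slab */
    unsigned bitset = ipers / 8;     /* 64-byte free bitmap */

    printf("%u items/slab, %u-byte free bitmap\n", ipers, bitset);
    return (0);
}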
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149
respectively. The tunable controls the size of the per-CPU vm page cache.
Previously the value was split between all CPUs in the system, so
configuring the same value on machines with different CPU counts yielded
a different cache size available to each CPU.
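The effect in numbers (the tunable's name is elided above, and the value
here is hypothetical): dividing one global value across CPUs gives a
different per-CPU cache on every machine; interpreting it per-CPU does
not.

#include <stdio.h>

int
main(void)
{
    unsigned tunable = 8192;    /* pages, hypothetical value */

    for (unsigned ncpu = 4; ncpu <= 64; ncpu *= 4)
        printf("%2u CPUs: old %4u pages/CPU, new %u pages/CPU\n",
            ncpu, tunable / ncpu, tunable);
    return (0);
}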
Reviewed by: markj
Obtained from: Netflix