Suppose a thread is running on a CPU in a NUMA domain with no physical
RAM. When an item is freed to a first-touch zone, it ends up in the
cross-domain bucket. When the bucket is full, it gets placed in another
domain's bucket queue. However, when allocating an item, UMA will
always go to the keg upon a per-CPU cache miss because the empty
domain's bucket queue will always be empty. This means that a non-empty
domain's bucket queues can grow very rapidly on such systems. For
example, it can easily cause mbuf allocation failures when the zone
limit is reached.
Change cache_alloc() to follow a round-robin policy when running on an
empty domain.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25355
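As a rough illustration of the fallback described above (the helper name and
the static counter are stand-ins for the real per-CPU cache_alloc() logic,
with VM_DOMAIN_EMPTY() taken from vm_pagequeue.h):

    /*
     * Sketch only: on a memoryless domain, rotate through all domains'
     * bucket queues instead of always missing to the keg.
     */
    static int
    cache_pick_domain(int local)
    {
            static u_int rr;                /* per-CPU state in the real code */

            if (!VM_DOMAIN_EMPTY(local))
                    return (local);         /* first-touch as usual */
            return (rr++ % vm_ndomains);    /* empty domain: round-robin */
    }
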
- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
ensure that segments registered during hammer_time() are placed in the
right domain. Otherwise, since the SRAT is not parsed at that point,
we just add them to domain 0, which may be incorrect and results in a
domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
If domain 0 is unpopulated, the allocation will simply fail, resulting
in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
affinity table, and change vm_phys_early_alloc() to handle wildcard
domains. This is necessary on amd64, where the page array is dense
and pmap_page_array_startup() may allocate page table pages for
non-existent page frames (a sketch of the wildcard handling follows this
entry).
Reported and tested by: Rafael Kitover <rkitover@gmail.com>
Reviewed by: cem (earlier version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25001
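A hedged sketch of the wildcard handling from the last bullet; early_alloc()
and seg_alloc() are hypothetical helpers, not the actual vm_phys.c code:

    /*
     * A request whose domain is -1 (the address is not covered by the
     * SRAT) may be satisfied from any domain's early segments.
     */
    vm_paddr_t seg_alloc(int domain, size_t size);  /* hypothetical */

    vm_paddr_t
    early_alloc(int domain, size_t size)
    {
            vm_paddr_t pa;
            int d;

            for (d = 0; d < vm_ndomains; d++) {
                    if (domain != -1 && d != domain)
                            continue;       /* honor an explicit domain */
                    if ((pa = seg_alloc(d, size)) != 0)
                            return (pa);
            }
            return (0);
    }
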
Otherwise anything counted before SI_SUB_VM_CONF is discarded. However,
it is useful to be able to see stats from allocations done early during
boot.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24756
This makes it easier to write libkvm programs that access UMA data
structures.
- Remove a couple of unused slab functions and make others local to
uma_core.c. Similarly move SLAB_BITSETS, which affects the layout of
slab structures, to uma_core.c.
- Stop defining the slab structures under _KERNEL. There's no real
reason they can't be visible to userspace like the rest of UMA's
structures are.
- Group KEG_ASSERT_COLD with other keg macros.
- Convert an assertion about MAXMEMDOM to use _Static_assert (illustrated
  after this entry).
No functional change intended.
Discussed with: jeff
Reviewed by: rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23980
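The _Static_assert bullet boils down to the following pattern; the condition
shown here is illustrative, not the actual assertion:

    /* Old style, via <sys/systm.h>: */
    CTASSERT(MAXMEMDOM >= 1);

    /* New style, standard C11, with a diagnostic message: */
    _Static_assert(MAXMEMDOM >= 1, "MAXMEMDOM must be at least 1");
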
Swap buckets on free as well as alloc so that alloc is always the most
cache-hot data.
When selecting a zone domain for the round-robin bucket cache use the
local domain unless there is a severe imbalance. This does not affinitize
memory, only locks and queues.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23824
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
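For orientation, the annotation pattern looks roughly like this; the node,
handler, and description names are made up for the example:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    /* A reviewed, known-safe node is explicitly marked MPSAFE. */
    SYSCTL_NODE(_vm, OID_AUTO, example, CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
        "Example subtree");

    /* An unreviewed handler is conservatively marked NEEDGIANT until audited. */
    static int
    sysctl_example_count(SYSCTL_HANDLER_ARGS)
    {
            int val = 0;

            return (sysctl_handle_int(oidp, &val, 0, req));
    }
    SYSCTL_PROC(_vm_example, OID_AUTO, count,
        CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_NEEDGIANT, NULL, 0,
        sysctl_example_count, "I", "Example counter");
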
This enables very cheap read sections with free-to-use latencies and memory
overhead similar to epoch. On a recent AMD platform a read section cost
1ns vs 5ns for the default SMR. On Xeon the numbers should be more like
1ns vs 11ns. The memory consumption should be proportional to the product
of the free rate and 2*1/hz while normal SMR consumption is proportional
to the product of free rate and maximum read section time.
While here refactor the code to make future additions more
straightforward.
Name the overall technique Global Unbounded Sequences (GUS) and adjust some
comments accordingly. This helps distinguish discussions of the general
technique (SMR) vs this specific implementation (GUS).
Discussed with: rlibby, markj
This gives much better concurrency when there are a large number of
cores per-domain and multiple domains. Avoid taking the lock entirely
if it will not be productive. ROUNDROBIN domains will have mixed
memory in each domain and will load balance to all domains.
While here refactor the zone/domain separation and bucket limits to
simplify callers.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23673
Maintain a count of free slabs in the per-domain keg structure and use
that to clear the free slab list in constant time for most cases. This
helps minimize lock contention induced by reclamation, in preparation
for proactive trimming of excesses of free memory.
Reviewed by: jeff, rlibby
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23532
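A self-contained sketch of the constant-time path; the types and field names
are illustrative stand-ins for the keg's per-domain free-slab state:

    #include <sys/types.h>
    #include <sys/queue.h>

    struct slab {
            LIST_ENTRY(slab) link;
    };
    LIST_HEAD(slabhead, slab);

    /*
     * When the free-slab counter covers the whole list, steal the list
     * wholesale in O(1) instead of unlinking slabs one by one under the
     * keg lock.
     */
    static void
    reclaim_all_free_slabs(struct slabhead *freelist, u_int *nfree,
        struct slabhead *out)
    {
            LIST_INIT(out);
            LIST_SWAP(out, freelist, slab, link);
            *nfree = 0;
    }
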
UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but
with a more confusing name. Remove the flag, make UMA_ZONE_VM an
inherit flag, and replace all references.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23516
Add a switch to allow disabling multipage slabs, in order to facilitate
measuring memory usage and performance effects. The tunable
vm.debug.uma_multipage_slabs defaults to 1 and can be set to 0 to
disable. The name may change soon.
Reviewed by: markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23487
Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1
page size + epsilon). In order to achieve a minimum memory efficiency,
select a slab size with a potentially larger number of pages if it
yields a smaller proportion of waste.
This may mean using page_alloc instead of uma_small_alloc, which could
be more costly.
Discussed with: jeff, mckusick
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23239
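A simplified model of the selection (integer percentages, no slab-header or
alignment accounting; pick_slab_pages() is hypothetical):

    /*
     * Grow the slab, page by page, until the leftover space falls below
     * the target waste percentage.
     */
    static u_int
    pick_slab_pages(size_t item_size, u_int max_pages, u_int max_waste_pct)
    {
            size_t slab, used;
            u_int pages;

            for (pages = 1; pages < max_pages; pages++) {
                    slab = (size_t)pages * PAGE_SIZE;
                    used = rounddown(slab, item_size);
                    if ((slab - used) * 100 / slab <= max_waste_pct)
                            break;
            }
            return (pages);
    }
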
After r357392, it is apparent that we do have some early-boot PCPU
zones. Make it so we can safely free pages from them if they are
actually used during early boot.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23496
system. Small bucket sizes already pack well even if they are an odd
number of words. This prevents any potential new instances of the
problem fixed in r357463 as well as making the system easier to
understand.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23494
and this is more space efficient.
Stop queueing recently used buckets to the head of the list. If the bucket
goes to a different processor the cache coherency will be more expensive.
We already try to encourage cache-hot behavior in the per-cpu layer.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23493
With r357314, sizeof(struct uma_bucket) grew to 16 bytes on 32-bit
platforms, so BUCKET_SIZE(4) is 0. This resulted in the creation of a
bucket zone for buckets with zero capacity. A more general fix is
planned, but for now this bandaid allows 32-bit platforms to boot again.
PR: 243837
Discussed with: jeff
Reported by: pho, Jenkins via lwhsu
Tested by: pho
Sponsored by: The FreeBSD Foundation
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write-heavy
workload with less than half of the read-side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.
This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.
Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures. I will commit a stress testing
tool in a follow-up.
Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586
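For orientation, a read-side usage sketch in the style of sys/smr.h; the data
structure, lookup helper, and zone-provided smr_t are assumptions made for
illustration, and the code review above has the authoritative details:

    #include <sys/param.h>
    #include <sys/smr.h>

    struct item;
    struct item *table_lookup(uint64_t key);        /* hypothetical lookup */

    struct item *
    item_lookup(smr_t smr, uint64_t key)
    {
            struct item *it;

            smr_enter(smr);         /* begin a cheap read section */
            it = table_lookup(key); /* hypothetical lock-free lookup */
            /* Real code must take a reference on "it" before leaving. */
            smr_exit(smr);          /* end the read section */
            return (it);
    }
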
UMA zone structures have two arrays at the end which are sized according
to the machine: an array of CPU count length, and an array of NUMA
domain count length. The CPU counting was wrong in the case where some
CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the
second array to be overlaid with the first.
Reported by: olivier
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23318
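The fix amounts to sizing the structure by CPU ID rather than CPU count;
roughly (the helper is local to this example):

    /*
     * The per-CPU cache array is indexed by CPU ID, so it needs
     * mp_maxid + 1 slots; mp_ncpus undercounts when CPUs are disabled,
     * letting the domain array that follows overlap the CPU array.
     */
    static size_t
    zone_size(void)
    {
            return (sizeof(struct uma_zone) +
                (mp_maxid + 1) * sizeof(struct uma_cache) +
                vm_ndomains * sizeof(struct uma_zone_domain));
    }
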
Previously UMA had some false negatives in the leak report at keg
destruction time, where it only reported leaks if there were free items
in the slab layer (rather than allocated items), which notably would not
be true for single-item slabs (large items). Now, report a leak if
there are any allocated pages, and calculate and report the number of
allocated items rather than free items.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23275
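In other words, the check now reports items still allocated rather than items
free; a sketch with illustrative parameter names:

    /*
     * pages/ppera slabs exist, each holding ipers items; whatever is not
     * on a free list is still allocated and therefore leaked at keg
     * destruction time.
     */
    static u_long
    keg_allocated_items(u_long pages, u_long ppera, u_long ipers,
        u_long free_items)
    {
            return (pages / ppera * ipers - free_items);
    }
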
Some systems, such as higher-end Threadripper, may have
NUMA domains with no physical memory. Don't allocate
from these domains.
This fixes a "panic: vm_wait in early boot" on my 2990WX desktop.
Reviewed by: jeff
Sponsored by: Netflix
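A hedged sketch of the guard, using the existing VM_DOMAIN_EMPTY() test from
vm_pagequeue.h; the surrounding helper is illustrative:

    /* Pick the first NUMA domain that actually has physical memory. */
    static int
    pick_populated_domain(void)
    {
            int domain;

            for (domain = 0; domain < vm_ndomains; domain++)
                    if (!VM_DOMAIN_EMPTY(domain))
                            return (domain);
            return (0);     /* nothing populated; fall back to domain 0 */
    }
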
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102
By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149
Some kernel subsystems, notably ZFS, will destroy UMA zones from a
shutdown eventhandler. This causes the zone to be drained. For slabs
that are mapped into KVA this can be very expensive and so it needlessly
delays the shutdown process.
Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once
kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
into a no-op, provided that the zone does not have a custom finalization
routine.
PR: 242427
Reviewed by: jeff, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23066
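The effect on uma_zdestroy() is roughly the following (simplified; see the
revision above for the exact condition):

    void
    uma_zdestroy(uma_zone_t zone)
    {

            /*
             * Once shutdown handlers are running, draining the zone only
             * delays reboot; skip unless a custom fini must still run.
             */
            if (booted == BOOT_SHUTDOWN && zone->uz_fini == NULL)
                    return;
            /* ... normal teardown path ... */
    }
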
Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
- using the padding of the final item to store the slab header,
- not going OFFPAGE if we have a choice unless it improves efficiency.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23048
- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.
The net effect of this should be to make the contract with clients more
clear. Clients should choose constraints, UMA will figure out how to
implement them. This also breaks the confusing double meaning of
OFFPAGE.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23016
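From a client's point of view the new contract looks like this; the zone name
and item type are made up:

    /* Hypothetical item type. */
    struct example {
            uint64_t e_key;
    };

    static uma_zone_t example_zone;

    static void
    example_init(void)
    {
            /*
             * The client states a constraint: UMA must not write to free
             * items.  UMA decides how to honor it (e.g. off-page slab
             * headers).
             */
            example_zone = uma_zcreate("example", sizeof(struct example),
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOTOUCH);
    }
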
MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure but the specified alignment did not match
reality and involved redundant defines.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert. Resolve this by rewriting the COLD asserts to
look at the per-CPU allocation counts for evidence of API activity.
Discussed with: rlibby
Reviewed by: markj
Reported by: lwhsu
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN. The system will now select a default depending
on kernel configuration. API users need only specify one if they want to
override the default.
Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.
Reviewed by: markj
Discussed with: rlibby
Differential Revision: https://reviews.freebsd.org/D22831
onto their respective bucket lists. This is a several-orders-of-magnitude
improvement in contention on the keg lock under heavy free traffic while
requiring only an additional bucket's worth of memory per domain.
Discussed with: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22830
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-NUMA and use a single keg
lock to protect the hash table.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828