In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU. This was to mitigate some
issues reported with the system not able to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache. This change re-enables those caches.
The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones. Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested. Remove its return value since it has no use.
Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages. The limit can be overridden by the
vm.pgcache_zone_max tunable as before.
Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.
Reviewed by: gallatin, jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22393
so that we avoid the hashtables. The hashtable is now only required if
a zone is created with OFFPAGE specified initially, not internally. This
flag signals to UMA that it can't touch the allocated memory and so
can't store a slab pointer in the containing page.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22453
The r354367 is reverted since it is subsumed by this, more complete, approach.
Suggested by: markj
Reviewed by: alc. glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22242
Early counter mock can be only used on BSP for amd64, when APs try to
update it that causes random memory corruption.
N.B. This is a temporary patch to plug the corruption for now, while
a proper solution for handling cache zones in zone_foreach() is being
developed.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
The correctness of per-CPU cache accounting in that function is
dependent on reading per-CPU pointers exactly once. Ensure that
the compiler does not emit multiple loads of those pointers.
Reported and tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22081
cy@ points out that I got parameter order backwards between definition and
invocation of the helper function. He is totally correct. The earlier
version of this patch predated the XFree column so this is one I introduced,
rather than the original author.
Submitted by: cy
Reported by: cy
X-MFC-With: r353429
Add /i option for machine-parseable CSV output. This allows ready copy/
pasting into more sophisticated tooling outside of DDB.
Add total zone size ("Memory Use") as a new column for UMA.
For both, sort the displayed list on size (print the largest zones/types
first). This is handy for quickly diagnosing "where has my memory gone?" at
a high level.
Submitted by: Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by: Dell EMC Isilon
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.
With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.
Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.
Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
The kernel thread stack zone performs first-touch allocations by
default, and must handle the case where the local memory domain
is empty. For most UMA zones this is handled in the keg layer,
but cache zones currently must implement a policy for this case.
Simply use a round-robin policy if UMA_ANYDOMAIN is passed.
Reported and tested by: bcran
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
- UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that
was freed on a different domain from where it was allocated. This is
only used for UMA_ZONE_NUMA (first-touch) zones.
- UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
zones. This tries to maintain locality for kernel memory.
Reviewed by: gallatin, alc, kib
Tested by: pho, gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20929
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470
ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables.
With initial hash table size of only 32 elements it takes ~20 expansions
or ~400 seconds to adapt to handling 220GB ZFS ARC. During that time not
only the hash table is highly inefficient, but also each of those expan-
sions takes significant time with the lock held, blocking operation.
On my test system with 256GB of RAM and ZFS pool of 28 HDDs this change
reduces time needed to first time read 240GB from ~300-400s, during which
system is quite busy and unresponsive, to only ~150s with light CPU load
and just 5 sub-second CPU spikes to expand the hash table.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.
While here also unsign uh_hashmask, it makes little sense to keep that
signed.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148
512GB of ZFS ABD ARC means abd_chunk zone of 128M 4KB items. To manage
them UMA tries to allocate 2GB hash table, which size does not fit into
the int variable, causing later allocation failure, which makes ARC shrink
back below the 512GB, not letting it to use more RAM. With this change I
easily reached >700GB ARC size on 768GB RAM machine.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Then bucket_alloc() also selects bucket size based on uz_count. However,
since zone lock is dropped, uz_count may reduce. In this case max may
be greater than ub_entries and that would yield into writing beyond end
of the allocation.
Reported by: pho
The iterator should be reinitialized after every successful slab
allocation. A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]
Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.
Reported and tested by: slavah
Diagnosed by: kib [1]
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
This reduces amount of locking required for these zones.
Also, for cache only zones (UMA_ZFLAG_CACHE) accounting uz_items wasn't
correct at all, since they may allocate items directly from their backing
store and then free them via UMA underflowing uz_items.
Tested by: pho
atomic updates and reduces amount of data protected by zone lock.
During startup point these fields to EARLY_COUNTER. After startup
allocate them for all early zones.
Tested by: pho
two zones sharing a keg may have different limits. Now this is going
to work:
zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);
Kegs no longer have uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to minimum possible CPU bucket cache size.
For small limits bucket cache can also be reconfigured to be smaller.
Counter uz_items is updated whenever items transition from keg to a
bucket cache or directly to a consumer. If zone has uz_maxitems set and
it is reached, then we are going to sleep.
o Since new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago, and never
have had a practical usage. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget(), or whatever and choice should be controlled
by the caller.
o Sleeping code is improved to account number of sleepers and wake them one
by one, to avoid thundering herd problem.
o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
KPI added. Having no bucket cache basically means setting maxcache to 0.
o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.
Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
several pages, but leaves no space for struct uma_slab at the end we
miscalculate number of pages by one. Totally mimic keg_large_init() math
here to cover that problem.
Reported by: gallatin
is calculated to guarantee that struct uma_slab is placed at pointer size
alignment. Calculation of real struct uma_slab size is done in keg_ctor()
and yet again in keg_large_init(), to check if we need an extra page. This
calculation can actually be performed at compile time.
- Add SIZEOF_UMA_SLAB macro to calculate size of struct uma_slab placed at
an end of a page with alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is a not
a functional change.
- Use SIZEOF_UMA_SLAB in UMA_SLAB_SPACE definition and in keg_small_init().
This is a potential bugfix, but in reality I don't think there are any
systems affected, since compiler aligns struct uma_slab anyway.
In particular, track the current size of the cache and maintain an
estimate of its working set size. This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages. As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.
Discussed with: alc, glebius, jeff
Tested by: pho (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D16666
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.
This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.
Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.
Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704
Previously, it used a hand-rolled round-robin iterator. This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.
Discussed with: jeff
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17420
cache, then we put new bucket on generic bucket cache. However, code didn't
honor UMA_ZONE_NOBUCKETCACHE flag, so potentially we could start a cache
on a zone that clearly forbids that. Fix this.
Reviewed by: markj
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.
Add support to FreeBSD for empty NUMA domains by:
- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.
Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683
We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor. This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.
Reported and tested by: Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, cem
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17209
given in random(4).
This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.
Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.
PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().
Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.
Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.
Update the nearby comment for UMA_SLAB_KERNEL.
Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)
Reviewed by: kib, markj
Discussed with: jeff, re@
Differential Revision: https://reviews.freebsd.org/D16825
The domain and flags parameters suffice. In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D16713
r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the
backend allocator for PCPU UMA zones. Unlike page_alloc(), it does
not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that.
r336020 also changed counter(9) to initialize each counter using a
CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU,
smp_rendezvous() will only execute the callback on the current CPU
(i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed
by virtue of the fact that UMA gratuitously zeroes slabs when importing
them into a zone.
Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't
zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no
effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by
iterating over the full range of CPU IDs when zeroing counters,
ignoring whether the corresponding bits in all_cpus are set.
Reported and tested by: pho (previous version)
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D16190
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
Most kernel memory that is allocated after boot does not need to be
executable. There are a few exceptions. For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9). The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.
(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory. This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory. This change makes the behavior consistent.)
This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified. After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.
Allocations that do need executable memory have various choices. They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.
Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.
PR: 228927
Reviewed by: alc, kib, markj, jhb (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15691
Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.
In particular contrary to the claim in r334824, M_ZERO is sometimes being
used for such zones. But the zeroing method is completely different and
braching on it in the fast path for regular zones is a waste of time.