This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.
Reviewed by: kib, dab
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D2724
Otherwise the free page count will not accurately reflect the physical
page allocator's state. On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync. On 12 the
bug would manifest as threads looping in vm_page_alloc().
PR: 231296
Reported by: mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by: alc, kib, mav
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18374
several pages, but leaves no space for struct uma_slab at the end we
miscalculate number of pages by one. Totally mimic keg_large_init() math
here to cover that problem.
Reported by: gallatin
is calculated to guarantee that struct uma_slab is placed at pointer size
alignment. Calculation of real struct uma_slab size is done in keg_ctor()
and yet again in keg_large_init(), to check if we need an extra page. This
calculation can actually be performed at compile time.
- Add SIZEOF_UMA_SLAB macro to calculate size of struct uma_slab placed at
an end of a page with alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is a not
a functional change.
- Use SIZEOF_UMA_SLAB in UMA_SLAB_SPACE definition and in keg_small_init().
This is a potential bugfix, but in reality I don't think there are any
systems affected, since compiler aligns struct uma_slab anyway.
All vmspace_alloc() callers know which kind of pmap they allocate.
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18329
According to markj@:
pageproc contains the page daemon and laundry threads, which are
responsible for managing the LRU page queues and writing back dirty
pages. vmproc's main task is to swap out kernel stacks when the system
is under memory pressure, and swap them back in when necessary. It's a
somewhat legacy component of the system and isn't required. You can
build a kernel without it by specifying "options NO_SWAPPING" (which is
a somewhat misleading name), in which vm_swapout_dummy.c is compiled
instead of vm_swapout.c.
Based on this, we want pageproc to emulate kswapd, not vmproc.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D18061
These are used by kms-drm to determine various heuristics relate
memory conditions.
The number of free swap pages is just a variable, and it can be
much cheaper by either adding a new getter, or simply extern'ing
swap_total. However, this patch opts to use the more expensive,
existing interface - since this isn't an operation in a high per
path.
This allows us to remove some more gpl linuxkpi and do the follo
kms-drm:
git rm linuxkpi/gplv2/include/linux/swap.h
Reviewed by: mmacy, Johannes Lundberg <johalun0@gmail.com>
Approved by: emaste (mentor)
Differential Revision: https://reviews.freebsd.org/D18052
functions. Notably, reflow the text of some comments so that they
occupy fewer lines, and introduce an assertion in one of the new
helper functions so that it is not misused by a future caller.
In collaboration with: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17635
In particular, track the current size of the cache and maintain an
estimate of its working set size. This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages. As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.
Discussed with: alc, glebius, jeff
Tested by: pho (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D16666
r336984 exposed the bug fixed in r340241, leading to the initial revert
while the bug was being hunted down. Now that the bug is fixed, we
can revert the revert.
Discussed with: alc
MFC after: 3 days
This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934. In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed. After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17897
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.
Reported by: jhb
Reviewed by: alc, jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17827
In practice it is always initialized because nfreed must be positive
in order to trigger background laundering, but this isn't obvious.
CID: 1387997
MFC after: 1 week
Initializing the eflags field of the map->header entry to a value with a
unique new bit set makes a few comparisons to &map->header unnecessary.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, kib
Tested by: pho
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D14005
It appears to be responsible for random segfaults observed when lots
of paging activity is taking place, but the root cause is not yet
understood.
Requested by: alc
MFC after: now
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.
This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.
Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.
Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704
on-write faults. On a page fault, when we call vm_fault_prefault(), it
probes the pmap and the shadow chain of vm objects to see if there are
opportunities to create read and/or execute-only mappings to neighoring
pages. For example, in the case of hard faults, such effort typically pays
off, that is, mappings are created that eliminate future soft page faults.
However, in the the case of soft, copy-on-write faults, the effort very
rarely pays off. (See the review for some specific data.)
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D17367
Previously, it used a hand-rolled round-robin iterator. This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.
Discussed with: jeff
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17420
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc". The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy. Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.
In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations. To that end, refactor the vm_dominset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.
Reviewed by: jeff (previous version)
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17417
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration. With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.
This method still has some performance overhead relative to disabling
NUMA support at compile time.
PR: 231460
Reviewed by: alc, gallatin, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17439
I committed some patches out of order and didn't build-test one of them.
Reported by: Jenkins, O. Hartmann <ohartmann@walstatt.org>
X-MFC with: r339601
On NUMA systems, we would not swap in processes unless all domains
had some free pages. This is too conservative in general. Instead,
permit swapins so long as at least one domain has free pages, and add
a kernel stack NUMA policy which ensures that we will try to allocate
kernel stack pages from any domain.
Reported and tested by: pho, Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17304
cache, then we put new bucket on generic bucket cache. However, code didn't
honor UMA_ZONE_NOBUCKETCACHE flag, so potentially we could start a cache
on a zone that clearly forbids that. Fix this.
Reviewed by: markj
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.
The refactoring will make it easier to add NUMA support for other
architectures.
No functional change intended.
Reviewed by: alc, gallatin, jeff, kib
Tested by: pho (part of a larger patch)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17416
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.
Results in mmap speed up and a 70% increase in brk calls / second.
Reviewed by: alc@, markj@, kib@
Approved by: re (delphij@)
Differential Revision: https://reviews.freebsd.org/D16273
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).
Reported by: pho
Reviewed by: kib
Approved by: re (rgrimes)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17374
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.
Add support to FreeBSD for empty NUMA domains by:
- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.
Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683
after the file mapping was wired.
if a wired map entry is backed by vnode and the file is truncated,
corresponding pages are invalidated. vm_fault_copy_entry() should be
aware of it and allow for invalid pages past end of file. Also, such
pages should be not mapped into userspace. If userspace accesses the
truncated part of the mapping later, it gets a signal, there is no way
kernel can prevent the page fault.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
if the dst_object is not of swap type.
It can only happen when entry does not require copy, otherwise
vm_map_protect() already adds the charge. So the assert was right for
the case where swap object was allocated in the vm_fault_copy_entry(),
but not when it was just copied from src_entry and its type is not
swap.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
allocated dst_object in a single place.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.
Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone. (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)
Reviewed by: kib, markj
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17296
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.
Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.
Reported by: alc
Reviewed by: alc, kib (previous version)
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17249
Ensure that pages backing the same virtual large page come from the
same physical domain, as kmem_malloc_domain() does.
PR: 231038
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17248
This keeps the initialization coupled together with the kmem_* KPI
implementation, which is the main user of these arenas.
No functional change intended.
Reviewed by: alc
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17247
The current cache logic checks the total number of stacks in the kernel,
which even on small boxes significantly exceeds the 128 limit (e.g. an
8-way box with zfs has almost 800 stacks allocated).
Stacks are cached earlier for each main thread.
As a result the code is rarely executed, but when it is then (on boxes like
the above) it always fails. Since there are no provisions made for NUMA and
release time is approaching, just do a quick check to avoid acquiring the
lock.
Approved by: re (kib)
We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor. This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.
Reported and tested by: Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, cem
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17209
Limits can be safely obtained with lim_cur from the thread. racct is compiled
in but disabled by default. Note that racct enablement is a boot-only tunable.
This eliminates second most common place of taking the lock while pkg building.
While here don't take the lock in mlockall either.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17210
No functional change intended.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17028
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page. In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.
Reviewed by: alc, kib
Reported and tested by: pho
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17027