markj
f2fc591269 Add a !NUMA definition for vm_domainset_iter_policy_ref_init().
Pointy hat:	markj
X-MFC with:	r339661
Sponsored by:	The FreeBSD Foundation
2018-10-24 17:09:20 +00:00
markj
239b1074f3 Add an #include required after r339686.
X-MFC with:	r339686
Sponsored by:	The FreeBSD Foundation
2018-10-24 16:49:16 +00:00
markj
ac2a239a60 Use a vm_domainset iterator in keg_fetch_slab().
Previously, it used a hand-rolled round-robin iterator.  This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with:	jeff
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17420
2018-10-24 16:41:47 +00:00
markj
875935f8c5 Initialize static domainsets regardless of whether an SRAT is present.
Reported by:	yuripv
X-MFC with:	r339452
Sponsored by:	The FreeBSD Foundation
2018-10-23 18:07:16 +00:00
markj
6c262608dd Refactor domainset iterators for use by malloc(9) and UMA.
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc".  The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy.  Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.

In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations.  To that end, refactor the vm_domainset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.

Reviewed by:	jeff (previous version)
Tested by:	pho (part of a larger patch)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17417
2018-10-23 16:35:58 +00:00
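
A minimal sketch of the allocation loop the refactored KPI enables, using
the vm_domainset_iter_* names that appear in these commits; the exact
signatures and the allocator helper are assumptions for illustration:

	struct vm_domainset_iter di;
	void *mem = NULL;
	int domain, flags = M_NOWAIT;

	/* Bind the iterator to a policy and get the first domain. */
	vm_domainset_iter_policy_init(&di, ds, &domain, &flags);
	do {
		/* try_alloc_domain() is a hypothetical per-domain allocator. */
		mem = try_alloc_domain(domain, size, flags);
	} while (mem == NULL &&
	    vm_domainset_iter_policy(&di, &domain) == 0);
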
markj
0dd92926d8 Make it possible to disable NUMA support with a tunable.
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration.  With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR:		231460
Reviewed by:	alc, gallatin, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17439
2018-10-22 20:13:51 +00:00
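
A sketch of how a boot-time chicken switch like this is typically wired
up; the tunable name below is an assumption for illustration, not
necessarily the one this commit added:

	static int numa_disabled = 0;

	/*
	 * Loader tunables can be set from loader.conf and are fetched
	 * early at boot, before the NUMA topology is consumed.
	 * "vm.numa.disabled" is an assumed name.
	 */
	TUNABLE_INT_FETCH("vm.numa.disabled", &numa_disabled);
	if (numa_disabled)
		vm_ndomains = 1;	/* view all memory as one domain */
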
markj
816ca801ac Fix the build after r339601.
I committed some patches out of order and didn't build-test one of them.

Reported by:	Jenkins, O. Hartmann <ohartmann@walstatt.org>
X-MFC with:	r339601
2018-10-22 17:19:48 +00:00
markj
87de4fcb0a Avoid a redundancy in a comment updated by r339601.
Reported by:	alc
X-MFC with:	r339601
2018-10-22 17:17:30 +00:00
markj
9ce499cca5 Swap in processes unless there's a global memory shortage.
On NUMA systems, we would not swap in processes unless all domains
had some free pages.  This is too conservative in general.  Instead,
permit swapins so long as at least one domain has free pages, and add
a kernel stack NUMA policy which ensures that we will try to allocate
kernel stack pages from any domain.

Reported and tested by:	pho, Jan Bramkamp <crest@bultmann.eu>
Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17304
2018-10-22 17:04:04 +00:00
glebius
e30533f381 If we lost the race or were migrated during bucket allocation for the
per-CPU cache, then we put the new bucket on the generic bucket cache. However,
the code didn't honor the UMA_ZONE_NOBUCKETCACHE flag, so we could potentially
start a cache on a zone that clearly forbids one. Fix this.

Reviewed by:	markj
2018-10-22 15:48:07 +00:00
kib
1a44e90fd3 Unindent vm_map_simplify_entry() after r339506.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17632
2018-10-21 00:11:56 +00:00
kib
125d9d8c57 Reduce code duplication in merging vm_entry neighbors.
Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17610
2018-10-20 23:08:04 +00:00
markj
8998a23151 Create some global domainsets and refactor NUMA registration.
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by:	alc, gallatin, jeff, kib
Tested by:	pho (part of a larger patch)
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17416
2018-10-20 17:36:00 +00:00
mmacy
9a1ff096d9 Eliminate locking surrounding ui_vmsize and swap reserve by using atomics
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics, instead of a single
global mutex for swap_reserve and a single mutex shared by all
processes running under the same uid for uid accounting.

This results in an mmap() speedup and a 70% increase in brk() calls
per second.

Reviewed by:	alc@, markj@, kib@
Approved by:	re (delphij@)
Differential Revision:	https://reviews.freebsd.org/D16273
2018-10-05 05:50:56 +00:00
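
The core lockless pattern here is reserve-then-check with rollback; a
simplified sketch using atomic(9) primitives (the committed code also
handles overcommit policy and per-uid accounting):

	#include <sys/types.h>
	#include <machine/atomic.h>

	static u_long swap_reserved;	/* pages currently reserved */
	static u_long swap_total;	/* pages of configured swap */

	static int
	swap_try_reserve(u_long npages)
	{
		u_long prev;

		prev = atomic_fetchadd_long(&swap_reserved, npages);
		if (prev + npages > swap_total) {
			/* Overshot the limit: undo the reservation and fail. */
			atomic_subtract_long(&swap_reserved, npages);
			return (0);
		}
		return (1);
	}
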
markj
a19efb5e69 Use an unsigned iterator for domain sets.
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).

Reported by:	pho
Reviewed by:	kib
Approved by:	re (rgrimes)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17374
2018-10-01 18:51:39 +00:00
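
For reference, the C semantics behind this: a signed remainder can be
negative, while an unsigned iterator always yields a value in [0, ds_cnt):

	int iter = -5;			/* cursor gone negative after a wrap */
	int d = iter % 4;		/* == -1 in C99 and later: out of range */

	unsigned int uiter = (unsigned int)-5;	/* == 4294967291 with 32-bit int */
	unsigned int ud = uiter % 4;		/* == 3: always in [0, 4) */
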
gallatin
770bffbee6 Allow empty NUMA memory domains to support Threadripper2
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
    rather than failing to parse the table
- not running the pageout daemon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
    Thanks to Jeff for suggesting this strategy.

Reviewed by:	alc, markj
Approved by:	re (gjb@)
Differential Revision:	https://reviews.freebsd.org/D1683
2018-10-01 14:14:21 +00:00
kib
56403981a0 Correct vm_fault_copy_entry() handling of backing file truncation
after the file mapping was wired.

if a wired map entry is backed by vnode and the file is truncated,
corresponding pages are invalidated.  vm_fault_copy_entry() should be
aware of it and allow for invalid pages past end of file. Also, such
pages should be not mapped into userspace.  If userspace accesses the
truncated part of the mapping later, it gets a signal, there is no way
kernel can prevent the page fault.

Reported by:	andrew using syzkaller
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:11:38 +00:00
kib
3b71d845e6 In vm_fault_copy_entry(), we should not assert that the entry is charged
if the dst_object is not of swap type.

This can only happen when the entry does not require a copy; otherwise
vm_map_protect() has already added the charge. So the assert was right
for the case where the swap object was allocated in
vm_fault_copy_entry(), but not when the object was just copied from
src_entry and its type is not swap.

Reported by:	andrew using syzkaller
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:11:01 +00:00
kib
06ff3bcf36 In vm_fault_copy_entry(), collect the code to initialize a newly
allocated dst_object in a single place.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:10:12 +00:00
markj
625c57d19c Add more NUMA-specific low memory predicates.
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.

Reviewed by:	alc, jeff, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17278
2018-09-24 19:24:17 +00:00
alc
337ee8ed9f Passing UMA_ZONE_NOFREE to uma_zcreate() for swpctrie_zone and swblk_zone is
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone.  (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)

Reviewed by:	kib, markj
Approved by:	re (gjb)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17296
2018-09-24 16:49:02 +00:00
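
In other words, a zone set up along the following lines needs no
explicit flag (a sketch, not the committed code; the item count is
illustrative):

	/* Created without UMA_ZONE_NOFREE ... */
	swblk_zone = uma_zcreate("swblk", sizeof(struct swblk),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	/* ... because reserving KVA marks the zone NOFREE anyway. */
	uma_zone_reserve_kva(swblk_zone, nitems);
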
markj
9d6288c080 Ensure that "domain" is initialized when vm_ndomains == 1.
Reported by:	alc
Approved by:	re (gjb)
2018-09-24 15:32:46 +00:00
markj
c030a808b9 Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned.
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.

Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.

Reported by:	alc
Reviewed by:	alc, kib (previous version)
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17249
2018-09-20 18:29:55 +00:00
markj
9557a686f9 Change the domain selection policy in kmem_back().
Ensure that pages backing the same virtual large page come from the
same physical domain, as kmem_malloc_domain() does.

PR:		231038
Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17248
2018-09-20 15:45:12 +00:00
markj
b6e6107289 Move kernel vmem arena initialization to vm_kern.c.
This keeps the initialization coupled together with the kmem_* KPI
implementation, which is the main user of these arenas.

No functional change intended.

Reviewed by:	alc
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17247
2018-09-19 19:13:43 +00:00
mjg
e4a8d038e7 vm: check for empty kstack cache before locking
The current cache logic checks the total number of stacks in the kernel,
which even on small boxes significantly exceeds the 128 limit (e.g. an
8-way box with zfs has almost 800 stacks allocated).

Stacks are cached earlier for each main thread.

As a result the code is rarely executed, but when it is (on boxes like
the above), it always fails. Since there are no provisions made for
NUMA and release time is approaching, just do a quick check to avoid
acquiring the lock.

Approved by:	re (kib)
2018-09-19 16:02:33 +00:00
markj
351e1e4099 Only update the domain cursor once in keg_fetch_slab().
We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor.  This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.

Reported and tested by:	Jan Bramkamp <crest@bultmann.eu>
Reviewed by:	alc, cem
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17209
2018-09-18 17:51:45 +00:00
mjg
1851883043 vm: stop taking proc lock in mmap to satisfy racct if it is disabled
Limits can be safely obtained with lim_cur from the thread. racct is compiled
in but disabled by default. Note that racct enablement is a boot-only tunable.

This eliminates the second most common place the lock is taken during
package builds.

While here, don't take the lock in mlockall() either.

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17210
2018-09-18 01:24:30 +00:00
markj
74aba93ee7 Split some checks in vm_page_activate() to make it easier to read.
No functional change intended.

Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17028
2018-09-10 18:59:23 +00:00
markj
933d701a36 Relax an assertion in vm_pqbatch_process_page().
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page.  In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.

Reviewed by:	alc, kib
Reported and tested by:	pho
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17027
2018-09-08 21:49:43 +00:00
markj
e2445c9ca3 Use the correct terminology.
Reported by:    kib
Approved by:	re (gjb)
Differential revision:  https://reviews.freebsd.org/D16191
2018-09-06 20:02:19 +00:00
markj
f193d69fe6 Avoid resource deadlocks when one domain has exhausted its memory. Attempt
other allowed domains if the requested domain is below the minimum paging
threshold.  Block in fork only if all domains available to the forking
thread are below the severe threshold, rather than if any one domain is.

Submitted by:	jeff
Reported by:	mjg
Reviewed by:	alc, kib, markj
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D16191
2018-09-06 19:28:52 +00:00
markj
e2dd7215d6 Remove vm_page_remque().
Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276.  As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.

Reviewed by:	kib
Tested by:	pho
Approved by:	re (marius)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17025
2018-09-06 16:17:45 +00:00
alc
5e70e46481 Recent changes have created, for the first time, physical memory segments
that can be coalesced.  To be clear, fragmentation of phys_avail[] is not
the cause.  This fragmentation of vm_phys_segs[] arises from the "special"
calls to vm_phys_add_seg(), in other words, not those that derive directly
from phys_avail[], but those that we create for the initial kernel page
table pages and now for the kernel and modules loaded at boot time.  Since
we sometimes iterate over the physical memory segments, coalescing these
segments at initialization time is a worthwhile change.

Reviewed by:	kib, markj
Approved by:	re (rgrimes)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16976
2018-09-02 18:29:38 +00:00
kib
acc0761238 Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.

Based on the submission by:	royger
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Approved by:	re (marius)
Differential revision:	https://reviews.freebsd.org/D16881
2018-08-29 12:24:19 +00:00
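
Call sites are mechanical to convert; where code previously spelled the
map bounds through the removed defines, it now reads (a sketch):

	vm_offset_t lo, hi;

	lo = vm_map_min(map);	/* previously via the min_offset define */
	hi = vm_map_max(map);	/* previously via the max_offset define */
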
markm
d8723e8b03 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
alc
3799d78beb Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
glebius
729ef97091 Either "free" or "allocated" is misleading here, since an item
in a bucket is free from the perspective of the UMA consumer, and
allocated from the perspective of the keg.

Discussed with:	markj
Approved by:	re (kib)
2018-08-24 18:47:50 +00:00
glebius
3b5a53cba0 Fix comment. The actual meaning of ub_cnt is the opposite.
2018-08-23 23:24:28 +00:00
markj
6ee3780664 Add a per-pagequeue pdpages counter.
Expose these counters under the vm.domain sysctl node.  The existing
vm.stats.vm.v_pdpages sysctl is preserved.

Reviewed by:	alc (previous version)
Differential Revision:	https://reviews.freebsd.org/D14666
2018-08-23 21:03:45 +00:00
markj
fd4c15bf76 Ensure that queue state is cleared when vm_page_dequeue() returns.
Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held.  When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held.  Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.

Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.

While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked().  Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.

Reported and tested by:	pho
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D15980
2018-08-23 20:34:22 +00:00
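
The last point follows the usual pattern of replacing a volatile
qualifier with explicit once-only loads; a sketch of the assumed shape,
not the committed diff:

	uint8_t queue;

	/* m->queue may change concurrently; read it exactly once. */
	queue = atomic_load_8(&m->queue);
	if (queue == PQ_NONE)
		return;
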
alc
4ce21fcbea Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)

Reviewed by:	kib, markj
Discussed with:	jeff, re@
Differential Revision:	https://reviews.freebsd.org/D16825
2018-08-21 16:43:46 +00:00
alc
71b5b012c4 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
alc
f8b2118de8 Eliminate the unused arena parameter from kmem_alloc_attr().
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16793
2018-08-18 22:07:48 +00:00
alc
dce7d8625a Eliminate the arena parameter to kmem_malloc_domain(). It is redundant.
The domain and flags parameters suffice.  In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16713
2018-08-18 18:33:50 +00:00
kib
21fe622aa6 Prevent some parallel swap-ins, rate-limit swapper swap-ins.
If faultin() was called outside the swapper (from PHOLD()), do not
allow the swapper to initiate additional swap-ins.  Swapper-initiated
swap-ins are serialized because they are synchronous and executed in
the context of thread0.  With the added limitation, we only allow
parallel swap-ins from PHOLD(), which is up to PHOLD() users to
manage; usually they do not need to.

Rate-limit the swapper's swap-ins to one per MAXSLP / 2 seconds
interval, counting faultin() swap-ins.

Suggested by:	alc
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16610
2018-08-13 16:48:46 +00:00
markj
02291bbeb3 Account for the lowmem handlers in the inactive queue scan target.
Before r329882 the target would be computed after lowmem handlers run
and free pages.  On some systems a significant amount of page
reclamation happens this way.  However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.

Instead, adjust the target after running lowmem handlers.  Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.

Reviewed by:	alc, kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16606
2018-08-09 18:25:49 +00:00
alc
4cde6a5313 Add support for pmap_enter(..., psind=1) to the armv6 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 1 MB page
mapping.  (Essentially, this feature allows the machine-independent layer
to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail
or reclaim a PV entry when one is not available.

Refactor pmap_enter_pte1() into two functions, one by the same name, that
is a general-purpose function for creating pte1 mappings, and another,
pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute-
only mappings for execve(2), mmap(2), and shmat(2).

In addition, as an optimization to pmap_enter(..., psind=0), eliminate the
use of pte2_is_managed() from pmap_enter().  Unlike the x86 pmap
implementations, armv6 does not have a managed bit defined within the PTE.
So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n)
in the number of vm_phys_segs[].  All but one call to PHYS_TO_VM_PAGE() in
pmap_enter() can be avoided.

Reviewed by:	kib, markj, mmel
Tested by:	mmel
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16555
2018-08-08 16:55:01 +00:00
alc
25277b8f8a Defer and aggregate swap_pager_meta_build frees.
Before swp_pager_meta_build replaces an old swapblk with a new one,
it frees the old one.  To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.

Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib, markj (an earlier version)
Tested by:	pho
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13707
2018-08-08 02:30:34 +00:00
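
The two helpers reduce to a simple run-length pattern; a sketch with the
internals assumed (the real versions live in vm/swap_pager.c, and
swp_pager_freeswapspace() is taken here to be the underlying
block-freeing routine):

	/* Track a contiguous run of swap blocks pending free. */
	static void
	swp_pager_init_freerange(daddr_t *start, daddr_t *num)
	{
		*start = SWAPBLK_NONE;
		*num = 0;
	}

	static void
	swp_pager_update_freerange(daddr_t *start, daddr_t *num, daddr_t blk)
	{
		if (*start + *num == blk) {
			(*num)++;	/* extend the contiguous run */
		} else {
			/* Flush any pending run, then start a new one. */
			if (*start != SWAPBLK_NONE)
				swp_pager_freeswapspace(*start, *num);
			*start = blk;
			*num = 1;
		}
	}
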
kib
4d78ed404a Swap in WKILLED processes.
A swapped-out process that is WKILLED must be swapped in as soon as
possible.  The reason is that such a process can be killed by the OOM
handler, and its pages can only be freed once the process exits.  To
exit, the kernel stack of the process must be mapped.

When allocating pages for the stack of a WKILLED process on swap-in,
use VM_ALLOC_SYSTEM requests to increase the chance that the
allocation succeeds.

Add a counter of swapped-out processes to avoid needless iteration
over the allproc list when there is no work to do, reducing
allproc_lock ownership.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16489
2018-08-04 20:45:43 +00:00