Commit Graph

4521 Commits

Author SHA1 Message Date
Alexander Motin
2760658b21 Improve UMA cache reclamation.
When estimating working set size, measure only allocation batches, not free
batches.  Allocation and free patterns can be very different.  For example,
ZFS on vm_lowmem event can free to UMA few gigabytes of memory in one call,
but it does not mean it will request the same amount back that fast too, in
fact it won't.

Update working set size on every reclamation call, shrinking caches faster
under pressure.  Lack of this caused repeating vm_lowmem events squeezing
more and more memory out of real consumers only to make it stuck in UMA
caches.  I saw ZFS drop ARC size in half before previous algorithm after
periodic WSS update decided to reclaim UMA caches.

Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom track longterm minimal cache size watermark, freeing some unused
items every UMA_TIMEOUT after first 15 minutes without cache misses. Freed
memory can get better use by other consumers.  For example, ZFS won't grow
its ARC unless it see free memory, since it does not know it is not really
used.  And even if memory is not really needed, periodic free during
inactivity periods should reduce its fragmentation.

Reviewed by:	markj, jeff (previous version)
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29790
2021-05-02 19:45:23 -04:00
Konstantin Belousov
ecfbddf0cd sysctl vm.objects: report backing object and swap use
For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object.  This allows userspace
to deconstruct the shadow chain.  Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field.  I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29771
2021-04-19 21:32:01 +03:00
Mark Johnston
aabe13f145 uma: Introduce per-domain reclamation functions
Make it possible to reclaim items from a specific NUMA domain.

- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations.  Use a counter instead of a flag to
  synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
  can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
  domain.

Currently the new KPIs are unused, so there should be no functional
change.

Reviewed by:	mav
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29685
2021-04-14 13:03:34 -04:00
Mark Johnston
54f421f9e8 uma: Split bucket_cache_drain() to permit per-domain reclamation
Note that the per-domain variant does not shrink the target bucket size.

No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-04-14 13:03:34 -04:00
Mark Johnston
2b914b85dd kmem: Add KASAN state transitions
Memory allocated with kmem_* is unmapped upon free, so KASAN doesn't
provide a lot of benefit, but since allocations are always a multiple of
the page size we can create a redzone when the allocation request size
is not a multiple of the page size.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29458
2021-04-13 17:42:21 -04:00
Mark Johnston
244f3ec642 kstack: Add KASAN state transitions
We allocate kernel stacks using a UMA cache zone.  Cache zones have
KASAN disabled by default, but in this case it makes sense to enable it.

Reviewed by:	andrew
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29457
2021-04-13 17:42:21 -04:00
Mark Johnston
09c8cb717d uma: Add KASAN state transitions
- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
  zone should not be sanitized.  This is applied implicitly for NOFREE
  and cache zones.
- Add KASAN call backs which get invoked:
  1) when a slab is imported into a keg
  2) when an item is allocated from a zone
  3) when an item is freed to a zone
  4) when a slab is freed back to the VM

  In state transitions 1 and 3, memory is poisoned so that accesses will
  trigger a panic.  In state transitions 2 and 4, memory is marked
  valid.
- Disable trashing if KASAN is enabled.  It just adds extra CPU overhead
  to catch problems that are detected by KASAN.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29456
2021-04-13 17:42:21 -04:00
Mark Johnston
982693bb72 vm_fault: Shoot down multiply mapped COW source page mappings
Reviewed by:	kib, rlibby
Discussed with:	alc
Approved by:	so
Security:	CVE-2021-29626
Security:	FreeBSD-SA-21:08.vm
2021-04-06 14:49:28 -04:00
Konstantin Belousov
89619b747b Add sysctl debug.uma_reclaim
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-04 20:39:06 +03:00
Konstantin Belousov
51a7be5f60 Style
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-04-04 20:39:06 +03:00
Jason A. Harmening
8dc8feb53d Clean up a couple of MD warts in vm_fault_populate():
--Eliminate a big ifdef that encompassed all currently-supported
architectures except mips and powerpc32.  This applied to the case
in which we've allocated a superpage but the pager-populated range
is insufficient for a superpage mapping.  For platforms that don't
support superpages the check should be inexpensive as we shouldn't
get a superpage in the first place.  Make the normal-page fallback
logic identical for all platforms and provide a simple implementation
of pmap_ps_enabled() for MIPS and Book-E/AIM32 powerpc.

--Apply the logic for handling pmap_enter() failure if a superpage
mapping can't be supported due to additional protection policy.
Use KERN_PROTECTION_FAILURE instead of KERN_FAILURE for this case,
and note Intel PKU on amd64 as the first example of such protection
policy.

Reviewed by:	kib, markj, bdragon
Differential Revision:	https://reviews.freebsd.org/D29439
2021-03-30 18:15:55 -07:00
Konstantin Belousov
c7b913aa47 vm_fault: handle KERN_PROTECTION_FAILURE
pmap_enter(PMAP_ENTER_LARGEPAGE) may return KERN_PROTECTION_FAILURE due to
PKRU inconsistency.  Handle it in the call place from vm_fault_populate(),
and in places which decode errors from vm_fault_populate()/
vm_fault_allocate().

Reviewed by:	jah, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29442
2021-03-27 20:16:27 +02:00
Bryan Drewery
a771bf748f Remove unused obj variable missed in r354870.
Sponsored by:	Dell EMC
2021-03-17 15:29:15 -07:00
Kristof Provost
b8f7267d49 uma: allow uma_zfree_pcu(..., NULL)
We already allow free(NULL) and uma_zfree(..., NULL). Make
uma_zfree_pcpu(..., NULL) work as well.
This also means that counter_u64_free(NULL) will work.

These make cleanup code simpler.

MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29189
2021-03-12 12:12:35 +01:00
Mark Johnston
968079f253 vm_reserv: Fix list locking in vm_reserv_reclaim_contig()
The per-domain partpop queue is locked by the combination of the
per-domain lock and individual reservation mutexes.
vm_reserv_reclaim_contig() scans the queue looking for partially
populated reservations that can be reclaimed in order to satisfy the
caller's allocation.

During the scan, we drop the per-domain lock.  At this point, the rvn
pointer may be invalidated.  Take care to load rvn after re-acquiring
the per-domain lock.

While here, simplify the condition used to check whether a reservation
was dequeued while the per-domain lock was dropped.

Reviewed by:	alc, kib
Reported by:	gallatin
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29203
2021-03-11 10:35:35 -05:00
Mark Johnston
0401989282 vm: Round up npages and alignment for contig reclamation
When searching for runs to reclaim, we need to ensure that the entire
run will be added to the buddy allocator as a single unit.  Otherwise,
it will not be visible to vm_phys_alloc_contig() as it is currently
implemented.  This is a problem for allocation requests that are not a
power of 2 in size, as with 9KB jumbo mbuf clusters.

Reported by:	alc
Reviewed by:	alc
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28924
2021-03-02 10:21:02 -05:00
Max Laier
14b5a3c7d5 vm pqbatch: move unmanaged page assert under pagequeue lock
This KASSERT is overzealous because of the following race condition:
 1) A managed page which is currently in PQ_LAUNDRY is freed.
    vm_page_free_prep calls vm_page_dequeue_deferred()

    The page state is:
       PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED

 2) The laundry worker comes around and pick up the page and calls
    vm_pageout_defer(m, PQ_LAUNDRY, true) to check if page is still in the
    queue.  We do a vm_page_astate_load and get
       PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
    as per above.

 3) The laundry worker is pre-empted and another thread allocates our page
    from the free pool.  For example vm_page_alloc_domain_after calls
    vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for
    an OBJT_UNMANAGED object.

    The page state is:
       PQ_NONE, 0 - VPO_UNMANAGED

 4) The laundry worker resumes, and processes vm_pageout_defer based on the
    stale astate which leads to a call to vm_page_pqbatch_submit, which will
    trip on the KASSERT.

Submitted by:	mlaier
Reviewed by:	markj, rlibby
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D28563
2021-02-24 15:56:16 -08:00
Mark Johnston
537f92cd35 uma: Update the comment above startup_alloc() to reflect reality
The scheme used for early slab allocations changed in commit a81c400e75.

Reported by:	alc
Reviewed by:	alc
MFC after:	1 week
2021-02-22 18:22:51 -05:00
Mark Johnston
23e875fd97 vm_kern: Avoid sign extension in the KVA_QUANTUM definition
Otherwise, on a powerpc64 NUMA system with hashed page tables, the
first-level superpage reservation size is large enough that the value of
the kernel KVA arena import quantum, KVA_NUMA_IMPORT_QUANTUM, is
negative and gets sign-extended when passed to vmem_set_import().  This
results in a boot-time hang on such platforms.

Reported by:	bdragon
MFC after:	3 days
2021-02-22 15:50:09 -05:00
Alex Richardson
fa2528ac64 Use atomic loads/stores when updating td->td_state
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.

Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.

Reported by:	GENERIC-KCSAN
Test Plan:	KCSAN no longer complains, kernel still runs fine.
Reviewed By:	markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
2021-02-18 14:02:48 +00:00
John Baldwin
67932460c7 Add a VA_IS_CLEANMAP() macro.
This macro returns true if a provided virtual address is contained
in the kernel's clean submap.

In CHERI kernels, the buffer cache and transient I/O map are allocated
as separate regions.  Abstracting this check reduces the diff relative
to FreeBSD.  It is perhaps slightly more readable as well.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D28710
2021-02-17 16:32:11 -08:00
Mark Johnston
5c18744ea9 vm: Honour the "noreuse" flag to vm_page_unwire_managed()
This flag indicates that the page should be enqueued near the head of
the inactive queue, skipping the LRU queue.  It is used when unwiring
pages from the buffer cache following direct I/O or after I/O when
POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when
sendfile(SF_NOCACHE) completes.  For the direct I/O and sendfile cases
we only enqueue the page if we decide not to free it, typically because
it's mapped.

Pass "noreuse" through to vm_page_release_toq() so that we actually
honour the desired LRU policy for these scenarios.

Reported by:	bdrewery
Reviewed by:	alc, kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28555
2021-02-10 11:10:27 -05:00
Ryan Stone
660344ca44 Add a VM flag to prevent reclaim on a failed contig allocation
If a M_WAITOK contig alloc fails, the VM subsystem will try to
reclaim contiguous memory twice before actually failing the
request.  On a system with 64GB of RAM I've observed this take
400-500ms before it finally gives up, and I believe that this
will only be worse on systems with even more memory.

In certain contexts this delay is extremely harmful, so add a flag
that will skip reclaim for allocation requests to allow those
paths to opt-out of doing an expensive reclaim.

Sponsored by: Dell Inc
Differential Revision:	https://reviews.freebsd.org/D28422
Reviewed by: markj, kib
2021-02-03 16:16:51 -05:00
Brooks Davis
7a1591c1b6 Rename kern_mmap_req to kern_mmap
Replace all uses of kern_mmap with kern_mmap_req move the old kern_mmap.
Reand rename kern_mmap_req to kern_mmap                                .

The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.

Obtained from:	CheriBSD
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28292
2021-01-25 21:50:37 +00:00
Konstantin Belousov
420d4be3e4 vm_map_protect(): remove not needed recalculations of new_prot, new_maxprot
Requested by:	alc
Sponsored by:	The FreeBSD Foundation
2021-01-14 10:02:43 +02:00
Konstantin Belousov
0659df6fad vm_map_protect: allow to set prot and max_prot in one go.
This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome.  E.g. you can get an error that is not
possible if operation is performed atomic.

Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.

Reviewed by:	brooks, markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28117
2021-01-13 01:35:22 +02:00
Konstantin Belousov
9402bb44f1 vmspace_fork: preserve wx settings in the child vm map after fork
Noted by:	markj
Sponsored by:	The FreeBSD Foundation
2021-01-12 08:09:59 +02:00
Konstantin Belousov
2e1c94aa1f Implement enforcing write XOR execute mapping policy.
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
Mark Johnston
663de81f85 uma: Avoid unmapping direct-mapped slabs
startup_alloc() uses pmap_map() to map slabs used for bootstrapping the
VM.  pmap_map() may ignore the hint address and simply return a range
from the direct map.  In this case we must not unmap the range in
startup_free().

UMA uses bootstart and bootmem to track the range of KVA into which
slabs are mapped if the direct map is not used.  Unmap a startup slab
only if it was mapped into that range.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27885
2021-01-03 11:50:31 -05:00
Ryan Libby
942951ba46 uma dbg: catch more corruption with atomics
Use atomic testandset and testandclear to catch concurrent double free,
and to reduce the number of atomic operations.

Submitted by:	jeff
Reviewed by:	cem, kib, markj (all previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22703
2020-12-31 13:02:45 -08:00
Mark Johnston
81846def34 vm: Fix some bugs in the page busying code
In vm_page_busy_acquire(), load the object pointer using
atomic_load_ptr() as we do elsewhere.  Per the comment, the object
identity must be consistent across sleeps.

In vm_page_grab_sleep(), pass the correct pindex to
_vm_page_busy_sleep().  The pindex is used to re-check the page's
identity before going to sleep.  In particular, vm_page_grab_sleep() is
used in unlocked grab, so the object lock is not necessarily held when
verifying the page's identity, and the pindex may change if the page is
moved, or freed and re-allocated.  I believe this can result in spurious
VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination
of vm_page_grab_pages_unlocked().

In vm_page_grab_pages(), pass the correct pindex to
vm_page_grab_sleep().  Otherwise I believe vm_page_grab_pages() will
effectively spin when attempting to busy a busy page after the first
index in the range.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27607
2020-12-27 17:01:44 -05:00
Mark Johnston
d2f1c44bc9 uma: Remove the MINBUCKET flag from the flag name list
This should have been done in r368399 / commit
f8b6c51538.

Reported by:	rlibby
Sponsored by:	The FreeBSD Foundation
2020-12-27 17:01:33 -05:00
Bryan Drewery
5fee468e83 Revert r368523 which fixed contig allocs waiting forever.
This needs to account for empty NUMA domains or domains which do not
satisfy the requested range.

Discussed with:	markj
2020-12-15 19:38:16 +00:00
Bryan Drewery
bbfec1633b contig allocs: Don't retry forever on M_WAITOK.
This restores behavior from before domain iterators were added in
r327895 and r327896.

The vm_domainset_iter_policy() will do a vm_wait_doms() and then
restart its iterator when M_WAITOK is set.  It will also force
the containing loop to have M_NOWAIT.  So we get an unbounded
retry loop rather than the intended bounded retries that
kmem_alloc_contig_pages() already handles.

This also restores M_WAITOK to the vmem_alloc() call in
kmem_alloc_attr_domain() and kmem_alloc_contig_domain().

Reviewed by:	markj, kib
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D27507
2020-12-10 20:44:29 +00:00
Mark Johnston
e574d407ae uma: Make uma_zone_set_maxcache() work better with small limits
The old implementation chose the largest bucket zone such that if the
per-CPU caches are fully populated, the total number of items cached is
no larger than the specified limit.  If no such zone existed, UMA would
not do any caching.

We can now use uz_bucket_size_max to set a precise limit on the number
of items in a zone's bucket, so the total size of per-CPU caches can be
bounded more easily.  Implement a new policy in uma_zone_set_maxcache():
choose a bucket size such that up to half of the limit can be cached in
per-CPU caches, with the rest going to the full bucket cache.  This
fixes a problem with the kstack_cache zone: the limit of 4 * mp_ncpus
items meant that the zone would not do any caching, defeating the whole
purpose of the zone.  That's because the smallest bucket size holds up
to 2 items and we may cache up to 3 full buckets per CPU, and
2 * 3 * mp_ncpus > 4 * mp_ncpus.

Reported by:	mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27168
2020-12-06 22:45:50 +00:00
Mark Johnston
f8b6c51538 uma: Enforce the use of uz_bucket_size_max in the free path
uz_bucket_size_max is the maximum permitted bucket size.  When filling a
new bucket to satisfy uma_zalloc(), the bucket is populated with at most
uz_bucket_size_max items.  The maximum number of entries in the bucket
may be larger.  When freeing items, however, we will fill per-CPPU
buckets up to their maximum number of entries, potentially exceeding
uz_bucket_size_max.  This makes it difficult to precisely limit the
number of items that may be cached in a zone.  For example, if one wants
to limit buckets to 1 entry for a particular zone, that's not possible
since the smallest bucket holds up to 2 entries.

Try to solve the problem by using uz_bucket_size_max to limit the number
of entries in a bucket.  Note that the ub_entries field is initialized
upon every bucket allocation.  Most zones are not affected since they do
not impose any specific limit on the maximum bucket size.

While here, remove the UMA_ZONE_MINBUCKET flag.  It was unused and we
now have uma_zone_set_maxcache() to control the zone's cache size more
precisely.

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27167
2020-12-06 22:45:39 +00:00
Mark Johnston
8a6776ca0f uma: Use atomic load for uz_sleepers
This field is updated locklessly.

Sponsored by:	The FreeBSD Foundation
2020-12-06 22:45:22 +00:00
Mark Johnston
991f23ef20 uma: Avoid allocating buckets with the cross-domain lock held
Allocation of a bucket can trigger a cross-domain free in the bucket
zone, e.g., if the per-CPU alloc bucket is empty, we free it and get
migrated to a remote domain.  This can lead to deadlocks since a bucket
zone may allocate buckets from itself or a pair of bucket zones could be
allocating from each other.

Fix the problem by dropping the cross-domain lock before allocating a
new bucket and handling refill races.  Use a list of empty buckets to
ensure that we can make forward progress.

Reported by:	imp, mjg (witness(9) warnings)
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27341
2020-11-30 16:18:33 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Mark Johnston
1fea4b25c9 Wrap a long line in vm_pqbatch_process_page() 2020-11-19 15:41:42 +00:00
Mark Johnston
9e3e737608 Micro-optimize vm_page_pqbatch_submit()
Avoid calling vm_page_domain() twice.

Discussed with:	alc (in D27207)
2020-11-19 15:40:58 +00:00
Mark Johnston
431fb8abd7 vm_phys: Try to clean up NUMA KPIs
It can useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures.  We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain().  Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers.  Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.

Reviewed by:	alc
Reviewed by:	dougm, kib (earlier versions)
Differential Revision:	https://reviews.freebsd.org/D27207
2020-11-19 03:59:21 +00:00
Mark Johnston
20f02659d6 vm_map: Handle kernel map entry allocator recursion
On platforms without a direct map[*], vm_map_insert() may in rare
situations need to allocate a kernel map entry in order to allocate
kernel map entries.  This poses a problem similar to the one solved for
vmem boundary tags by vmem_bt_alloc().  In fact the kernel map case is a
bit more complicated since we must allocate entries with the kernel map
locked, whereas vmem can recurse into itself because boundary tags are
allocated up-front.

The solution is to add a custom slab allocator for kmapentzone which
allocates KVA directly from kernel_map, bypassing the kmem_* layer.
This avoids mutual recursion with the vmem btag allocator.  Then, when
vm_map_insert() allocates a new kernel map entry, it avoids triggering
allocation of a new slab with M_NOVM until after the insertion is
complete.  Instead, vm_map_insert() allocates from the reserve and sets
a flag in kernel_map to trigger re-population of the reserve just before
the map is unlocked.  This places an implicit upper bound on the number
of kernel map entries that may be allocated before the kernel map lock
is released, but in general a bound of 1 suffices.

[*] This also comes up on amd64 with UMA_MD_SMALL_ALLOC undefined, a
configuration required by some kernel sanitizers.

Discussed with:	kib, rlibby
Reported by:	andrew
Tested by:	pho (i386 and amd64 with !UMA_MD_SMALL_ALLOC)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26851
2020-11-11 17:16:39 +00:00
Jonathan T. Looney
7b516613aa When destroying a UMA zone which has a reserve (set with
uma_zone_reserve()), messages like the following appear on the console:
"Freed UMA keg (Test zone) was not empty (0 items). Lost 528 pages of
memory."

When keg_drain_domain() is draining the zone, it tries to keep the number
of items specified in the reservation. However, when we are destroying the
UMA zone, we do not need to keep those items. Therefore, when destroying a
non-secondary and non-cache zone, we should reset the keg reservation to 0
prior to draining the zone.

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27129
2020-11-10 18:12:09 +00:00
Mateusz Guzik
3a440a421d Add more per-cpu zones.
This covers powers of 2 up to 64.

Example pending user is ZFS.
2020-11-09 00:34:23 +00:00
Leandro Lupori
e2d6c417e3 Implement superpages for PowerPC64 (HPT)
This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.

The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.

While these changes are not better tested, superpages support is disabled by
default. To enable it, use vm.pmap.superpages_enabled=1.

In this initial implementation, when superpages are disabled, system
performance stays at the same level as without these changes. When
superpages are enabled, buildworld time increases a bit (~2%). However,
for workloads that put a heavy pressure on the TLB the performance boost
is much bigger (see HPC Challenge and pgbench on D25237).

Reviewed by:	jhibbits
Sponsored by:	Eldorado Research Institute (eldorado.org.br)
Differential Revision:	https://reviews.freebsd.org/D25237
2020-11-06 14:12:45 +00:00
Mateusz Guzik
2dee296a3d Rationalize per-cpu zones.
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).

Follow malloc by naming them "pcpu-" + size in bytes.

This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
2020-11-05 15:08:56 +00:00
Mark Johnston
f7db0c9532 vmspace: Convert to refcount(9)
This is mostly mechanical except for vmspace_exit().  There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace.  In that case, upon switching to
vmspace0 we can unconditionally release the reference.

Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.

Reviewed by:	alc, kib, mmel
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27057
2020-11-04 16:30:56 +00:00
Alan Cox
ccfd886a1b Conditionally compile struct vm_phys_seg's md_first field. This field is
only used by arm64's pmap.

Reviewed by:	kib, markj, scottph
Differential Revision:	https://reviews.freebsd.org/D26907
2020-10-23 06:24:38 +00:00
Ed Maste
575a4437a9 uma: fix KTR message after r366840
Reported by:	bz
Sponsored by:	The FreeBSD Foundation
2020-10-19 18:54:44 +00:00