Both vm_object_scan_all_shadowed() and vm_object_collapse_scan() might
observe an invalid page left in the default backing object by the
fault handler that retried. Check for the condition and refuse to collapse.
Reported and tested by: pho
Reviewed by: jeff
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D23331
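A sketch of the shape of such a check inside the backing-object scan
(vm_page_none_valid() is the existing helper; the surrounding loop and
locking are elided):

    /*
     * Sketch: a retried fault can leave an invalid page in the
     * default backing object.  Seeing one means the scan cannot
     * prove the page is shadowed, so refuse to collapse.
     */
    if (vm_page_none_valid(p))
            return (false);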
Add an inline function, vm_map_lookup_clip_start(), that invokes them
both, and use it in places that invoke both. Drop a couple of local
variables made unnecessary by this function.
Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22987
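A hypothetical reconstruction of the helper; the real signature in
vm_map.c may differ, but vm_map_lookup_entry() and vm_map_clip_start()
are presumably the two primitives it wraps:

    /*
     * Sketch: find the entry containing 'start' and clip it so an
     * entry boundary falls exactly at 'start'; otherwise return the
     * first entry past 'start'.
     */
    static inline vm_map_entry_t
    vm_map_lookup_clip_start(vm_map_t map, vm_offset_t start)
    {
            vm_map_entry_t entry;

            if (vm_map_lookup_entry(map, start, &entry))
                    vm_map_clip_start(map, entry, start);
            else
                    entry = entry->next;
            return (entry);
    }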
A submap can only be created from an entry spanning the entire request
range. In particular, vm_map_lookup_entry() must return true and the
returned entry must contain "end".
Since the only use of submaps in FreeBSD is for the static pipe and
execve argument KVA maps, this has no functional effect.
Github PR: https://github.com/freebsd/freebsd/pull/420
Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> (original)
Reviewed by: dougm, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23299
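In code, the acceptance test is presumably of this shape (a sketch, not
the literal vm_map_submap() source):

    /* Sketch: only a single entry spanning [start, end) qualifies. */
    if (vm_map_lookup_entry(map, start, &entry) && entry->end >= end) {
            /* 'entry' spans the whole range; mark it as a submap. */
            result = KERN_SUCCESS;
    } else {
            result = KERN_INVALID_ARGUMENT;
    }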
Add a new VM return code, KERN_RESTART, which means: deallocate and
restart the fault.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23303
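A sketch of how a caller consumes the new code; the helper names below
are hypothetical stand-ins for the fault handler's internals:

    RetryFault:
            result = fault_attempt(&fs);            /* hypothetical */
            if (result == KERN_RESTART) {
                    /* Deallocate fault state and start over. */
                    fault_deallocate(&fs);          /* hypothetical */
                    goto RetryFault;
            }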
This additionally fixes a potential bug/pessimization where we could fail to
reload the original fault_type on restart.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23301
UMA zone structures have two arrays at the end which are sized according
to the machine: an array of CPU count length, and an array of NUMA
domain count length. The CPU counting was wrong in the case where some
CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the
second array to be overlaid with the first.
Reported by: olivier
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23318
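The fix presumably amounts to sizing the per-CPU array by the highest
CPU ID rather than the CPU count; a simplified sketch of the sizing
expression:

    /*
     * CPU IDs may be sparse when some CPUs are disabled, so the
     * per-CPU array needs mp_maxid + 1 slots, not mp_ncpus, or the
     * trailing per-domain array lands on top of it.
     */
    size = sizeof(struct uma_zone) +
        (mp_maxid + 1) * sizeof(struct uma_cache) +
        vm_ndomains * sizeof(struct uma_zone_domain);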
Previously UMA had some false negatives in the leak report at keg
destruction time, where it only reported leaks if there were free items
in the slab layer (rather than allocated items), which notably would not
be true for single-item slabs (large items). Now, report a leak if
there are any allocated pages, and calculate and report the number of
allocated items rather than free items.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23275
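The new accounting presumably reduces to arithmetic of this shape (keg
field names assumed; a sketch only):

    /* Report a leak whenever pages are still allocated to the keg. */
    if (keg->uk_pages > 0) {
            /* allocated items = slabs * items-per-slab - free items */
            items = (keg->uk_pages / keg->uk_ppera) * keg->uk_ipers -
                keg->uk_free;
            printf("Freed UMA keg (%s) was not empty (%d items). "
                "Lost %d pages of memory.\n",
                keg->uk_name, items, keg->uk_pages);
    }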
no longer need an object lock. This reduces the longest hold times and
eliminates some trylock code blocks.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23034
The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here, add some
missing paging-in-progress calls and an assert. The object handle is now
protected
explicitly with pip.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033
Make collapse synchronization more explicit and allow it to complete
during paging.
Shadow objects are marked with a COLLAPSING flag while they are
collapsing with their backing object. This gives us an explicit test
rather than overloading paging-in-progress. While a split is ongoing,
we mark an object with SPLIT. These two operations modify the swap tree,
so they must be serialized, and swap_pager_getpages() can now directly
detect these conditions and page more conservatively.
Callers of vm_object_collapse() will now reliably wait for a collapse to
finish so that the backing chain is as short as possible before other
decisions are made that may inflate the object chain, for example split,
coalesce, etc.
It is now safe to run fault concurrently with collapse. It is safe to increase
or decrease paging in progress with no lock so long as there is another valid
ref on increase.
This change makes collapse more reliable as a secondary benefit. The primary
benefit is making it safe to drop the object lock much earlier in fault or
never acquire it at all.
This was tested with a new shadow chain test script that uncovered long
standing bugs and will be integrated with stress2.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22908
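A simplified sketch of the serialization (flag name per the log; the
real code also coordinates with paging-in-progress and the swap pager):

    VM_OBJECT_WLOCK(object);
    vm_object_set_flag(object, OBJ_COLLAPSING);
    VM_OBJECT_WUNLOCK(object);

    /* ... migrate pages and swap metadata from backing_object ... */

    VM_OBJECT_WLOCK(object);
    vm_object_clear_flag(object, OBJ_COLLAPSING);
    wakeup(object);         /* waiters re-check the flag */
    VM_OBJECT_WUNLOCK(object);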
Some systems, such as higher-end Threadripper, may have
NUMA domains with no physical memory. Don't allocate
from these domains.
This fixes a "panic: vm_wait in early boot" on my 2990WX desktop.
Reviewed by: jeff
Sponsored by: Netflix
When we copy-on-write a page that was previously mapped read-only, it
exists in the pmap until pmap_enter() returns. However, we held no
reference to the original page after the copy was complete. This allowed
vm_object_scan_all_shadowed() to collapse an object that still had pages
mapped. To resolve this, add another page pointer to the faultstate so
we can keep the page xbusy until we're done with pmap_enter(). Handle
busy pages in vm_object_scan_all_shadowed(), as is already done in
vm_object_collapse_scan().
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23155
Use careful ordering to allocate early pages in the same way boot pages
were, but only as needed. After the KVA allocator has started up, we
allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102
r355004 removed the return statement from this loop with the intention
of also calling uma_reclaim_wakeup(). But in the case of
vm.lowmem_period=0 it causes an infinite loop.
Reviewed by: markj
Sponsored by: iXsystems, Inc.
By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149
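Selection between the two header zones might look like this; the names
and set sizes here are hypothetical illustrations of the split:

    static uma_zone_t slabzones[2];     /* assumed global pair */

    #define SLABZONE0_SETSIZE   64      /* ordinary items */
    #define SLABZONE1_SETSIZE   512     /* tiny (8-byte) items */

    static uma_zone_t
    slabzone(int ipers)
    {
            /* Pick the header zone whose bitset is large enough. */
            return (slabzones[ipers > SLABZONE0_SETSIZE]);
    }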
respectively. The tunable controls the size of the per-CPU vm page
cache. Previously the value was split among all CPUs in the system, so
configuring the same value on machines with different CPU counts yielded
different cache sizes available to a particular CPU.
Reviewed by: markj
Obtained from: Netflix
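Roughly, the change in interpretation (variable and tunable names below
are assumptions, not the literal source):

    /* Old: one global budget divided among all CPUs. */
    maxcache = pgcache_zone_max / mp_ncpus;

    /* New: the tunable names the per-CPU cache size directly. */
    maxcache = pgcache_zone_max_pcpu;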
Some kernel subsystems, notably ZFS, will destroy UMA zones from a
shutdown eventhandler. This causes the zone to be drained. For slabs
that are mapped into KVA this can be very expensive and so it needlessly
delays the shutdown process.
Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once
kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
into a no-op, provided that the zone does not have a custom finalization
routine.
PR: 242427
Reviewed by: jeff, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23066
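From the description, the early return is presumably shaped like this
(a sketch; the real uma_zdestroy() may test for a custom finalizer
differently):

    void
    uma_zdestroy(uma_zone_t zone)
    {
            /*
             * Draining mapped slabs during shutdown is wasted work,
             * so skip teardown unless a custom finalization routine
             * has to run.
             */
            if (booted == BOOT_SHUTDOWN && zone->uz_fini == NULL)
                    return;
            /* ... normal zone teardown ... */
    }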
Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
- using the padding of the final item to store the slab header,
- not going OFFPAGE if we have a choice unless it improves efficiency.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23048
- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.
The net effect of this should be to make the contract with clients more
clear. Clients should choose constraints, UMA will figure out how to
implement them. This also breaks the confusing double meaning of
OFFPAGE.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23016
MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure but the specified alignment did not match
reality and involved redundant defines.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046
Linux mmap rejects mmap() on a write-only file with EACCES.
linux_mmap_common currently does a fun dance to grab the fp associated with
the passed in fd, validates it, then drops the reference and calls into
kern_mmap(). Doing so is perhaps both fragile and premature; there's still
plenty of chance for the request to get rejected with a more appropriate
error, and it's prone to a race where the file we ultimately mmap has
changed after it drops its reference.
This change alleviates the need to do this by providing a kern_mmap variant
that allows the caller to inspect the fp just before calling into the fileop
layer. The callback takes flags, prot, and maxprot as one could imagine
scenarios where any of these, in conjunction with the file itself, may
influence a caller's decision.
The file type check in the linux compat layer has been removed; EINVAL is
seemingly not an appropriate response to the file not being a vnode or
device. The fileop layer will reject the operation with ENODEV if it's not
supported, which more closely matches the common linux description of
mmap(2) return values.
If we discover that we're allowing an mmap() on a file type that Linux
normally wouldn't, we should restrict those explicitly.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22977
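The callback might be shaped like this sketch (the typedef and function
names are assumptions based on the description):

    /* Inspect the fp just before the fileop layer is invoked. */
    typedef int (*mmap_check_fp_fn)(struct file *fp, int prot,
        int maxprot, int flags);

    static int
    linux_mmap_check_fp(struct file *fp, int prot, int maxprot,
        int flags)
    {
            /* Linux rejects mmap() of a write-only file with EACCES. */
            if ((fp->f_flag & FREAD) == 0)
                    return (EACCES);
            return (0);
    }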
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert. Resolve this by rewriting the COLD asserts to
look at the per-cpu allocation counts for evidence of API activity.
Discussed with: rlibby
Reviewed by: markj
Reported by: lwhsu
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN. The system will now select a default depending
on kernel configuration. API users need only specify one if they want to
override the default.
Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.
Reviewed by: markj
Discussed with: rlibby
Differential Revision: https://reviews.freebsd.org/D22831
onto their respective bucket lists. This is a several-order-of-magnitude
improvement in contention on the keg lock under heavy free traffic while
requiring only an additional bucket per-domain worth of memory.
Discussed with: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22830
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-NUMA and use a single keg
lock to protect the hash table.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828
sleepq to serialize sleepers. This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup
in one case but otherwise should be bug-for-bug compatible. In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.
Discussed with: markj
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22827
Filesystems which want to use it in a limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
The page daemon loops may move pages back to the active queue if
references are detected. In this case we must take care to clear
existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be
set, and that flag is only valid if the page belongs to the inactive
queue.
Also fix a bug in the active queue scan where we were updating "old"
instead of "new". This would only have been hit in rare cases where the
page moved out of the active queue after the beginning of the scan.
Reported by: Bob Prohaska, Idwer Vollering
Tested by: Idwer Vollering
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D23001
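A sketch of the corrected update in the atomic page-astate style used by
this series (names approximate; PGA_QUEUE_OP_MASK stands for the set of
pending queue-operation flags):

    vm_page_astate_t old, new;

    old = vm_page_astate_load(m);
    do {
            new = old;
            /*
             * Clear pending queue-op flags on the new state, not the
             * old one; PGA_REQUEUE_HEAD is only valid while the page
             * is in PQ_INACTIVE.
             */
            new.flags &= ~PGA_QUEUE_OP_MASK;
            new.queue = PQ_ACTIVE;
    } while (!vm_page_astate_fcmpset(m, &old, new));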
entry in the vm_map, making invariants related to the max_free entry
field invalid. Move the clipping work into vm_map_entry_link, so that
linking is okay when the new entry clips a current entry, and the
vm_map doesn't have to be briefly corrupted. Change assertions and
conditions in SPLAY_{LEFT,RIGHT}_STEP since the max_free invariants
can now be trusted in all cases.
Tested by: pho
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D22897
We now set PGA_DEQUEUE on a managed page when it is wired after
allocation, and vm_page_mvqueue() ignores pages with this flag set,
ensuring that they do not end up in the page queues. However, this is
not sufficient for managed fictitious pages or pages managed by the
TTM. In particular, the TTM makes use of the plinks.q queue linkage
fields for its own purposes.
PR: 242961
Reported and tested by: Greg V <greg@unrelenting.technology>
This fixes a regression in r356155, introduced at the last minute. In
particular, we must clear PGA_REQUEUE_HEAD before inserting into any
queue besides PQ_INACTIVE since that operation is implemented only for
PQ_INACTIVE.
Reported by: pho, Jenkins via lwhsu
The previous series of patches orphaned some vm_page functions, so
remove them.
Reviewed by: dougm, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22886
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885