no longer need an object lock. This reduces the longest hold times and
eliminates some trylock code blocks.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23034
The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033
paging.
Shadow objects are marked with a COLLAPSING flag while they are collapsing with
their backing object. This gives us an explicit test rather than overloading
paging-in-progress. While split is on-going we mark an object with SPLIT.
These two operations will modify the swap tree so they must be serialized
and swap_pager_getpages() can now directly detect these conditions and page
more conservatively.
Callers to vm_object_collapse() now will reliably wait for a collapse to finish
so that the backing chain is as short as possible before other decisions are
made that may inflate the object chain. For example, split, coalesce, etc.
It is now safe to run fault concurrently with collapse. It is safe to increase
or decrease paging in progress with no lock so long as there is another valid
ref on increase.
This change makes collapse more reliable as a secondary benefit. The primary
benefit is making it safe to drop the object lock much earlier in fault or
never acquire it at all.
This was tested with a new shadow chain test script that uncovered long
standing bugs and will be integrated with stress2.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22908
Some systems, such as higher end Threadripper, may have
NUMA domains with no physical memory. Don't allocate
from these domains.
This fixes a "panic: vm_wait in early boot" on my 2990WX desktop.
Reviewed by: jeff
Sponsored by: Netflix
page that was previously mapped read-only it exists in pmap until pmap_enter()
returns. However, we held no reference to the original page after the copy
was complete. This allowed vm_object_scan_all_shadowed() to collapse an
object that still had pages mapped. To resolve this, add another page pointer
to the faultstate so we can keep the page xbusy until we're done with
pmap_enter(). Handle busy pages in scan_all_shadowed. This is already done
in vm_object_collapse_scan().
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23155
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102
r355004 removed the return statement from this loop with the intention of
also calling uma_reclaim_wakeup(). But with vm.lowmem_period=0 this causes
an infinite loop.
Reviewed by: markj
Sponsored by: iXsystems, Inc.
By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).
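The arithmetic behind the "512 per slab" figure can be sketched as follows.
This is an illustration only: it assumes a 4096-byte slab and a free bitmap
with one bit per item stored in 64-bit words, and the names SLAB_SIZE,
slab_max_items() and slab_bitmap_bytes() are hypothetical, not UMA's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: a slab is one 4 KiB page. */
#define SLAB_SIZE	4096u
#define BITS_PER_WORD	64u

/* Maximum number of items of the given size that fit in one slab. */
unsigned
slab_max_items(unsigned item_size)
{
	assert(item_size > 0);
	return (SLAB_SIZE / item_size);
}

/* Bytes of free bitmap needed to track nitems, rounded up to whole words. */
size_t
slab_bitmap_bytes(unsigned nitems)
{
	return ((nitems + BITS_PER_WORD - 1) / BITS_PER_WORD *
	    sizeof(uint64_t));
}
```

Under these assumptions an 8-byte item size gives 512 items and a 64-byte
bitmap, twice the 32 bytes needed for 256 items, which is why sizing every
slab header for the largest case would waste memory.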
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149
respectively. The tunable controls the size of the per-CPU
vm page cache. Previously the value was split among all CPUs in the system,
so configuring the same value on machines with different CPU counts
yielded a different cache size available to each CPU.
Reviewed by: markj
Obtained from: Netflix
Some kernel subsystems, notably ZFS, will destroy UMA zones from a
shutdown eventhandler. This causes the zone to be drained. For slabs
that are mapped into KVA this can be very expensive and so it needlessly
delays the shutdown process.
Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once
kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
into a no-op, provided that the zone does not have a custom finalization
routine.
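A minimal user-space sketch of the check described above. Only the
BOOT_SHUTDOWN name comes from this message; the struct layout, BOOT_RUNNING,
and zone_destroy() are hypothetical stand-ins for the real UMA code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the "booted" state; only BOOT_SHUTDOWN is real. */
enum boot_state { BOOT_RUNNING, BOOT_SHUTDOWN };

struct zone {
	void (*fini)(void *item);	/* custom finalization routine, or NULL */
	bool drained;			/* set once the zone has been drained */
};

enum boot_state booted = BOOT_RUNNING;

/* Returns true if the zone was actually drained and torn down. */
bool
zone_destroy(struct zone *z)
{
	assert(z != NULL);
	/* During shutdown, skip the drain unless a finalizer must run. */
	if (booted == BOOT_SHUTDOWN && z->fini == NULL)
		return (false);
	z->drained = true;
	return (true);
}
```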
PR: 242427
Reviewed by: jeff, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23066
Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
- using the padding of the final item to store the slab header,
- not going OFFPAGE if we have a choice unless it improves efficiency.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23048
- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.
The net effect of this should be to make the contract with clients
clearer. Clients should choose constraints; UMA will figure out how to
implement them. This also breaks the confusing double meaning of
OFFPAGE.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23016
MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure but the specified alignment did not match
reality and involved redundant defines.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046
Linux mmap rejects mmap() on a write-only file with EACCES.
linux_mmap_common currently does a fun dance to grab the fp associated with
the passed in fd, validates it, then drops the reference and calls into
kern_mmap(). Doing so is perhaps both fragile and premature; there's still
plenty of chance for the request to get rejected with a more appropriate
error, and it's prone to a race where the file we ultimately mmap has
changed after it drops its reference.
This change alleviates the need to do this by providing a kern_mmap variant
that allows the caller to inspect the fp just before calling into the fileop
layer. The callback takes flags, prot, and maxprot as one could imagine
scenarios where any of these, in conjunction with the file itself, may
influence a caller's decision.
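The callback pattern can be sketched roughly as below. This is a user-space
model, not the kernel interface: struct file_model, the FM_* flags,
check_fp_fn, linux_check_fp() and mmap_with_check() are all hypothetical
names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define FM_READ		0x1	/* file opened for reading */
#define FM_WRITE	0x2	/* file opened for writing */

struct file_model {
	int f_flag;
};

/* Caller-supplied check; returns 0 to proceed or an errno value to reject. */
typedef int check_fp_fn(const struct file_model *fp, int flags, int prot,
    int maxprot);

/* Linux semantics from the text: a write-only file cannot be mapped. */
int
linux_check_fp(const struct file_model *fp, int flags, int prot, int maxprot)
{
	(void)flags;
	(void)prot;
	(void)maxprot;
	if ((fp->f_flag & (FM_READ | FM_WRITE)) == FM_WRITE)
		return (EACCES);
	return (0);
}

/* Stand-in for the mmap variant: run the check just before mapping. */
int
mmap_with_check(const struct file_model *fp, int flags, int prot, int maxprot,
    check_fp_fn *check)
{
	int error;

	assert(fp != NULL);
	if (check != NULL && (error = check(fp, flags, prot, maxprot)) != 0)
		return (error);
	/* The real code would call into the fileop layer here, with fp held. */
	return (0);
}
```

Because the check runs on the same file reference that is later mapped, the
window in which the descriptor could be swapped out from under the caller
no longer exists in this model.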
The file type check in the linux compat layer has been removed; EINVAL is
seemingly not an appropriate response to the file not being a vnode or
device. The fileop layer will reject the operation with ENODEV if it's not
supported, which more closely matches the common linux description of
mmap(2) return values.
If we discover that we're allowing an mmap() on a file type that Linux
normally wouldn't, we should restrict those explicitly.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22977
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert. Resolve this by rewriting the COLD asserts to
look at the per-cpu allocation counts for evidence of api activity.
Discussed with: rlibby
Reviewed by: markj
Reported by: lwhsu
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN. The system will now select a default depending
on kernel configuration. API users need only specify one if they want to
override the default.
Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.
Reviewed by: markj
Discussed with: rlibby
Differential Revision: https://reviews.freebsd.org/D22831
onto their respective bucket lists. This improves contention on the keg
lock under heavy free traffic by several orders of magnitude while
requiring only an additional bucket per-domain worth of memory.
Discussed with: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22830
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-numa and use a single keg
lock to protect the hash table.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828
sleepq to serialize sleepers. This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup
in one case but otherwise should be bug for bug compatible. In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.
Discussed with: markj
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22827
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
The page daemon loops may move pages back to the active queue if
references are detected. In this case we must take care to clear
existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be
set, and that flag is only valid if the page belongs to the inactive
queue.
Also fix a bug in the active queue scan where we were updating "old"
instead of "new". This would only have been hit in rare cases where the
page moved out of the active queue after the beginning of the scan.
Reported by: Bob Prohaska, Idwer Vollering
Tested by: Idwer Vollering
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D23001
entry in the vm_map, making invariants related to the max_free entry
field invalid. Move the clipping work into vm_map_entry_link, so that
linking is okay when the new entry clips a current entry, and the
vm_map doesn't have to be briefly corrupted. Change assertions and
conditions in SPLAY_{LEFT,RIGHT}_STEP since the max_free invariants
can now be trusted in all cases.
Tested by: pho
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D22897
We now set PGA_DEQUEUE on a managed page when it is wired after
allocation, and vm_page_mvqueue() ignores pages with this flag set,
ensuring that they do not end up in the page queues. However, this is
not sufficient for managed fictitious pages or pages managed by the
TTM. In particular, the TTM makes use of the plinks.q queue linkage
fields for its own purposes.
PR: 242961
Reported and tested by: Greg V <greg@unrelenting.technology>
This fixes a regression in r356155, introduced at the last minute. In
particular, we must clear PGA_REQUEUE_HEAD before inserting into any
queue besides PQ_INACTIVE since that operation is implemented only for
PQ_INACTIVE.
Reported by: pho, Jenkins via lwhsu
The previous series of patches orphaned some vm_page functions, so
remove them.
Reviewed by: dougm, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22886
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885
Some recent work aims to remove the use of the page lock for
synchronizing updates to page queue state. This change adds a mechanism
to preserve the existing behaviour of lazily dequeuing wired pages,
which was previously synchronized using the page lock.
Handle this by setting PGA_DEQUEUE when a managed page's wire count
transitions from 0 to 1. When the page daemon encounters a page with a
flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that
page, but in so doing it does not modify the page itself and thus racing
with a concurrent free of the page is harmless. The flag is advisory;
the page daemon still checks for wirings after acquiring the object and
page xbusy locks.
vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition.
It must do this before dropping the reference to avoid a use-after-free
but also handles races with concurrent wirings to ensure that
PGA_DEQUEUE is not left unset on a wired page.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22882
This is in preparation for eliminating the use of the vm_page lock for
protecting queue state operations.
Introduce the vm_page_pqstate_commit_*() functions. These functions act
as helpers around vm_page_astate_fcmpset() and are specialized for
specific types of operations. vm_page_pqstate_commit() wraps these
functions.
Convert a number of routines to use these new helpers. Use
vm_page_release_toq() in vm_page_unwire() and vm_page_release() to
atomically release a wiring reference and release the page into a queue.
This has the side effect that vm_page_unwire() will leave the page in
the active queue if it is already present there.
Convert the page queue scans to use the new helpers. Simplify
vm_pageout_reinsert_inactive(), which requeues pages that were found to
be busy during an inactive queue scan, to avoid duplicating the work of
vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or
PGA_REQUEUE_HEAD is set, let that be handled during batch processing.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22770
Differential Revision: https://reviews.freebsd.org/D22771
Differential Revision: https://reviews.freebsd.org/D22772
Differential Revision: https://reviews.freebsd.org/D22773
Differential Revision: https://reviews.freebsd.org/D22776
This avoids duplicating the work of the page daemon's active queue scan.
Moreover, this duplication was inconsistent:
- PGA_REFERENCED is not counted in act_count unless pmap_ts_referenced()
returned 0, but the page daemon always counts PGA_REFERENCED towards
the activation count.
- The swapout daemon always activates a referenced page, but the page
daemon only does so when the containing object is mapped at least
once.
The main purpose of swapout_deactivate_pages() is to shrink the number
of pages mapped into a given pmap. To do this without unmapping active
pages, use the non-destructive pmap_is_referenced() instead of the
destructive pmap_ts_referenced() and deactivate pages accordingly.
This simplifies some future changes to the locking protocol for page
queue state.
Reviewed by: kib
Discussed with: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22674
In r355270 by me, vm_object_shadow() was changed to handle the
reference counting for the shared case, but the extra reference that
was done in vmspace_fork() for the shared/need_copy case was not
removed.
Submitted by: jeff
allocate them with VM_ALLOC_NOOBJ which means they are not busy. For now
move the busy assert for the new page in vm_page_replace into the public
api and out of the private api used by contig reclaim. Fix another issue
where we would leak busy if the page could not be removed from pmap.
Reported by: pho
Discussed with: markj
the zone size and flags fields in the per-cpu caches. This allows fast
allocations to proceed only touching the single per-cpu cacheline and
simplifies the common case when no ctor/dtor is specified.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22826
cache area. This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22825
The macro RB_INITIALIZER ignores its argument, but is documented to
require "&head" as argument to initialize "head". So using
"_vm_phys_fictitious_tree" as the argument to initialize
"vm_phys_fictitious_tree" is an inconsequential error, corrected here.
Discussed with: alc
vm_page_remove() rather than !vm_page_wired() as the condition for free.
When this changed back to wired the busy lock was leaked.
Reported by: pho
Reviewed by: markj
removed from objects including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.
Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.
Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822
When allocating a replacement page we must clear VPO_UNMANAGED since we
only ever reclaim pages from managed objects. vm_page_replace() does
not handle this for us.
Sprinkle some assertions to help catch this sort of issue.
Reported by: pho
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22868
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626
that if fault fails to progress and needs to restart the loop it must free
the page it is working on and allocate again on restart. Resolve the few
places that need to be modified to support this condition and simply
deactivate the page. Presently, we only permit this when fault restarts
for busy contention. This has an added benefit of removing some object
trylocking in this case.
While here consolidate some page cleanup logic into fault_page_free() and
fault_page_release() to reduce redundant code and automate some teardown.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22653
an exclusive object lock.
Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.
Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.
Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.
Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654
require the object lock to synchronize collapse. Other swap objects such
as tmpfs do not.
Reported by: mjg
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22747
exec_map_first_page(). This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).
Discussed with: alc
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D22731
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.
Reviewed by: jeff (previous version), markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759
uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(),
but before that assignment, startup_alloc() will use pages from the
reserved pool, so the bucket zones themselves are still allocated using
startup pages.
Reviewed by: rlibby
Reported by: Jenkins via lwhsu
Differential Revision: https://reviews.freebsd.org/D22797
This helps with a bootstrapping problem in upcoming work.
We don't first enable buckets until uma_startup2(), so we can delay
bucket creation until then. The other two paths to bucket_enable() are
both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM)
and one in uma_timeout() (first activated in uma_startup3()). Note that
although some bucket functions are accessible before uma_startup2()
(e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone.
Discussed with: jeff
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22765
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759
Introduce primitives vm_page_astate_load() and vm_page_astate_fcmpset()
to operate on the 32-bit per-page atomic state. Modify
vm_page_pqstate_fcmpset() to use them. No functional change intended.
Introduce PGA_QUEUE_OP_MASK, a subset of PGA_QUEUE_STATE_MASK that only
includes queue operation flags. This will be used in subsequent
patches.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22753
vm_swapout_object_deactivate_pages() is renamed to
vm_swapout_object_deactivate(), and the loop body is moved into the new
vm_swapout_object_deactivate_page(). This makes the code a bit easier
to follow and is in preparation for some functional changes.
Reviewed by: jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22651
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
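The snapshot-and-cmpset idiom that the structure enables can be sketched
with C11 atomics as follows. The field layout and the names astate_t,
astate_load(), astate_fcmpset() and page_set_queue() are assumptions made
for this example and do not reproduce the kernel definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit layout; the field names are assumptions. */
typedef union {
	struct {
		uint16_t flags;
		uint8_t queue;
		uint8_t act_count;
	};
	uint32_t _bits;
} astate_t;

static_assert(sizeof(astate_t) == sizeof(uint32_t), "state must be 32 bits");

struct page {
	_Atomic uint32_t a;
};

/* Take a snapshot of the whole state into a stack variable. */
astate_t
astate_load(struct page *m)
{
	astate_t s;

	s._bits = atomic_load_explicit(&m->a, memory_order_relaxed);
	return (s);
}

/* On failure, *old is refreshed with the current value for a retry. */
bool
astate_fcmpset(struct page *m, astate_t *old, astate_t new)
{
	return (atomic_compare_exchange_weak(&m->a, &old->_bits, new._bits));
}

/* Example update: change the queue index without holding any lock. */
void
page_set_queue(struct page *m, uint8_t queue)
{
	astate_t old, new;

	old = astate_load(m);
	do {
		new = old;
		new.queue = queue;
	} while (!astate_fcmpset(m, &old, new));
}
```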
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
to its successor in cases where examining a map entry requires a
helper like kvm_read_all. Use that method, with kvm_read_all, to fix
procstat_getfiles_kvm, which tries to find the successor now without
using such a helper. This addresses a problem introduced by r355491.
Reviewed by: markj (previous version)
Discussed with: kib
Differential Revision: https://reviews.freebsd.org/D22728
The current vnode layout is not SMP-friendly: frequently read data
avoidably shares cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
zones would never be freed. In the case of tmpfs this was not true. While
here test for the right bit to disable the keg related sysctls for zones
that don't have kegs.
Reported by: pho
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22655
A valid reference is all that is required. If we race with a deallocation
we will harmlessly misidentify the type of an already dead object.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22636
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22609
space. Where the vm_map tree now has null pointers, store pointers to
next and previous entries in right and left fields, making the binary
tree threaded. Have the predecessor and successor functions compute
what the prev and next fields previously stored.
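A simplified model of a successor lookup in such a threaded tree. It assumes
distinct keys and a header sentinel whose right link points to the first
entry and whose left link points to the last; struct entry and entry_succ()
are illustrative names, not the vm_map code.

```c
#include <assert.h>
#include <stddef.h>

struct entry {
	int key;
	struct entry *left;	/* left child, or thread to the predecessor */
	struct entry *right;	/* right child, or thread to the successor */
};

struct entry *
entry_succ(struct entry *e)
{
	struct entry *after;

	assert(e->right != NULL);	/* threaded: there are no null links */
	after = e->right;
	/*
	 * If the right link is a thread, it already names the successor and
	 * the successor's left link leads to a key no larger than e's.
	 */
	if (after->left->key > e->key) {
		/* A real right child: find the leftmost node below it. */
		do
			after = after->left;
		while (after->left != e);
	}
	return (after);
}
```

The descent stops when a left link threads back to the starting entry, so no
per-node flag is needed to tell threads from children in this model.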
Reviewed by: markj, kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D21964
Summary:
This matches r351198 from amd64. This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similarly to the
existing code. Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains. On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain. After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short. This means
some inner domains may have pages accounted in earlier domains.
On Book-E the page array is set up at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0
overhead for touching the vm_page_array, which reduces up to one TLB
miss per array access.
Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21449
Suppose that the map entry is wired, so that we later assign
fault_type = entry->protection. Suppose further that we jump back to
RetryLookup. Then fault_type will no longer contain the original
fault protection mask, but instead that of the wired entry.
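The hazard can be modeled in a few lines. The sketch below assumes that each
retry may find a different map entry and shows one way to avoid the problem,
resetting the working mask from the original argument on every pass; the
names PROT_R, PROT_W, struct map_entry and lookup_fault_type() are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define PROT_R	0x1
#define PROT_W	0x2

struct map_entry {
	int protection;
	bool wired;
};

/*
 * passes[i] is the entry found on the i-th trip through the retry label.
 * Returns the mask that the final pass would use for its access checks.
 */
int
lookup_fault_type(const struct map_entry *passes, int npasses,
    int fault_typea)
{
	int fault_type = fault_typea;
	int i;

	assert(npasses >= 1);
	for (i = 0; i < npasses; i++) {
		/* Restart from the caller's mask, not the previous pass. */
		fault_type = fault_typea;
		if (passes[i].wired)
			fault_type = passes[i].protection;
	}
	return (fault_type);
}
```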
Submitted by: Wuyang Chung <wuyang.chung1@gmail.com>
Reviewed by: kib
MFC after: 3 days
Github PR: https://github.com/freebsd/freebsd/pull/419
Differential Revision: https://reviews.freebsd.org/D22683
If the starting pindex is equal to object->size, there is nothing to do.
This was harmless since the rest of vm_map_pmap_enter() has no effect
when psize == 0.
Submitted by: Wuyang Chung <wuyang.chung1@gmail.com>
Reviewed by: alc, dougm, kib
MFC after: 1 week
Github PR: https://github.com/freebsd/freebsd/pull/417
Differential Revision: https://reviews.freebsd.org/D22678
These are cast to uma_import and uma_release functions. Use the signature
for these in the zone functions.
This was found with an experimental Kernel CFI. It will complain if the
signature is different than what a function pointer expects. The
simplest way to fix these is to correct the signature.
Reviewed by: rlibby
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22671
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.
Reviewed by: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22611
The handle value is stable for all shadow objects in the inheritance
chain. This avoids descending the shadow chain to get to the
bottom of it in vm_map_entry_set_vnode_text(), and eliminates the
corresponding object relocking, which appeared to be contended.
Change vm_object_allocate_anon() and vm_object_shadow() to handle more
of the cred/charge initialization for the new shadow object, in
addition to setting up the handle.
Reported by: jeff
Reviewed by: alc (previous version), jeff (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22541
broken in r355082. Reduce some locking in nearby related object type
checks.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22565
unset until the object is recycled so this check is stable. Now that we
can acquire the ref without a lock it is not necessary to group these
operations and we can avoid it entirely in many cases.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22565
(e.g. root->left = NULL) to affect the behavior of that function. This
change stops that data manipulation, and instead calls a pair of
functions, one for the left direction and the other for the right,
with the function called depending whether or not we currently null
the root child in that direction to control the behavior of
vm_map_splay_merge.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22589
union members in vm_page.h to store the zone and slab. Remove some nearby
dead code.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22564
more statistics than are exported via the ABI stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22554
On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function. Now, it only
requires that uminit and fini be NULL; the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.
Also do a little refactoring for readability of the resulting logic.
This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).
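A rough model of chaining poison checks with caller-supplied hooks. The
0xdeadc0de pattern, the call order, and the names struct zone, zone_ctor()
and zone_dtor() are assumptions for illustration; the real UMA code differs
in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JUNK	0xdeadc0deu	/* assumed poison pattern */

/* Hypothetical per-zone hooks; user_ctor and user_dtor may be NULL. */
struct zone {
	size_t size;		/* item size, a multiple of 4 bytes here */
	void (*user_ctor)(void *item);
	void (*user_dtor)(void *item);
};

/* Allocation path: verify the poison, then run the caller's ctor. */
int
zone_ctor(const struct zone *z, void *item)
{
	const uint32_t *p = item;
	size_t i;

	assert(z->size % sizeof(*p) == 0);
	for (i = 0; i < z->size / sizeof(*p); i++)
		if (p[i] != JUNK)
			return (-1);	/* item was modified after free */
	if (z->user_ctor != NULL)
		z->user_ctor(item);
	return (0);
}

/* Free path: run the caller's dtor, then poison the item. */
void
zone_dtor(const struct zone *z, void *item)
{
	uint32_t *p = item;
	size_t i;

	if (z->user_dtor != NULL)
		z->user_dtor(item);
	for (i = 0; i < z->size / sizeof(*p); i++)
		p[i] = JUNK;
}
```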
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20722
omit the object lock if we are above a certain threshold. Hold only a
single vnode reference when the vnode object has any ref > 0. This
allows us to only lock the object and vnode on 0-1 and 1-0 transitions.
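The idea of taking locks only on 0-1 and 1-0 transitions can be sketched
with C11 atomics. This is a single-threaded model: struct obj,
obj_reference() and obj_release() are hypothetical, and the slow_paths
counter stands in for the places where the real code would lock the object
and vnode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_uint ref;
	unsigned slow_paths;	/* transitions that would take the locks */
};

/* Fast path: bump the count only if it is already non-zero. */
static bool
ref_acquire_if_not_zero(atomic_uint *ref)
{
	unsigned old = atomic_load(ref);

	while (old != 0)
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return (true);
	return (false);
}

void
obj_reference(struct obj *o)
{
	if (ref_acquire_if_not_zero(&o->ref))
		return;
	/* 0 -> 1: the real code would lock and take the vnode reference. */
	o->slow_paths++;
	atomic_fetch_add(&o->ref, 1);
}

void
obj_release(struct obj *o)
{
	unsigned old = atomic_load(&o->ref);

	/* Fast path: drop the count while other references remain. */
	while (old > 1)
		if (atomic_compare_exchange_weak(&o->ref, &old, old - 1))
			return;
	assert(old == 1);
	/* 1 -> 0: the real code would lock and drop the vnode reference. */
	o->slow_paths++;
	atomic_fetch_sub(&o->ref, 1);
}
```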
Differential Revision: https://reviews.freebsd.org/D22452
make sense after many partial refactors. Attempt to make a smaller cache
footprint for the fast path.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22470
Regression from r352174. In the vm_page_rename() failure case we forgot
to unlock the vm object locks before sleeping and reacquiring them.
Reviewed by: jeff
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22542
'entry'. Where 'entry' is used to identify the starting point for
iteration, use 'first_entry'. These are the naming conventions used in
most of the vm_map.c code. Where VM_MAP_ENTRY_FOREACH can be used, do
so. Squeeze a few lines to fit in 80 columns. Where lines are being
modified for these reasons, look to remove style(9) violations.
Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D22458
Note that the change in vm_object_collapse() is arguably a correctness
fix. We must not collapse into content-identity carrying objects.
Reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22467
Record as many bits from curthread into busy_lock as fit. Low bits
of the struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except for the
statically allocated thread0). Upper bits are not very interesting
for asserts, and in most practical situations the recorded value should
allow the owner to be identified manually with certainty.
Assert that unbusy is performed by the owner, except in a few places where
unbusy is done in an I/O completion handler. For this case, add
_unchecked variants of asserts and unbusy primitives.
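A small sketch of packing an owner pointer into a 32-bit lock word. The
3-bit flag mask, the 16-byte alignment, and the names struct thread_model,
busy_encode() and busy_owned_by() are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUSY_FLAG_MASK	0x7u	/* assumed: low 3 bits hold busy flags */
#define BUSY_EXCLUSIVE	0x2u

/* Stand-in for struct thread; alignment keeps the low pointer bits zero. */
struct thread_model {
	_Alignas(16) int tid;
};

/* Keep the low 32 pointer bits and store the flags in the zero bits. */
uint32_t
busy_encode(const struct thread_model *td, uint32_t flags)
{
	uintptr_t p = (uintptr_t)td;

	assert((p & BUSY_FLAG_MASK) == 0);	/* relies on alignment */
	assert((flags & ~BUSY_FLAG_MASK) == 0);
	return ((uint32_t)p | flags);
}

/* Compare the recorded owner bits against the given thread. */
bool
busy_owned_by(uint32_t lockword, const struct thread_model *td)
{
	return ((lockword & ~BUSY_FLAG_MASK) ==
	    ((uint32_t)(uintptr_t)td & ~BUSY_FLAG_MASK));
}
```

Since the upper pointer bits are dropped, two threads whose addresses differ
only above bit 31 would compare equal here, which matches the message's
point that the value is good enough for asserts rather than exact.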
Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22298
Stop subtracting 1024/200 from vmd_page_count/200. I cannot see how
such precise accounting can make a difference on modern systems.
Add some explanation of what the page daemon does and how it handles
memory shortages.
Reviewed by: dougm
Discussed with: jeff, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22396
Use the UMA reclaim thread to asynchronously drain all caches if
there is a severe shortage in a domain. Otherwise we only trigger UMA
reclamation every 10s even when the system has completely run out of
memory.
Stop entirely draining the caches when one domain falls below its min
threshold. In some workloads it is normal for one NUMA domain to end
up being nearly depleted by kernel memory allocations, for example for
the ZFS ARC. The domainset iterators skip domains below the
vmd_min_free threshold on the first iteration, so we should allow that
mechanism to limit further depletion of the domain's free pages before
taking the extreme step of calling uma_reclaim(UMA_RECLAIM_DRAIN_CPU).
Discussed with: jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22395
- Remove the cnt == 1 check. UMA passes cnt == 1 when it has disabled
per-CPU caching. In this case we might as well just allocate a single
page and return it to the caller, since the caller is going to do
exactly that anyway if the UMA cache allocation attempt fails.
- Don't replenish caches if the domain is severely short on free pages.
With large buckets we may otherwise quickly exacerbate a situation
where the page daemon is failing to keep up.
- Don't replenish caches if the calling thread belongs to the page
daemon, which should avoid creating extra memory pressure when it is
trying to free memory. Virtually all such allocations will occur in
the context of laundering, where the laundry thread must allocate
slabs for various swap and I/O-related UMA zones.
Reviewed by: kib
Discussed with: alc, jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22394
In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU. This was to mitigate some
issues reported with the system not able to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache. This change re-enables those caches.
The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones. Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested. Remove its return value since it has no use.
Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages. The limit can be overridden by the
vm.pgcache_zone_max tunable as before.
Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.
Reviewed by: gallatin, jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22393
We were not properly handling the case where the trylock of the
reservation fails, in which case we could leak the reservation lock.
Introduce a marker reservation to implement precise scanning in
vm_reserv_reclaim_contig(). Before, a race could result in early
termination of the scan in rare situations. Use the marker's lock to
serialize scans of the partpop queue so that a global marker structure
can be used. Modify vm_reserv_reclaim_inactive() to handle the presence
of a marker while minimizing the hold time of domain-global locks.
Reviewed by: alc, jeff, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22392
entry, when that entry has been seen already, keep the
already-looked-up value in a variable and use that instead of looking
it up again.
Approved by: alc, markj (earlier version), kib (earlier version)
Differential Revision: https://reviews.freebsd.org/D22348
so that we avoid the hashtables. The hashtable is now only required if
a zone is created with OFFPAGE specified initially, not internally. This
flag signals to UMA that it can't touch the allocated memory and so
can't store a slab pointer in the containing page.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22453
redundant complicated checks and additional locking required only for
anonymous memory. Introduce vm_object_allocate_anon() to create these
objects. DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.
Reviewed by: alc, dougm, kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22119
only. Rename it swp_pager_meta_lookup. Stop checking for obj->type
== swap there and assert it instead. Make the caller responsible for
the obj->type check.
Move the meta_ctl 'pop' functionality to swap_pager_unswapped, the
only place that uses it, and assume obj->type == swap there too.
Assisted by: ota_j.email.ne.jp
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22437
We currently have the per-domain partially populated reservation queues
and the per-domain queue locks. Define a new per-domain padded
structure to contain both of them. This puts the queue fields and lock
in the same cache line and avoids the false sharing within the old queue
array.
Also fix field packing in the reservation structure. In many places we
assume that a domain index fits in 8 bits, so we can do the same there
as well. This reduces the size of the structure by 8 bytes.
Update some comments while here. No functional change intended.
Reviewed by: dougm, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22391
We are now out of aflags bits, whereas the "flags" field only makes use
of five of its sixteen bits, so narrow "flags" to eight bits. I have no
intention of adding a new aflag in the near future, but would like to
combine the aflags, queue and act_count fields into a single atomically
updated word. This will allow vm_page_pqstate_cmpset() to become much
simpler and is a step towards eliminating the use of the page lock array
in updating per-page queue state.
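
To make the goal concrete, here is a minimal sketch of a single atomically
updated state word using C11 atomics. The struct, the 8-bit field widths,
the bit positions and the function names are assumptions for illustration.
This is not the actual struct vm_page layout or the real
vm_page_pqstate_cmpset().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical unpacked view of the per-page queue state. */
struct pq_state {
	uint8_t	aflags;
	uint8_t	queue;
	uint8_t	act_count;
};

/* Pack the three fields into one 32-bit word. */
static uint32_t
pq_pack(struct pq_state s)
{
	return ((uint32_t)s.aflags | ((uint32_t)s.queue << 8) |
	    ((uint32_t)s.act_count << 16));
}

/* Split a 32-bit word back into its three fields. */
static struct pq_state
pq_unpack(uint32_t w)
{
	struct pq_state s = {
		.aflags = (uint8_t)(w & 0xff),
		.queue = (uint8_t)((w >> 8) & 0xff),
		.act_count = (uint8_t)((w >> 16) & 0xff),
	};

	return (s);
}

/*
 * Replace the whole state in one step if it still equals *old.
 * On failure, *old is refreshed with the current state.
 */
static bool
pq_cmpset(_Atomic uint32_t *word, struct pq_state *old,
    struct pq_state new_state)
{
	uint32_t expected = pq_pack(*old);

	if (atomic_compare_exchange_strong(word, &expected,
	    pq_pack(new_state)))
		return (true);
	*old = pq_unpack(expected);
	return (false);
}
```

With all three fields in one word, a single compare-and-swap either
publishes a consistent new state or reports the state that beat it.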
The change modifies the layout of struct vm_page, so bump
__FreeBSD_version.
Reviewed by: alc, dougm, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22397
entries are stabilized, repeatedly verifies the same entry. Check each
entry in turn.
Reviewed by: kib (code only), alc
Tested by: pho
MFC after: 7 days
Differential Revision: https://reviews.freebsd.org/D22405
around entry->{next,prev} when those are used for ordered list
traversal, and use those wrapper functions everywhere. Where the next
field is used for maintaining a stack of deferred operations, #define
defer_next to make that different usage clearer, and then use the
'right' pointer instead of 'next' for that purpose.
Approved by: markj
Tested by: pho (as part of a larger patch)
Differential Revision: https://reviews.freebsd.org/D22347
exploits the sparsity of allocated blocks in a range, without
issuing an "are you there?" query for every block in the range.
swap_pager_copy() is not so smart. Modify the implementation
of swap_pager_meta_free() slightly so that swap_pager_copy()
can use that smarter implementation too.
Based on an observation of: Yoshihiro Ota (ota_j.email.ne.jp)
Reviewed by: kib, alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22280
r354367 is reverted since it is subsumed by this more complete approach.
Suggested by: markj
Reviewed by: alc, glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22242
consistency checking slows performance dramatically. This change
reduces the number of assertions checked by completely walking the
vm_map tree only when the write-lock is released, and only then if the
number of modifications to the tree since the last walk exceeds the
number of tree nodes.
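
The throttling rule described above can be sketched as a small counter
check. The struct and function names here are hypothetical and are not the
vm_map.c implementation. The sketch only shows why the full walk costs
amortized constant work per modification.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping kept alongside the map. */
struct map_check {
	unsigned int nentries;	/* current number of tree nodes */
	unsigned int nupdates;	/* modifications since the last full walk */
};

/* Called on every modification made under the write lock. */
static void
map_note_update(struct map_check *mc)
{
	mc->nupdates++;
}

/*
 * Called when the write lock is released. Returns true when a full
 * consistency walk is due, and resets the modification counter.
 */
static bool
map_should_walk_on_unlock(struct map_check *mc)
{
	if (mc->nupdates <= mc->nentries)
		return (false);
	mc->nupdates = 0;
	return (true);
}
```

Since a walk visits each node once and is triggered only after more
modifications than nodes, its cost is spread over those modifications.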
Reviewed by: alc, kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22163
Before the page busy code was converted to make direct use of
sleepqueues, this was handled by _sleep().
Reported by: glebius
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Since r354156 we may call release_page() without the page's object lock
held, specifically following the page copy during a CoW fault.
release_page() must therefore unbusy the page only after scheduling the
requeue, to avoid racing with a free of the page. Previously, the
object lock prevented this race from occurring.
Add some assertions that were helpful in tracking this down.
Reported by: pho, syzkaller
Tested by: pho
Reviewed by: alc, jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22234
The early counter mock can only be used on the BSP on amd64. When APs
try to update it, random memory corruption results.
N.B. This is a temporary patch to plug the corruption for now, while
a proper solution for handling cache zones in zone_foreach() is being
developed.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
flag and use the same system.
This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116
Certain consumers still need to guarantee a stable reference so we cannot
switch entirely to atomics yet. Exclusive lock holders can still modify
and examine the refcount without using the ref api.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21598
When it is set to 0 (the default), a heavy Netflix-style web workload
suffers from heavy lock contention on the vm page free queue called from
vm_page_zone_{import,release}() as the buckets are frequently drained.
When setting the maxcache, this contention goes away.
We should eventually try to autotune this, as well as make this
zone eligible for uma_reclaim().
Reviewed by: alc, markj
Not Objected to by: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22112
r353890 introduced a case where we may call release_page() with
fs.m == NULL, since the fault handler may now lock the vnode prior
to allocating a page for a page-in.
Reported by: jhb
Reviewed by: kib
MFC with: r353890
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22120
We now assert that a page is busy when updating its validity-tracking
state, but bogus_page is not busied during a getpages operation.
Reported by: syzkaller
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22124
A caller that does not guarantee that a page's identity won't change
while sleeping for a busy lock must specify either NOWAIT or WAITFAIL.
Reported by: syzkaller
Reviewed by: alc, kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22124
except for filesystems that set MNTK_VMSETSIZE_BUG. Set the flag for ZFS.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883
The flag specifies that the vm_fault() handler should check the vnode's
vm_object size under the vnode lock. It is converted into the object's
OBJ_SIZEVNLOCK flag in vnode_pager_alloc().
Tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883
The correctness of per-CPU cache accounting in that function is
dependent on reading per-CPU pointers exactly once. Ensure that
the compiler does not emit multiple loads of those pointers.
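
One common way to obtain a single load is to read the pointer through a
volatile-qualified lvalue, as sketched below. The structures and names are
hypothetical, and the text above does not say whether the commit used this
exact technique.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a per-CPU cache and its buckets. */
struct bucket {
	int count;
};

struct cpu_cache {
	struct bucket *alloc_bucket;
	struct bucket *free_bucket;
};

/* Read a bucket pointer through a volatile lvalue so it is loaded once. */
static struct bucket *
load_bucket_once(struct bucket **pp)
{
	return (*(struct bucket * volatile *)pp);
}

/* Sum the cached items using one snapshot of each bucket pointer. */
static int
cache_cached_items(struct cpu_cache *cache)
{
	struct bucket *a = load_bucket_once(&cache->alloc_bucket);
	struct bucket *f = load_bucket_once(&cache->free_bucket);
	int total = 0;

	/* The NULL check and the dereference use the same snapshot. */
	if (a != NULL)
		total += a->count;
	if (f != NULL)
		total += f->count;
	return (total);
}
```

Without the snapshot, the compiler may reload the pointer between the NULL
check and the dereference, so a concurrent update could be seen halfway.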
Reported and tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22081
In low memory conditions a significant number of pages may end up stuck
in the caches, and currently these caches cannot be reaped, leading to
spurious memory allocation failures and OOM kills. So:
- Take into account the fact that we may cache up to two full buckets
of pages per CPU, not just one.
- Increase the amount of RAM required per CPU to enable the caches.
This is a temporary measure until the page cache management policy is
improved.
PR: 241048
Reported and tested by: Kevin Oberman <rkoberman@gmail.com>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22040
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().
Reviewed by: kib
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21860
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594