We now set PGA_DEQUEUE on a managed page when it is wired after
allocation, and vm_page_mvqueue() ignores pages with this flag set,
ensuring that they do not end up in the page queues. However, this is
not sufficient for managed fictitious pages or pages managed by the
TTM. In particular, the TTM makes use of the plinks.q queue linkage
fields for its own purposes.
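As an illustrative sketch of the check (the function signature and astate
access are simplified assumptions, not the exact in-tree code):

    static void
    vm_page_mvqueue(vm_page_t m, const uint8_t nqueue)
    {
            vm_page_astate_t as;

            as = vm_page_astate_load(m);
            /* A page marked for lazy dequeue must stay off the page queues. */
            if ((as.flags & PGA_DEQUEUE) != 0)
                    return;
            /* ... otherwise enqueue or requeue the page as before ... */
    }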
PR: 242961
Reported and tested by: Greg V <greg@unrelenting.technology>
This fixes a regression in r356155, introduced at the last minute. In
particular, we must clear PGA_REQUEUE_HEAD before inserting into any
queue besides PQ_INACTIVE since that operation is implemented only for
PQ_INACTIVE.
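A minimal sketch of the rule (variable names illustrative):

    /*
     * PGA_REQUEUE_HEAD is implemented only for PQ_INACTIVE, so clear it
     * before inserting the page into any other queue.
     */
    if (nqueue != PQ_INACTIVE)
            new.flags &= ~PGA_REQUEUE_HEAD;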
Reported by: pho, Jenkins via lwhsu
The previous series of patches orphaned some vm_page functions, so
remove them.
Reviewed by: dougm, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22886
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885
Some recent work aims to remove the use of the page lock for
synchronizing updates to page queue state. This change adds a mechanism
to preserve the existing behaviour of lazily dequeuing wired pages,
which was previously synchronized using the page lock.
Handle this by setting PGA_DEQUEUE when a managed page's wire count
transitions from 0 to 1. When the page daemon encounters a page with a
flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that
page, but in so doing it does not modify the page itself and thus racing
with a concurrent free of the page is harmless. The flag is advisory;
the page daemon still checks for wirings after acquiring the object and
page xbusy locks.
vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition.
It must do this before dropping the reference to avoid a use-after-free,
but it also handles races with concurrent wirings to ensure that
PGA_DEQUEUE is not left clear on a page that is still wired.
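A minimal sketch of the wiring side (the surrounding 0->1 transition logic
is elided and the placement is illustrative):

    /*
     * On a 0->1 wire transition of a managed page, mark it for lazy
     * dequeue; the page daemon completes the dequeue when it next
     * encounters the page.
     */
    if ((m->oflags & VPO_UNMANAGED) == 0)
            vm_page_aflag_set(m, PGA_DEQUEUE);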
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22882
This is in preparation for eliminating the use of the vm_page lock for
protecting queue state operations.
Introduce the vm_page_pqstate_commit_*() functions. These functions act
as helpers around vm_page_astate_fcmpset() and are specialized for
specific types of operations. vm_page_pqstate_commit() wraps these
functions.
Convert a number of routines to use these new helpers. Use
vm_page_release_toq() in vm_page_unwire() and vm_page_release() to
atomically release a wiring reference and release the page into a queue.
This has the side effect that vm_page_unwire() will leave the page in
the active queue if it is already present there.
Convert the page queue scans to use the new helpers. Simplify
vm_pageout_reinsert_inactive(), which requeues pages that were found to
be busy during an inactive queue scan, to avoid duplicating the work of
vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or
PGA_REQUEUE_HEAD is set, let that be handled during batch processing.
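A sketch of the commit pattern used by the helpers (simplified; the reload
on retry is shown explicitly for clarity):

    vm_page_astate_t new, old;

    do {
            old = vm_page_astate_load(m);
            new = old;
            new.flags |= PGA_REQUEUE;
    } while (!vm_page_pqstate_commit(m, &old, new));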
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22770
Differential Revision: https://reviews.freebsd.org/D22771
Differential Revision: https://reviews.freebsd.org/D22772
Differential Revision: https://reviews.freebsd.org/D22773
Differential Revision: https://reviews.freebsd.org/D22776
This avoids duplicating the work of the page daemon's active queue scan.
Moreover, this duplication was inconsistent:
- PGA_REFERENCED is not counted in act_count unless pmap_ts_referenced()
returns 0, but the page daemon always counts PGA_REFERENCED towards
the activation count.
- The swapout daemon always activates a referenced page, but the page
daemon only does so when the containing object is mapped at least
once.
The main purpose of swapout_deactivate_pages() is to shrink the number
of pages mapped into a given pmap. To do this without unmapping active
pages, use the non-destructive pmap_is_referenced() instead of the
destructive pmap_ts_referenced() and deactivate pages accordingly.
This simplifies some future changes to the locking protocol for page
queue state.
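A minimal sketch of the per-page decision (busy and wired checks elided):

    /*
     * Check the reference bit without clearing it; only deactivate pages
     * that show no recent references.
     */
    if (pmap_is_referenced(m))
            return;
    vm_page_deactivate(m);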
Reviewed by: kib
Discussed with: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22674
In r355270 by me, vm_object_shadow() was changed to handle the
reference counting for the shared case, but the extra reference
taken in vmspace_fork() for the shared/need_copy case was not
removed.
Submitted by: jeff
allocate them with VM_ALLOC_NOOBJ, which means they are not busy. For now,
move the busy assert for the new page in vm_page_replace() into the public
API and out of the private API used by contig reclaim. Fix another issue
where we would leak the busy lock if the page could not be removed from
pmap.
Reported by: pho
Discussed with: markj
the zone size and flags fields in the per-cpu caches. This allows fast
allocations to proceed while touching only the single per-cpu cacheline and
simplifies the common case when no ctor/dtor is specified.
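Illustrative sketch of the layout change (the field names shown are
assumptions for illustration, not the exact uma_int.h definitions):

    struct uma_cache {
            /* ... existing per-cpu buckets and statistics ... */
            uint32_t        uc_size;        /* item size, copied from the zone */
            uint32_t        uc_flags;       /* zone flags, copied from the zone */
    };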
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22826
cache area. This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22825
The macro RB_INITIALIZER ignores its argument, but is documented to
require "&head" as argument to initialize "head". So using
"_vm_phys_fictitious_tree" as the argument to initialize
"vm_phys_fictitious_tree" is an inconsequential error, corrected here.
Discussed with: alc
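For reference, the documented tree(3) idiom behind the fix above (a generic
example, not the vm_phys declaration itself):

    /* The argument names the head being initialized, per tree(3). */
    RB_HEAD(objtree, objnode) head = RB_INITIALIZER(&head);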
vm_page_remove() rather than !vm_page_wired() as the condition for free.
When this was changed back to the wired check, the busy lock was leaked.
Reported by: pho
Reviewed by: markj
removed from objects, including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change its busy handling
to meet these rules, but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.
Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.
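A sketch of the strengthened expectation expressed as an assertion (message
wording illustrative):

    KASSERT(m->object != NULL || !vm_page_xbusied(m),
        ("vm_page_free: freeing xbusy page %p with no object", m));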
Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822
When allocating a replacement page we must clear VPO_UNMANAGED since we
only ever reclaim pages from managed objects. vm_page_replace() does
not handle this for us.
Sprinkle some assertions to help catch this sort of issue.
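A minimal sketch of the fix (variable name illustrative):

    /*
     * The freshly allocated replacement page carries VPO_UNMANAGED;
     * clear it before vm_page_replace() installs the page, since we
     * only ever reclaim pages from managed objects.
     */
    m_new->oflags &= ~VPO_UNMANAGED;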
Reported by: pho
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22868
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626
that if fault fails to progress and needs to restart the loop, it must free
the page it is working on and allocate again on restart. Resolve the few
places that need to be modified to support this condition and simply
deactivate the page. Presently, we only permit this when fault restarts
for busy contention. This has an added benefit of removing some object
trylocking in this case.
While here consolidate some page cleanup logic into fault_page_free() and
fault_page_release() to reduce redundant code and automate some teardown.
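A sketch of the release helper's shape (simplified; the in-tree helpers
also cover additional cases):

    static void
    fault_page_release(vm_page_t *mp)
    {
            vm_page_t m;

            if ((m = *mp) != NULL) {
                    /*
                     * The fault is likely to retry this page; deactivate it
                     * so it stays cheap to reuse while remaining eligible
                     * for pageout.
                     */
                    vm_page_deactivate(m);
                    vm_page_xunbusy(m);
                    *mp = NULL;
            }
    }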
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22653
an exclusive object lock.
Previously swap space was freed on a best-effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This was
done inconsistently and required the object lock, which is not always
convenient.
Instead, track when swap space is present. The first dirty is responsible
for deleting the space or setting PGA_SWAP_FREE, which will trigger
background scans to free the swap space.
Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.
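A minimal sketch of the marking step (PGA_SWAP_SPACE is assumed here as the
"swap space present" marker; the first-dirty detection is elided):

    /*
     * The swap copy is now stale but the object lock is inconvenient to
     * take here; ask the background scan to free the swap space.
     */
    if ((vm_page_astate_load(m).flags & PGA_SWAP_SPACE) != 0)
            vm_page_aflag_set(m, PGA_SWAP_FREE);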
Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654
require the object lock to synchronize collapse. Other swap objects such
as tmpfs do not.
Reported by: mjg
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22747
exec_map_first_page(). This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).
Discussed with: alc
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D22731
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.
Reviewed by: jeff (previous version), markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759
uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(),
but before that assignment, startup_alloc() will use pages from the
reserved pool, so the bucket zones themselves are still allocated using
startup pages.
Reviewed by: rlibby
Reported by: Jenkins via lwhsu
Differential Revision: https://reviews.freebsd.org/D22797
This helps with a bootstrapping problem in upcoming work.
Buckets are first enabled in uma_startup2(), so we can delay
bucket creation until then. The other two paths to bucket_enable() are
both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM)
and one in uma_timeout() (first activated in uma_startup3()). Note that
although some bucket functions are accessible before uma_startup2()
(e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone.
Discussed with: jeff
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22765
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759
Introduce primitives vm_page_astate_load() and vm_page_astate_fcmpset()
to operate on the 32-bit per-page atomic state. Modify
vm_page_pqstate_fcmpset() to use them. No functional change intended.
Introduce PGA_QUEUE_OP_MASK, a subset of PGA_QUEUE_STATE_MASK that only
includes queue operation flags. This will be used in subsequent
patches.
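An illustrative use of the new primitives together (the specific operation
requested here is arbitrary):

    vm_page_astate_t new, old;

    old = vm_page_astate_load(m);
    do {
            if ((old.flags & PGA_QUEUE_OP_MASK) != 0)
                    break;          /* a queue operation is already pending */
            new = old;
            new.flags |= PGA_REQUEUE;
    } while (!vm_page_astate_fcmpset(m, &old, new));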
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22753
vm_swapout_object_deactivate_pages() is renamed to
vm_swapout_object_deactivate(), and the loop body is moved into the new
vm_swapout_object_deactivate_page(). This makes the code a bit easier
to follow and is in preparation for some functional changes.
Reviewed by: jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22651
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
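A sketch of the embedded state (field widths and order are illustrative):

    typedef union vm_page_astate {
            struct {
                    uint16_t        flags;          /* atomic PGA_* flags */
                    uint8_t         queue;          /* page queue index */
                    uint8_t         act_count;      /* activation count */
            };
            uint32_t        _bits;                  /* whole-word atomic access */
    } vm_page_astate_t;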
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
to its successor in cases where examining a map entry requires a
helper like kvm_read_all. Use that method, with kvm_read_all, to fix
procstat_getfiles_kvm, which currently tries to find the successor without
using such a helper. This addresses a problem introduced by r355491.
Reviewed by: markj (previous version)
Discussed with: kib
Differential Revision: https://reviews.freebsd.org/D22728
The current vnode layout is not SMP-friendly: frequently read data
avoidably shares cachelines with very frequently modified fields. In
particular, v_iflag, inspected for VI_DOOMED, can be found in the same
cacheline as v_usecount. Instead, make it available in the same cacheline
as v_op, v_data and v_type, which all get read all the time.
v_type is avoidably 4 bytes while the necessary data easily fits in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
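A sketch of the resulting read-mostly grouping (field offsets and neighbors
are illustrative; the new flag word is shown here as v_irflag):

    struct vnode {
            enum vtype       v_type:8;      /* vnode type, now a single byte */
            short            v_irflag;      /* read-mostly flags, e.g. VIRF_DOOMED */
            struct vop_vector *v_op;        /* vnode operations vector */
            void            *v_data;        /* private data for fs */
            /*
             * Frequently modified fields such as v_usecount and v_iflag
             * live further down, on different cachelines.
             */
    };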
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
zones would never be freed. In the case of tmpfs this was not true. While
here, test for the right bit to disable the keg-related sysctls for zones
that don't have kegs.
Reported by: pho
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22655
A valid reference is all that is required. If we race with a deallocation
we will harmlessly misidentify the type of an already dead object.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22636
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22609