Based on the POWER9BSD implementation, with all POWER9-specific code
removed and new methods added to the PPC64 MMU interface, to isolate
platform-specific code. Currently, the new methods are implemented on
pseries and PowerNV (D21643).
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21551
About 11 minutes of poudriere -s -j 104 and probing the return values of
trylocks reveals that over 10% of attempts fail, which in turn means more
atomic operations are performed than necessary.
Trylocking was there to try to prevent migration, but migration is not
very likely to happen if the lock is uncontested.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21925
vm_page_swapqueue() atomically transitions a page between queues. To do
so, it must hold the page queue lock for the old queue. However, once
the queue index has been updated, the queue lock no longer protects the
page's queue state. Thus, we must speculatively remove the page from
the old queue before committing the queue state update, and roll back if
the update fails.
Reported and tested by: pho
Reviewed by: kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21791
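To make the protocol concrete, here is a minimal user-space model of the
speculative-remove-and-roll-back pattern using C11 atomics; the state
packing and all names are illustrative, not the kernel's:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct page {
	_Atomic uint32_t pqstate;	/* packed queue index and flags */
};

#define	PQSTATE(q, fl)	(((uint32_t)(q) << 16) | (fl))

/* Move a page from oldq to newq; the old queue's lock is held. */
static bool
page_swapqueue(struct page *m, uint16_t oldq, uint16_t newq, uint16_t fl)
{
	uint32_t old = PQSTATE(oldq, fl);

	/* Step 1: speculatively unlink the page from oldq (not shown). */
	/*
	 * Step 2: publish the new queue index.  The CAS fails if the
	 * state changed, e.g. a dequeue was scheduled concurrently.
	 */
	if (!atomic_compare_exchange_strong(&m->pqstate, &old,
	    PQSTATE(newq, fl))) {
		/* Step 3: roll back by re-linking the page into oldq. */
		return (false);
	}
	/* Success: the caller may now link the page into newq. */
	return (true);
}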
For now, just count batched page queue state operations.
vm.stats.page.queue_ops counts the number of batch entries that
successfully completed, while queue_nops counts entries that had no
effect, which occurs when the queue operation had been completed before
the batch entry was processed.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21782
This function loaded the page's queue index before setting PGA_DEQUEUE.
In this window the page daemon may have deactivated the page, updating
its queue index. Make the operation atomic using vm_page_pqstate_cmpset();
the page daemon will not modify the page once it observes that PGA_DEQUEUE
is set.
Reported and tested by: pho
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
After r352110 the attempt to remove mappings of the page being replaced
may fail if the page is wired. In this case we must free the replacement
page.
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
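A hedged sketch of the new KPIs in use, with object and pindex assumed
to be in scope and error handling elided; the signatures follow the
description above:

vm_page_t m;

VM_OBJECT_WLOCK(object);
/* Look the page up, but do not allocate one if it is absent. */
m = vm_page_grab(object, pindex, VM_ALLOC_NOCREAT);

/* Grab a valid page, paging it in from the backing store if needed. */
if (vm_page_grab_valid(&m, object, pindex, VM_ALLOC_NORMAL) !=
    VM_PAGER_OK) {
	/* The page could not be paged in. */
}
VM_OBJECT_WUNLOCK(object);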
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
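The counting scheme can be modelled in a few lines of C11.  The flag
names below follow the committed implementation; everything else in the
model is illustrative:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define	VPRC_BLOCKED	0x40000000u	/* mappings are being removed */
#define	VPRC_OBJREF	0x80000000u	/* object owns a reference */

struct page {
	_Atomic uint32_t ref_count;	/* flags live in the top bits */
};

/* vm_page_wire(): unconditional; object or busy lock held by caller. */
static void
page_wire(struct page *m)
{
	atomic_fetch_add(&m->ref_count, 1);
}

/*
 * vm_page_wire_mapped(): fails while a pmap is concurrently removing
 * the page's mappings, so pmap_extract_and_hold() can fall back to
 * the fault handler.
 */
static bool
page_wire_mapped(struct page *m)
{
	uint32_t old = atomic_load(&m->ref_count);

	do {
		if ((old & VPRC_BLOCKED) != 0)
			return (false);
	} while (!atomic_compare_exchange_weak(&m->ref_count, &old,
	    old + 1));
	return (true);
}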
Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page. Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.
Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon. In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks. However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required. vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.
Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset(). Use it in the active queue scan, which
does not use the object lock. Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.
Reviewed by: kib
Discussed with: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21257
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose. One consequence of that change is that the vm_page array is no
longer present in minidumps, which include only pages mapped above
VM_MIN_KERNEL_ADDRESS.
To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.
Reviewed by: kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21491
- Don't bother masking off non-queue state flags when loading the
page's atomic state, since it is only required for one of the
function's assertions. Update the assertion instead.
- Remove an incorrect comment regarding synchronization with the
page daemon. The page daemon only ever checks for PGA_ENQUEUED
with the page queue lock held.
- When clearing requeue flags, only clear the flags that have been
acted upon.
Reviewed by: kib (previous version)
Discussed with: alc
Tested by: pho (part of a larger patch)
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21368
All existing callers guarantee that the page does not have a
pre-existing dequeue pending. Thus, if the page is dequeued before
pqbatch_submit() acquires the page queue lock, we do not need to do
anything since vm_page_dequeue_complete() takes care of clearing all
page queue state flags for us.
With this change, vm_page_pqbatch_submit() has the nice property that it
does not directly modify any fields in the page structure.
Reviewed by: alc, kib
Tested by: pho (part of a larger change)
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21372
It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent. No functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21369
- Add a vm_pagequeue_remove() function to physically remove a page
from its queue and update the queue length.
- Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue()
return NULL for dequeued pages.
- Avoid unnecessarily reloading the queue index if vm_page_dequeue()
loses a race with a concurrent queue operation.
- Correct an always-true assertion: vm_page_dequeue() may be called
from the page allocator with the page unlocked. The assertion
m->order == VM_NFREEORDER simply tests whether the page has been
removed from the vm_phys free lists; instead, check whether the
page belongs to an object.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21341
NUMA domain that the pages describe. Patch originally by gallatin.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21252
NUMA-aware boot-time memory allocator that will be used to allocate
early domain-correct structures. Code partially submitted by gallatin.
Reviewed by: gallatin, kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21251
In addition to the pagedaemon initiating OOM, also do it from the
vm_fault() internals. Namely, if a thread waits for a free page to
satisfy a page fault for some preconfigured amount of time, trigger OOM.
These triggers are rate-limited, to handle the usual case of several
threads of the same multi-threaded process entering the fault handler
simultaneously. The faults from pagedaemon threads participate in the
calculation of the OOM rate but are not subject to the limit.
Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13671
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page. Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.
As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O. This is
consistent with the handling of pages by sendfile(SF_NOCACHE).
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20986
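A hedged sketch of the new KPI at a direct I/O completion site, with ma
and npages assumed in scope; the VPR_TRYFREE flag name is an assumption
not spelled out above:

/*
 * Release direct I/O pages: atomically unwire each page, and free it
 * if this drops the last reference (VPR_TRYFREE is assumed here).
 */
for (i = 0; i < npages; i++)
	vm_page_release(ma[i], VPR_TRYFREE);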
I would like to use the name vm_page_release() for a different purpose,
and vm_page_{import,release}() are local to vm_page.c.
Reviewed by: kib
MFC after: 1 week
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case. Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20859
Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.
Reviewed by: dougm, kib
Discussed with: alc, jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20858
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase. In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break. Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.
The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.
Reviewed by: alc
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20763
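A user-space model of the flag-gated cache; the kernel builds this on
UMA cache zones, so the sizes and names here are illustrative only:

#include <stddef.h>

#define	PG_PCPU_CACHE	0x01		/* page came from a per-CPU cache */
#define	CACHE_SIZE	64

struct page { int flags; };

struct pcpu_cache {
	struct page *pages[CACHE_SIZE];
	int cnt;
} cache;				/* one instance per CPU, really */

struct page *
page_alloc_cached(void)
{
	struct page *m;

	if (cache.cnt == 0)
		return (NULL);		/* fall back to vm_phys */
	m = cache.pages[--cache.cnt];
	m->flags |= PG_PCPU_CACHE;
	return (m);
}

void
page_free(struct page *m)
{
	/*
	 * Only cache-allocated pages return to the cache; anything
	 * else, e.g. a page from a broken reservation, goes back to
	 * the buddy allocator so that it has a chance to coalesce.
	 */
	if ((m->flags & PG_PCPU_CACHE) != 0 && cache.cnt < CACHE_SIZE)
		cache.pages[cache.cnt++] = m;
	else
		m->flags &= ~PG_PCPU_CACHE;	/* vm_phys_free_pages() here */
}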
Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.
This is a step towards an implementation of an atomic reference counter
for each physical page structure.
Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758
The listed rules were incomplete and outdated. There is a much more
comprehensive comment in vm_page.h.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20503
as part of a false start toward fine-grained reservation locking. In the
end, they were not needed, so eliminate them.
Order the parameters to vm_reserv_alloc_{contig,page}() consistently with
the vm_page functions that call them.
Update the comments about the locking requirements for
vm_reserv_alloc_{contig,page}(). They no longer require a free page
queues lock.
Wrap several lines that became too long after the "req" and "domain"
parameters were added to vm_reserv_alloc_{contig,page}().
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20492
Use it instead of accessing the wire_count field directly. No
functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time. The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.
The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.
One beneficiary of this change is in breaking reservations. For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.
Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901
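The interior of a freed range decomposes into maximal aligned
power-of-two blocks that need no buddy search; a runnable sketch of that
decomposition, with page frame numbers standing in for vm_page pointers
and 12 standing in for VM_NFREEORDER - 1:

#include <stdio.h>

/* Enqueue [start, end) as maximal aligned power-of-two blocks. */
static void
enqueue_contig(unsigned long start, unsigned long end)
{
	int order;

	while (start < end) {
		order = 0;
		while ((start & (1UL << order)) == 0 &&
		    start + (2UL << order) <= end && order < 12)
			order++;
		printf("block at %#lx, order %d\n", start, order);
		start += 1UL << order;
	}
}

int
main(void)
{
	enqueue_contig(3, 20);	/* e.g. freeing a reservation's interior */
	return (0);
}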
This is a step towards being able to free pages without the page
lock held. The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held. Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.
No functional change intended.
Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19065
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order. Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued. Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.
Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19039
two zones sharing a keg may have different limits. Now this is going
to work:
zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);
Kegs no longer have the uk_maxpages field, but zones have uz_items.
When set, it may be rounded up to the minimum possible CPU bucket cache
size. For small limits the bucket cache can also be reconfigured to be
smaller. The counter uz_items is updated whenever items transition from
the keg to a bucket cache or directly to a consumer. If a zone has
uz_maxitems set and it is reached, then we are going to sleep.
o Since new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago and never had
a practical use. In discussion with Jeff we came to a wild agreement that
if we ever want to reintroduce the idea of a smart allocator that can
choose between two (or more) totally different backing stores, that
choice should be made one level higher than UMA, e.g. in malloc(9) or in
mget(), or whatever, and the choice should be controlled by the caller.
o Sleeping code is improved to account for the number of sleepers and
wake them one by one, to avoid the thundering herd problem.
o The UMA_ZONE_NOBUCKETCACHE flag is removed; instead, the
uma_zone_set_maxcache() KPI is added. Having no bucket cache basically
means setting maxcache to 0.
o Now, with many fields added and many removed (no multi-keg zones!),
make sure that struct uma_zone is perfectly aligned.
Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
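A hedged sketch of replacing the removed flag with the new KPI; the zone
name and the ctor/dtor arguments are hypothetical:

/* Formerly: uma_zcreate(..., UMA_ZONE_NOBUCKETCACHE); */
zone = uma_zcreate("myzone", size, ctor, dtor, NULL, NULL,
    UMA_ALIGN_PTR, 0);
uma_zone_set_maxcache(zone, 0);	/* disable the per-CPU bucket cache */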
Otherwise the free page count will not accurately reflect the physical
page allocator's state. On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync. On 12 the
bug would manifest as threads looping in vm_page_alloc().
PR: 231296
Reported by: mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by: alc, kib, mav
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18374
- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.
Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc". The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy. Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.
In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations. To that end, refactor the vm_domainset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.
Reviewed by: jeff (previous version)
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17417
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.
Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278
No functional change intended.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17028
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page. In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.
Reviewed by: alc, kib
Reported and tested by: pho
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17027
other allowed domains if the requested domain is below the minimum paging
threshold. Block in fork only if all domains available to the forking
thread are below the severe threshold rather than any.
Submitted by: jeff
Reported by: mjg
Reviewed by: alc, kib, markj
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D16191
Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276. As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.
Reviewed by: kib
Tested by: pho
Approved by: re (marius)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17025
Expose these counters under the vm.domain sysctl node. The existing
vm.stats.vm.v_pdpages sysctl is preserved.
Reviewed by: alc (previous version)
Differential Revision: https://reviews.freebsd.org/D14666
Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held. When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held. Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.
Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.
While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked(). Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.
Reported and tested by: pho
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D15980
On arm64 (and possibly other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses; however, the runtime linker expects to be able to
relocate these.
In preparation for fixing this, create two macros depending on whether
the data is global or static.
Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140
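A hedged usage sketch of the resulting pair of macros; the data names
here are hypothetical:

#include <sys/param.h>
#include <sys/pcpu.h>

/* Global: the symbol is visible to the runtime linker and relocatable. */
DPCPU_DEFINE(int, pkt_count);

/* File-local: the variant that is problematic in modules on arm64. */
DPCPU_DEFINE_STATIC(int, drop_count);

static void
count_packet(void)
{
	/* Caller is expected to pin itself to a CPU first. */
	DPCPU_SET(pkt_count, DPCPU_GET(pkt_count) + 1);
}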
On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold() on
an arbitrary userspace mapping. If the mapping is backed by the
non-managed cdev pager or by the sg pager, on dense configs we might
access an arbitrary element of vm_page_array[], in particular one not
corresponding to a page from the memory segment. Initialize such pages
as fictitious with the corresponding physical address.
Reported by: bde
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16085
that it does not cause rapid fragmentation of the free physical memory.
Reviewed by: jeff, markj (an earlier version)
Differential Revision: https://reviews.freebsd.org/D15976
The value of m->queue must be cached after comparing it with PQ_NONE,
since it may be concurrently changing.
Reported by: glebius
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D15462
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Introduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.
Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.
Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893
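A user-space model of the batching idea; the batch size and names are
illustrative:

#define	PQ_BATCH_SIZE	8

struct page { int aflags; };

struct pqbatch {
	struct page *pages[PQ_BATCH_SIZE];
	int cnt;
};

/*
 * Called with only the page lock held: the requested operation is
 * encoded in the page's aflags and deferred to this CPU's batch, so
 * the page queue lock is taken once per batch instead of per page.
 */
static void
pqbatch_submit(struct pqbatch *bq, struct page *m, int op)
{
	m->aflags |= op;		/* record the pending operation */
	bq->pages[bq->cnt++] = m;
	if (bq->cnt == PQ_BATCH_SIZE) {
		/* Lock the page queue, apply all entries, unlock. */
		bq->cnt = 0;
	}
}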
They were previously initialized by the corresponding page daemon
threads, but for vmd_inacthead this may be too late if
vm_page_deactivate_noreuse() is called during boot.
Reported and tested by: cperciva
Reviewed by: alc, kib
MFC after: 1 week
Pages allocated from a given reservation may belong to different
objects. It is therefore possible for vm_page_ps_test() to be called
with the base page's object unlocked. Check for this case before
asserting that the object lock is held.
Reported by: jhb
Reviewed by: kib
MFC after: 1 week
SKZ63 Processor May Hang When Executing Code In an HLE Transaction
Region
Problem: Under certain conditions, if the processor acquires an HLE
(Hardware Lock Elision) lock via the XACQUIRE instruction in the Host
Physical Address range between 40000000H and 403FFFFFH, it may hang
with an internal timeout error (MCACOD 0400H) logged into
IA32_MCi_STATUS.
Move the pages from the range into the blacklist. Add a tunable to
not waste 4M if local DoS is not the issue.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15001
per-cpu alloc and free of pages. The cache is filled with as few trips
to the phys allocator as possible by the use of a new
vm_phys_alloc_npages() function which allocates as many as N pages.
This code was originally by markj with the import function rewritten by
me.
Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14905
Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.
Move the kernel to the dedicated numbered cpuset #2, making it possible
to assign kernel threads and memory policy separately from user threads.
This also eliminates the need for the complicated interrupt binding code.
Add a sysctl API for viewing and manipulating domainsets. Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both. This probably belongs in a dedicated subr file.
Attempt to improve the include situation.
Reviewed by: kib
Discussed with: jhb (cpuset parts)
Tested by: pho (before review feedback)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14839
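The interleave policy reduces to one computation; a runnable sketch of
the behavior as described, with a 512-page stride keeping each 2MB
region in a single domain (names illustrative):

#include <stdio.h>

static int
interleave_domain(unsigned long pindex, unsigned long stride, int ndom)
{
	return ((int)((pindex / stride) % ndom));
}

int
main(void)
{
	/* Two domains: indices 0..511 -> 0, 512..1023 -> 1, and so on. */
	for (unsigned long i = 0; i < 2048; i += 512)
		printf("pindex %lu -> domain %d\n", i,
		    interleave_domain(i, 512, 2));
	return (0);
}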
vmd_free_count with atomics.
This allows us to allocate and free from reservations without the free lock
except where a superpage is allocated from the physical layer, which is
roughly 1/512 of the operations on amd64.
Use the counter(9) API to eliminate cache contention on counters.
Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14707
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.
Reviewed by: alc
Tested by: pho
MFC after: 2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
The new page does not belong to a VM object, but the page daemon does
not expect to encounter such pages.
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 week
X-Differential Revision: https://reviews.freebsd.org/D14625
vmd_free_count manipulation. Reduce the scope of the free lock by
using a pageout lock to synchronize sleep and wakeup. Only trigger
the pageout daemon on transitions between states. Drive all wakeup
operations directly as side-effects from freeing memory rather than
requiring an additional function call.
Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14612
There, the pages freed might be managed but the page's lock is not
owned. For KPI correctness, the page lock is required around the call
to vm_page_free_prep(), which is asserted. The reclaim loop already did
the work that would be done by vm_page_free_prep(), so the lock is not
needed and the only consequence of not owning it is the assertion
trigger.
Instead of adding the locking to satisfy the assert, revert to the code
that calls vm_page_free_phys() directly.
Reported by: pho
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
After r328977, a wired page m may have m->queue != PQ_NONE.
Reviewed by: kib
X-MFC with: r328977
Differential Revision: https://reviews.freebsd.org/D14485
use it to regulate page daemon output.
This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.
Reviewed by: bsdimp
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14402
Make vm_wait() take the vm_object argument which specifies the domain
set to wait on for the min condition to pass. If there is no object
associated with the wait, use curthread's policy domainset. The
mechanics of the wait in vm_wait() and vm_wait_domain() are supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait on for the min condition to pass.
Eliminate pagedaemon_wait(). vm_domain_clear() handles the same
operations.
Eliminate the VM_WAIT and VM_WAITPFAULT macros; direct function calls
are enough.
Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.
Sketched and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D14384
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
size of UMA zone allocation is greater than page size. In this case the
zone of zones cannot use UMA_MD_SMALL_ALLOC, and we need to postpone
switching this zone away from startup_alloc() until the VM is fully
launched.
o Always supply the number of VM zones to uma_startup_count(). On
machines with UMA_MD_SMALL_ALLOC, ignore it completely unless zsize goes
over a page. In the latter case, account the VM zones towards the number
of allocations from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch away from
itself any zone that is already capable of running the real alloc.
In the worst-case scenario we may leak a single page here. See the
comment in uma_startup_count().
o Hardcode the call to uma_startup2() into vm_mem_init(). Otherwise some
extra SYSINITs, e.g. vm_page_init(), may sneak in before it.
o While here, remove uma_boot_pages_mtx. With recent changes to the boot
pages calculation, we are guaranteed to use all of the boot_pages in the
early single-threaded stage.
Reported & tested by: mav
o Most startup zones have struct uma_slab embedded into the slab, so
provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
when calculating how many pages certain kinds of allocations would
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we need
+1 when calculating the number of pages for kegs. [1]
o The zones of zones and zones of kegs have an arbitrary alignment of
32, and this also needs to be accounted for. [2]
While here, spread more comments and improve diagnostic messages.
Reported by: pho [1], jtl [2]
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.
The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.
The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.
Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943
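A model of the scan-side half of this change; the names and layout are
illustrative:

#define	PQ_NONE		(-1)

struct page { int wire_count; int queue; };

static void
scan_page(struct page *m)
{
	/*
	 * Wiring no longer dequeues: the page daemon dequeues wired
	 * pages lazily as the scan encounters them, so they are not
	 * revisited on later scans.
	 */
	if (m->wire_count != 0) {
		m->queue = PQ_NONE;
		return;
	}
	/* ... normal active/inactive scan processing ... */
}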
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include the eight VM startup pages in the uma_startup_count()
calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for the two extra allocations done by kmem_init() and
vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using
SYSINIT allowed several other SYSINITs to sneak in before it, thus
bumping the requirement for the amount of boot pages.
for UMA startup.
o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable
buckets, but we can ask the VM for pages. Rename the stages to
meaningful names while here. New list of stages: BOOT_COLD,
BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce the number
of boot pages required. More importantly, the number of zones becomes
consistent across different machines, as no MD allocations are done
before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually
need to use startup_alloc(); however, that may change, so
vm_page_startup() provides its need for early zones as an argument.
o Introduce the uma_startup_count() function, to avoid code duplication.
The function calculates the sizes of the zones zone and kegs zone, and
how many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide the uma_startup_foo() declarations from the public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating the zone of zones size, use (mp_maxid + 1)
instead of mp_ncpus. Use the resulting number not only in the size
argument to zone_ctor() but also as args.size.
Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv,
and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can
be evaluated at runtime to determine whether the architecture has a
direct map; if it unconditionally does not (or does), so that
PMAP_HAS_DMAP is a constant 0 or 1, the compiler can remove the
conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which
had similar things but spelled differently. 32-bit MIPS has a partial
direct map that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
userspace to control NUMA policy administratively and programmatically.
Implement domainset based iterators in the page layer.
Remove the now-legacy numa_* syscalls.
Clean up some header pollution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403
Both issues caused the page daemon to erroneously go to sleep when
applications are consuming free pages at a high rate, leaving the
application threads blocked in VM_WAIT.
1) After completing an inactive queue scan, concurrent allocations may
have prevented the page daemon from meeting the v_free_min threshold.
In this case, the page daemon was going to sleep even when the
inactive queue contained plenty of clean pages.
2) pagedaemon_wakeup() may be called without the free queues lock held.
This can lead to a lost wakeup if a call occurs after the page daemon
clears vm_pageout_wanted but before going to sleep.
Fix 1) by ensuring that we start a new inactive queue scan immediately
if v_free_count < v_free_min after a prior scan.
Fix 2) by adding a new subroutine, pagedaemon_wait(), called from
vm_wait() and vm_waitpfault(). It wakes up the page daemon if either
vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps
on v_free_count.
Reported by: jeff
Reviewed by: alc
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13424
Commit r326346 moved domain iterators from the physical layer to the
vm_page layer, but it also removed the translation from freelist to
flind for the vm_page_alloc_freelist() call. Before the change it
expected a VM_FREELIST_ parameter; after it, it expects a freelist
index.
On small WiFi boxes with a few megabytes of RAM, there is only one
freelist, VM_FREELIST_LOWMEM (1), and there is no VM_FREELIST_DEFAULT
(0) (see sys/mips/include/vmparam.h). This results in freelist 1 having
flind 0.
First, this commit renames flind to freelist in vm_page_alloc_freelist
to avoid misunderstanding about the input parameter. Then, in the
physical layer, it restores the translation for correct handling of the
freelist parameter.
Reported by: landonf
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D13351
This gives a marginal improvement in the vm_page_array initialization
time. Also garbage-collect the now-unused vm_phys_paddr_to_segind().
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13270
Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to
satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter
fictitious pages. For now, allow for this possibility; such pages will be
skipped later in the scan since they are wired.
Reported by: avg
Reviewed by: kib
MFC after: 1 week
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
A fictitious page is always wired, so there is no point in trying to
remove one from the page queues.
Completely remove one inaccurate comment from vm_page_free_prep() and
correct another.
Reviewed by: kib, markj
MFC after: 1 week
This catches some rare mysterious failures at the source. The check
is only performed on architectures which implement a direct map, and
only enabled with option DIAGNOSTIC, similar to other costly
consistency checks.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The variable is modified with the highly contended page free queue lock
held. It unnecessarily shares a cacheline with purely read-only fields
and is re-read after the lock is dropped in the page allocation code,
making the hold time longer.
Pad the variable just like the others and store the value as found with
the lock held instead of re-reading.
Provides a modest 1%-ish speed up in concurrent page faults.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D12665
vm_page_try_to_free() is testing conditions, like clean versus dirty,
that only vary in managed pages.
Suggested by: kib
Reviewed by: markj
X-MFC after: never
can be avoided when the page's containing object has a reference count of
zero. (If the object has a reference count of zero, then none of its pages
can possibly be mapped.)
Address nearby style issues in vm_page_try_to_free(), and change its
return type to "bool".
Reviewed by: kib, markj
MFC after: 1 week
and insertion into the phys allocator free queues vm_page_free_phys().
Also provide a wrapper vm_page_free_phys_pglist() for batched free.
Reviewed by: alc, markj
Tested by: mjg (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.
Reported and tested by: ae, rakuco
Reviewed by: avg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12342
lock if both old and new pages use the same underlying lock. Convert
existing places to use the helper instead of inlining it. Use the
optimization in vm_object_page_remove().
Suggested and reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
We currently initialize the vm_page array in three passes: one to zero
the array, one to initialize the "order" field of each page (necessary
when inserting them into the vm_phys buddy allocator one-by-one), and
one to initialize the remaining non-zero fields and individually insert
each page into the allocator.
Merge the three passes into one following a suggestion from alc:
initialize vm_page fields in a single pass, and use vm_phys_free_contig()
to efficiently insert physical memory segments into the buddy allocator.
This reduces the initialization time to a third or a quarter of what it
was before on most systems that I tested.
Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D12248
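In outline, the merged pass looks like this hedged sketch; the real code
also has to respect phys_avail[] holes within a segment:

struct vm_phys_seg *seg;
vm_page_t m;
vm_paddr_t pa;
int segind;

for (segind = 0; segind < vm_phys_nsegs; segind++) {
	seg = &vm_phys_segs[segind];
	/* One pass: initialize every field of every page... */
	for (m = seg->first_page, pa = seg->start; pa < seg->end;
	    m++, pa += PAGE_SIZE)
		vm_page_init_page(m, pa, segind);
	/* ...then insert the whole segment with one buddy operation. */
	vm_phys_free_contig(seg->first_page,
	    atop(seg->end - seg->start));
}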
While these locks are guaranteed not to share their respective cache
lines, their current placement leaves unnecessary holes in the lines
that precede them. For instance, the annotation of
vm_page_queue_free_mtx allows two neighbouring cache lines (previously
separated by the lock) to be collapsed into one.
The annotation is only effective on architectures which have it
implemented in their linker script (currently only amd64). Thus locks
are not converted to their non-padaligned variants, so as not to affect
the rest.
MFC after: 1 week
This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with largest index that is smaller than the requested
index. vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.
Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().
Suggested by: alc
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11984
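A hedged sketch of the elision in a loop allocating consecutive pages,
with object, pindex and count assumed in scope and the object lock held;
the signature follows the description above:

vm_page_t m, mpred;
int i;

/* The first allocation performs the radix tree lookup internally. */
mpred = m = vm_page_alloc(object, pindex, VM_ALLOC_NORMAL);
/* Later consecutive allocations already know their predecessor. */
for (i = 1; m != NULL && i < count; i++) {
	m = vm_page_alloc_after(object, pindex + i, VM_ALLOC_NORMAL,
	    mpred);
	mpred = m;
}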
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().
Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926
superpage all belong to the same object. To date, that check has not been
needed, but upcoming changes require it. (See the Differential Revision.)
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11556
r31386 changed how the size of the VM page array was calculated to be
less wasteful. For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages. However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array. Handle this case by explicitly excluding the page from
phys_avail[].
Reviewed by: alc
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D11000
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
one respect. When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory. This revision
changes the code to match the comments.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9081
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.
While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9098
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.
Reviewed by: alc
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.
Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc(). Also, eliminate the radix trie lookup performed with the
free page queues lock held.
Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8878
independent layer of the virtual memory system. Update some of the nearby
comments to eliminate redundancy and improve clarity.
In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.
Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8752
contain a vm_page_t at the specified index. However, with this
change, vm_radix_remove() no longer panics. Instead, it returns NULL
if there is no vm_page_t at the specified index. Otherwise, it
returns the vm_page_t. The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations. Instead of performing a lookup before every remove,
the pmap can simply perform the remove.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D8708
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache(). Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc(). This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.
Improve nearby comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8628
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
Bump __FreeBSD_version to mark this change.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497
pages, specifically, dirty pages that have passed once through the inactive
queue. A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them. The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time. In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low. Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system. Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages. This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.
The new laundry thread sleeps while waiting for a request from the page
daemon thread(s). A request is raised by setting the variable
vm_laundry_request and waking the laundry thread. We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering"). When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering. If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back
to sleep without doing any work. When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.
In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target. In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.
A laundry request can be latched while another is currently being
serviced. In particular, a shortfall request will immediately preempt a
background laundering.
This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning
of vm_cnt.v_reactivated now better reflects its name. It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.
In collaboration with: markj
Reviewed by: kib
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8302
Requests that cannot be satisfied by allocators at boot time often
have unrealizable parameters. Waiting for the pagedaemon to start would
hang the boot if done in the thread0 context and would just never succeed
if executed from another thread. In fact, at very early stages a sleep
attempt panics with an obscure diagnostic about the scheduler state, and
an explicit panic in vm_wait() makes the investigation much shorter by
cutting off the examination of the thread and scheduler.
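As a hedged illustration (not the committed diff; the flag name below
is assumed), the early-boot check in vm_wait() amounts to:
	/*
	 * Sketch only: panic instead of sleeping when the pagedaemon
	 * cannot come to the rescue.  vm_pageout_started is an assumed
	 * name for a flag set once the pagedaemon is running.
	 */
	if (!vm_pageout_started)
		panic("vm_wait: pagedaemon not yet started");
	/* ...otherwise record the shortage and sleep, as before. */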
Theoretically, some subsystem might grab a resource to exhaustion and
free it later in the boot process. If this unlikely scenario does show
up for real, the way to diagnose the trouble can be revisited.
Reported by: emaste
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8421
Suppose that we have an exclusively busied page and a thread that can
accept a shared-busied page. In this case, typical code waiting for the
page's xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
		VM_OBJECT_WUNLOCK(object); <---1
		vm_page_busy_sleep(m, "vmopax");
		goto again;
	}
Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that no waiters are recorded for the busy state yet, the xbusy
owner did not acquire the page lock, so it proceeded. Further, suppose
that some other thread happens to share-busy the page after the xbusy
state was relinquished but before m->busy_lock is read in
vm_page_busy_sleep(). Again, that thread only needs the vm_object lock
to proceed. Then, vm_page_busy_sleep() reads a busy_lock value equal to
VPB_SHARERS_WORD(1).
In this case, all tests in vm_page_busy_sleep(9) pass and we go to
sleep, despite the page being share-busied.
Update the check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept the shared-busy state if we only wait for the xbusy state
to pass; see the sketch below.
Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().
Note that the current code does not share-busy pages from parallel
threads; right now the only way to have more than one sbusy owner is
to recurse.
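A hedged sketch of the updated wait test follows; the "nonshared"
parameter name is assumed, and the committed function differs in
detail:
	/*
	 * Sketch, not the committed code.  "nonshared" (assumed name)
	 * is true when the caller waits only for the xbusy state to
	 * pass and can tolerate a share-busied page.
	 */
	u_int x;

	x = m->busy_lock;
	if (x == VPB_UNBUSIED ||
	    (nonshared && (x & VPB_BIT_SHARED) != 0) ||
	    ((x & VPB_BIT_WAITERS) == 0 &&
	    !atomic_cmpset_int(&m->busy_lock, x, x | VPB_BIT_WAITERS))) {
		/* The state we were waiting for has already passed. */
		vm_page_unlock(m);
		return;
	}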
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196
waiters. Otherwise, owners of the shared-busy state are left blocked
and might get into a deadlock.
Note that the vm_page_busy_downgrade() function is not used in the
tree right now.
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8195
by vm_pageout_scan() local to vm_pageout_worker(). There is no reason
to store the pass in the NUMA domain structure.
Reviewed by: kib
MFC after: 3 weeks
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.
Reviewed by: alc, imp, kib
Differential Revision: https://reviews.freebsd.org/D7714
The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.
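A hedged sketch of how the getpages path can apply the new semantics;
the locals and the clamping policy are illustrative, not the committed
code:
	/*
	 * Clamp readbehind/readahead to the contiguous swap-resident
	 * run around the request, as reported by swap_pager_haspage().
	 */
	int before, after;

	if (!swap_pager_haspage(object, ma[0]->pindex, &before, &after))
		return (VM_PAGER_FAIL);
	rbehind = imin(rbehind, before);	/* pages preceding the request */
	rahead = imin(rahead, after);		/* pages following it */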
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7677
of CPUs present. On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.
The major consumers of pages during UMA startup are the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).
This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64). Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger. It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.
The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.
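A hedged sketch of the sizing logic described above; the variable
names approximate the UMA startup code rather than quoting the diff:
	/*
	 * The zone footprint grows with the CPU count because a per-CPU
	 * uma_cache is embedded in each uma_zone.
	 */
	pages_per_zone = howmany(sizeof(struct uma_zone) +
	    sizeof(struct uma_cache) * (mp_maxid + 1), PAGE_SIZE);
	if (pages_per_zone > 1)
		boot_pages = UMA_BOOT_PAGES_ZONES * pages_per_zone;
	else
		boot_pages = UMA_BOOT_PAGES;	/* the unchanged default */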
Tested on: EC2 x1.32xlarge
Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring
vm.boot_pages to be manually adjusted.
Reviewed by: jeff, alc, adrian
Approved by: re (kib)
waiters exist, same as for vm_page_xunbusy(). If the previous value of
busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and a wakeup is
not needed.
Move common code from vm_page_xunbusy_maybelocked() and
vm_page_xunbusy_hard() to vm_page_xunbusy_locked().
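A hedged sketch of the shared helper, close to but not exactly the
committed code:
	static void
	vm_page_xunbusy_locked(vm_page_t m)
	{
		u_int x;

		/* Drop the xbusy state and observe the previous value. */
		x = atomic_swap_int(&m->busy_lock, VPB_UNBUSIED);
		if (x != VPB_SINGLE_EXCLUSIVER)	/* waiters were recorded */
			wakeup(m);
	}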
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
Per the KASSERT at the beginning of the function, we expect that the page
does not belong to any object, so its object and pindex fields are
meaningless. Reset them in the rare case that vm_radix_insert() fails.
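A hedged sketch of the error path; the surrounding context and the
return convention are illustrative only:
	m->object = object;
	m->pindex = pindex;
	if (vm_radix_insert(&object->rtree, m) != 0) {
		/* Leave no stale identity behind on failure. */
		m->object = NULL;
		m->pindex = 0;
		return (1);	/* the caller treats this as ENOMEM */
	}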
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D6669
acquire the page lock, which recurses. Avoid the recursion by reusing
the code from vm_page_remove() in a new helper
vm_page_xunbusy_maybelocked().
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
indicate that threads are waiting for free pages to become available and
(2) to indicate whether a wakeup call has been sent to the page daemon.
The trouble is that a single flag cannot really serve both purposes, because
we have two distinct targets for when to wake up threads waiting for free
pages versus when the page daemon has completed its work. In particular,
the flag will be cleared by vm_page_free() before the page daemon has met
its target, and this can lead to the OOM killer being invoked prematurely.
To address this problem, a new flag "vm_pageout_wanted" is introduced.
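A hedged sketch of the resulting pair of flags; the companion flag
name (vm_pages_needed) and both helper functions are assumed for
illustration:
	static bool vm_pageout_wanted;	/* wakeup sent to the page daemon */
	static bool vm_pages_needed;	/* threads sleeping in vm_wait() */

	/* Waking the daemon no longer touches the sleepers' flag. */
	static void
	pagedaemon_wakeup_sketch(void)
	{
		if (!vm_pageout_wanted) {
			vm_pageout_wanted = true;
			wakeup(&vm_pageout_wanted);
		}
	}

	/* Sleepers are released only once the free target is met. */
	static void
	vm_page_free_wakeup_sketch(void)
	{
		if (vm_pages_needed &&
		    vm_cnt.v_free_count >= vm_cnt.v_free_target) {
			vm_pages_needed = false;
			wakeup(&vm_cnt.v_free_count);
		}
	}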
Discussed with: jeff
Reviewed by: kib, markj
Tested by: markj
Sponsored by: EMC / Isilon Storage Division
for an empty page cache when the object type is OBJT_VNODE.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
freed page as VPO_UNMANAGED. Otherwise vm_page_free_toq() insists on
owning the page lock.
Previously, VPO_UNMANAGED was set only on pages up to the last processed
page.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
pq_vcnt, as a count of real things, has no business being negative. It is only
ever initialized by a u_int counter.
The warning came from the atomic_add_int() in vm_pagequeue_cnt_add().
Rectify the warning by changing the variable to u_int. No functional change.
Suggested by: Clang 3.3
Sponsored by: EMC / Isilon Storage Division
rounddown2 tends to produce longer lines than the original code,
and when the code has a high indentation level it is not really
advantageous to make the replacement.
This tries to strike a balance between the readability of using the
macros and the flexibility of having the open-coded expressions, so
not everything is converted.
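For reference, a typical conversion looks like this (illustrative
example):
	/* Before: open-coded power-of-two truncation. */
	va = addr & ~(PAGE_SIZE - 1);

	/* After: the same result, with the intent named. */
	va = rounddown2(addr, PAGE_SIZE);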