flag and use the same system.
This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116
When it is set to 0 (the default), a heavy Netflix-style web workload
suffers from heavy lock contention on the vm page free queue called from
vm_page_zone_{import,release}() as the buckets are frequently drained.
When setting the maxcache, this contention goes away.
We should eventually try to autotune this, as well as make this
zone eligable for uma_reclaim().
Reviewed by: alc, markj
Not Objected to by: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22112
A caller that does not guarantee that a page's identity won't change
while sleeping for a busy lock must specify either NOWAIT or WAITFAIL.
Reported by: syzkaller
Reviewed by: alc, kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22124
In low memory conditions a significant number of pages may end up stuck
in the caches, and currently these caches cannot be reaped, leading to
spurious memory allocation failures and OOM kills. So:
- Take into account the fact that we may cache up to two full buckets
of pages per CPU, not just one.
- Increase the amount of RAM required per CPU to enable the caches.
This is a temporary measure until the page cache management policy is
improved.
PR: 241048
Reported and tested by: Kevin Oberman <rkoberman@gmail.com>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22040
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
busy acquires while held.
This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592
This persists busy state across operations like rename and replace.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21549
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21551
About 11 minutes of poudriere -s -j 104 and probing on return value of
trylocks reveals that over 10% of attempts fail, which in turn means
there are more atomics performed than necessary.
Trylocking was there to try preventing migration, but it's not very likely
to happen if the lock is uncontested.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21925
vm_page_swapqueue() atomically transitions a page between queues. To do
so, it must hold the page queue lock for the old queue. However, once
the queue index has been updated, the queue lock no longer protects the
page's queue state. Thus, we must speculatively remove the page from
the old queue before committing the queue state update, and roll back if
the update fails.
Reported and tested by: pho
Reviewed by: kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21791
For now, just count batched page queue state operations.
vm.stats.page.queue_ops counts the number of batch entries that
successfully completed, while queue_nops counts entries that had no
effect, which occurs when the queue operation had been completed before
the batch entry was processed.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21782
This function loaded the page's queue index before setting PGA_DEQUEUE.
In this window the page daemon may have deactivated the page, updating
its queue index. Make the operation atomic using vm_page_pqstate_cmpset();
the page daemon will not modify the page once it observes that PGA_DEQUEUE
is set.
Reported and tested by: pho
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
After r352110 the attempt to remove mappings of the page being replaced
may fail if the page is wired. In this case we must free the replacement
page.
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page. Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.
Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon. In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks. However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required. vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.
Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset(). Use it in the active queue scan, which
does not use the object lock. Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.
Reviewed by: kib
Discussed with: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21257
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose. One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.
To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.
Reviewed by: kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21491
- Don't bother masking off non-queue state flags when loading the
page's atomic state, since it is only required for one of the
function's assertions. Update the assertion instead.
- Remove an incorrect comment regarding synchronization with the
page daemon. The page daemon only ever checks for PGA_ENQUEUED
with the page queue lock held.
- When clearing requeue flags, only clear the flags that have been
acted upon.
Reviewed by: kib (previous version)
Discussed with: alc
Tested by: pho (part of a larger patch)
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21368
All existing callers guarantee that the page does not have a
pre-existing dequeue pending. Thus, if the page is dequeued before
pqbatch_submit() acquires the page queue lock, we do not need to do
anything since vm_page_dequeue_complete() takes care of clearing all
page queue state flags for us.
With this change, vm_page_pqbatch_submit() has the nice property that it
does not directly modify any fields in the page structure.
Reviewed by: alc, kib
Tested by: pho (part of a larger change)
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21372
It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent. No functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21369
- Add a vm_pagequeue_remove() function to physically remove a page
from its queue and update the queue length.
- Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue()
return NULL for dequeued pages.
- Avoid unnecessarily reloading the queue index if vm_page_dequeue()
loses a race with a concurrent queue operation.
- Correct an always-true assertion: vm_page_dequeue() may be called
from the page allocator with the page unlocked. The assertion
m->order == VM_NFREEORDER simply tests whether the page has been
removed from the vm_phys free lists; instead, check whether the
page belongs to an object.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21341
NUMA domain that the pages describe. Patch original from gallatin.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21252
NUMA aware boot time memory allocator that will be used to allocate early
domain correct structures. Code partially submitted by gallatin.
Reviewed by: gallatin, kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21251
In addition to pagedaemon initiating OOM, also do it from the
vm_fault() internals. Namely, if the thread waits for a free page to
satisfy page fault some preconfigured amount of time, trigger OOM.
These triggers are rate-limited, due to a usual case of several
threads of the same multi-threaded process to enter fault handler
simultaneously. The faults from pagedaemon threads participate in the
calculation of OOM rate, but are not under the limit.
Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13671
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page. Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.
As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O. This is
consistent with the handling of pages by sendfile(SF_NOCACHE).
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20986
I would like to use the name vm_page_release() for a different purpose,
and vm_page_{import,release}() are local to vm_page.c.
Reviewed by: kib
MFC after: 1 week
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case. Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20859
Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.
Reviewed by: dougm, kib
Discussed with: alc, jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20858
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase. In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break. Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.
The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.
Reviewed by: alc
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20763
Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.
This is a step towards an implementation of an atomic reference counter
for each physical page structure.
Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758
The listed rules were incomplete and outdated. There is a much more
comprehensive comment in vm_page.h.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20503
as part of a false start toward fine-grained reservation locking. In the
end, they were not needed, so eliminate them.
Order the parameters to vm_reserv_alloc_{contig,page}() consistently with
the vm_page functions that call them.
Update the comments about the locking requirements for
vm_reserv_alloc_{contig,page}(). They no longer require a free page
queues lock.
Wrap several lines that became too long after the "req" and "domain"
parameters were added to vm_reserv_alloc_{contig,page}().
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20492
Use it instead of accessing the wire_count field directly. No
functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time. The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.
The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.
One beneficiary of this change is in breaking reservations. For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.
Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901
This is a step towards being able to free pages without the page
lock held. The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held. Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.
No functional change intended.
Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19065
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order. Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued. Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.
Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19039
two zones sharing a keg may have different limits. Now this is going
to work:
zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);
Kegs no longer have uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to minimum possible CPU bucket cache size.
For small limits bucket cache can also be reconfigured to be smaller.
Counter uz_items is updated whenever items transition from keg to a
bucket cache or directly to a consumer. If zone has uz_maxitems set and
it is reached, then we are going to sleep.
o Since new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago, and never
have had a practical usage. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget(), or whatever and choice should be controlled
by the caller.
o Sleeping code is improved to account number of sleepers and wake them one
by one, to avoid thundering herd problem.
o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
KPI added. Having no bucket cache basically means setting maxcache to 0.
o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.
Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773