Commit Graph

814 Commits

Author SHA1 Message Date
Mark Johnston
d307bdcc2c Further constrain the use of per-CPU caches for free pages.
In low memory conditions a significant number of pages may end up stuck
in the caches, and currently these caches cannot be reaped, leading to
spurious memory allocation failures and OOM kills.  So:

- Take into account the fact that we may cache up to two full buckets
  of pages per CPU, not just one.
- Increase the amount of RAM required per CPU to enable the caches.

This is a temporary measure until the page cache management policy is
improved.

PR:		241048
Reported and tested by:	Kevin Oberman <rkoberman@gmail.com>
Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22040
2019-10-18 17:36:42 +00:00
Mark Johnston
01cef4caa7 Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore().  Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21823
2019-10-16 22:03:27 +00:00
Jeff Roberson
fff5403f84 (5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the
object lock in vm_page_set_validclean().

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21595
2019-10-15 03:48:22 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Jeff Roberson
205be21d99 (3/6) Add a shared object busy synchronization mechanism that blocks new page
busy acquires while held.

This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held.  This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.

Reviewed by:    kib
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21592
2019-10-15 03:41:36 +00:00
Jeff Roberson
8da1c09853 (2/6) Don't release xbusy in vm_page_remove(), defer to vm_page_free_prep().
This persists busy state across operations like rename and replace.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:  https://reviews.freebsd.org/D21549
2019-10-15 03:38:02 +00:00
Jeff Roberson
63e9755548 (1/6) Replace busy checks with acquires where it is trival to do so.
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21548
2019-10-15 03:35:11 +00:00
Leandro Lupori
0ecc478b74 [PPC64] Initial kernel minidump implementation
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).

Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D21551
2019-10-14 13:04:04 +00:00
Mark Johnston
4090e2170d Assert that the PGA_{WRITEABLE,EXECUTABLE} flags do not leak.
Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21783
2019-10-07 23:31:17 +00:00
Mateusz Guzik
7b1fbc424a vm: stop trylocking page queues in vm_page_pqbatch_submit
About 11 minutes of poudriere -s -j 104 and probing on return value of
trylocks reveals that over 10% of attempts fail, which in turn means
there are more atomics performed than necessary.

Trylocking was there to try preventing migration, but it's not very likely
to happen if the lock is uncontested.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21925
2019-10-07 23:19:09 +00:00
Mark Johnston
7cc833c598 Fix a race in vm_page_swapqueue().
vm_page_swapqueue() atomically transitions a page between queues.  To do
so, it must hold the page queue lock for the old queue.  However, once
the queue index has been updated, the queue lock no longer protects the
page's queue state.  Thus, we must speculatively remove the page from
the old queue before committing the queue state update, and roll back if
the update fails.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21791
2019-09-27 16:46:08 +00:00
Mark Johnston
2b93f779d2 Add some counters for per-VM page events.
For now, just count batched page queue state operations.
vm.stats.page.queue_ops counts the number of batch entries that
successfully completed, while queue_nops counts entries that had no
effect, which occurs when the queue operation had been completed before
the batch entry was processed.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21782
2019-09-25 17:08:35 +00:00
Mark Johnston
923da43e7c Fix a race in vm_page_dequeue_deferred_free() after r352110.
This function loaded the page's queue index before setting PGA_DEQUEUE.
In this window the page daemon may have deactivated the page, updating
its queue index.  Make the operation atomic using vm_page_pqstate_cmpset();
the page daemon will not modify the page once it observes that PGA_DEQUEUE
is set.

Reported and tested by:	pho
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:12:49 +00:00
Mark Johnston
47aef898ea Fix a page leak in vm_page_reclaim_run().
After r352110 the attempt to remove mappings of the page being replaced
may fail if the page is wired.  In this case we must free the replacement
page.

Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:09:31 +00:00
Mark Johnston
e8bcf6966b Revert r352406, which contained changes I didn't intend to commit. 2019-09-16 15:04:45 +00:00
Mark Johnston
41fd4b9422 Fix a couple of nits in r352110.
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:03:12 +00:00
Jeff Roberson
c75757481f Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
 - vm_page_grab_valid() will grab and page in if necessary.
 - vm_page_busy_acquire() automates some busy acquire loops.

Discussed with:	alc, kib, markj
Tested by:	pho (part of larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21546
2019-09-10 19:08:01 +00:00
Jeff Roberson
4cdea4a853 Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state.  The object lock is still used as an interlock
to ensure that the identity stays valid.  Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by:	alc, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21255
2019-09-10 18:27:45 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Mark Johnston
7cdeaf3309 Add preliminary support for atomic updates of per-page queue state.
Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page.  Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.

Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon.  In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks.  However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required.  vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.

Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset().  Use it in the active queue scan, which
does not use the object lock.  Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21257
2019-09-03 14:29:58 +00:00
Mark Johnston
9d75f0dc75 Map the vm_page array into KVA on amd64.
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose.  One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.

To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21491
2019-09-03 13:18:51 +00:00
Mark Johnston
a70e17eeca Fix a few nits in vm_pqbatch_process_page().
- Don't bother masking off non-queue state flags when loading the
  page's atomic state, since it is only required for one of the
  function's assertions.  Update the assertion instead.
- Remove an incorrect comment regarding synchronization with the
  page daemon.  The page daemon only ever checks for PGA_ENQUEUED
  with the page queue lock held.
- When clearing requeue flags, only clear the flags that have been
  acted upon.

Reviewed by:	kib (previous version)
Discussed with:	alc
Tested by:	pho (part of a larger patch)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21368
2019-08-26 20:20:10 +00:00
Mark Johnston
f93670b7b9 Stop clearing page flags in vm_page_pqbatch_submit().
All existing callers guarantee that the page does not have a
pre-existing dequeue pending.  Thus, if the page is dequeued before
pqbatch_submit() acquires the page queue lock, we do not need to do
anything since vm_page_dequeue_complete() takes care of clearing all
page queue state flags for us.

With this change, vm_page_pqbatch_submit() has the nice property that it
does not directly modify any fields in the page structure.

Reviewed by:	alc, kib
Tested by:	pho (part of a larger change)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21372
2019-08-23 19:53:11 +00:00
Mark Johnston
386eba08bd Make vm_pqbatch_submit_page() externally visible.
It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure.  Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent.  No functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21369
2019-08-23 19:49:29 +00:00
Mark Johnston
8b90607f20 Simplify vm_page_dequeue() and fix an assertion.
- Add a vm_pagequeue_remove() function to physically remove a page
  from its queue and update the queue length.
- Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue()
  return NULL for dequeued pages.
- Avoid unnecessarily reloading the queue index if vm_page_dequeue()
  loses a race with a concurrent queue operation.
- Correct an always-true assertion: vm_page_dequeue() may be called
  from the page allocator with the page unlocked.  The assertion
  m->order == VM_NFREEORDER simply tests whether the page has been
  removed from the vm_phys free lists; instead, check whether the
  page belongs to an object.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21341
2019-08-21 16:11:12 +00:00
Jeff Roberson
3e5e1b5135 Allocate amd64's page array using pages and page directory pages from the
NUMA domain that the pages describe.  Patch original from gallatin.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21252
2019-08-18 23:07:56 +00:00
Jeff Roberson
b7565d44df Encapsulate phys_avail manipulation in a set of simple routines. Add a
NUMA aware boot time memory allocator that will be used to allocate early
domain correct structures.  Code partially submitted by gallatin.

Reviewed by:	gallatin, kib
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21251
2019-08-18 07:06:31 +00:00
Konstantin Belousov
245139c69d Fix OOM handling of some corner cases.
In addition to pagedaemon initiating OOM, also do it from the
vm_fault() internals.  Namely, if the thread waits for a free page to
satisfy page fault some preconfigured amount of time, trigger OOM.
These triggers are rate-limited, due to a usual case of several
threads of the same multi-threaded process to enter fault handler
simultaneously.  The faults from pagedaemon threads participate in the
calculation of OOM rate, but are not under the limit.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13671
2019-08-16 09:43:49 +00:00
Mark Johnston
98549e2dc6 Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page.  Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O.  This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20986
2019-07-29 22:01:28 +00:00
Mark Johnston
b16e57a6c9 Rename vm_page_{import,release}() to vm_page_zone_{import,release}().
I would like to use the name vm_page_release() for a different purpose,
and vm_page_{import,release}() are local to vm_page.c.

Reviewed by:	kib
MFC after:	1 week
2019-07-20 18:25:41 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Mark Johnston
46736e306c Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set.
Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case.  Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.

Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20859
2019-07-08 19:02:40 +00:00
Mark Johnston
d9a73522e3 Add a per-CPU page cache per VM free pool.
Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.

Reviewed by:	dougm, kib
Discussed with:	alc, jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20858
2019-07-08 18:56:30 +00:00
Mark Johnston
9f74cdbf78 Mark pages allocated from the per-CPU cache.
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase.  In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break.  Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.

The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.

Reviewed by:	alc
Discussed with:	jeff
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20763
2019-07-02 19:51:40 +00:00
Mark Johnston
0fd977b3fa Add a return value to vm_page_remove().
Use it to indicate whether the page may be safely freed following
its removal from the object.  Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20758
2019-06-26 17:37:51 +00:00
Mark Johnston
ee1f168540 Group vm_page_activate()'s definition with other related functions.
No functional change intended.

MFC after:	3 days
2019-06-19 21:36:00 +00:00
Mark Johnston
2d2748710a Remove an outdated header comment for vm_page.c.
The listed rules were incomplete and outdated.  There is a much more
comprehensive comment in vm_page.h.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20503
2019-06-04 18:38:27 +00:00
Alan Cox
2d5039db18 Retire vm_reserv_extend_{contig,page}(). These functions were introduced
as part of a false start toward fine-grained reservation locking.  In the
end, they were not needed, so eliminate them.

Order the parameters to vm_reserv_alloc_{contig,page}() consistently with
the vm_page functions that call them.

Update the comments about the locking requirements for
vm_reserv_alloc_{contig,page}().  They no longer require a free page
queues lock.

Wrap several lines that became too long after the "req" and "domain"
parameters were added to vm_reserv_alloc_{contig,page}().

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20492
2019-06-03 05:15:36 +00:00
Mark Johnston
d842aa5114 Add a vm_page_wired() predicate.
Use it instead of accessing the wire_count field directly.  No
functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20485
2019-06-02 01:00:17 +00:00
Doug Moore
b8590dae50 The function vm_phys_free_contig invokes vm_phys_free_pages for every
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time.  The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.

The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.

One beneficiary of this change is in breaking reservations.  For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901
2019-05-31 21:02:42 +00:00
Mark Johnston
42447bb506 Remove a redundant vm_page_remove() call.
vm_page_free_prep() removes the page from its object.  No functional
change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20469
2019-05-31 14:59:40 +00:00
Mark Johnston
3b5b20292b Implement minidump support for RISC-V.
Submitted by:	Mitchell Horne <mhorne063@gmail.com>
Differential Revision:	https://reviews.freebsd.org/D18320
2019-03-06 00:01:06 +00:00
Mark Johnston
1e2b3e6f92 Allow vm_page_free_prep() to dequeue pages without the page lock.
This is a step towards being able to free pages without the page
lock held.  The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held.  Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.

No functional change intended.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19065
2019-02-03 18:43:20 +00:00
Mark Johnston
d0488e698f Fix a race in vm_page_dequeue_deferred().
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order.  Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued.  Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.

Reviewed by:	kib
Tested by:	pho
MFC after:	3 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19039
2019-02-03 18:38:58 +00:00
Gleb Smirnoff
bb15d1c778 o Move zone limit from keg level up to zone level. This means that now
two zones sharing a keg may have different limits. Now this is going
  to work:

  zone = uma_zcreate();
  uma_zone_set_max(zone, limit);
  zone2 = uma_zsecond_create(zone);
  uma_zone_set_max(zone2, limit2);

  Kegs no longer have uk_maxpages field, but zones have uz_items. When
  set, it may be rounded up to minimum possible CPU bucket cache size.
  For small limits bucket cache can also be reconfigured to be smaller.
  Counter uz_items is updated whenever items transition from keg to a
  bucket cache or directly to a consumer. If zone has uz_maxitems set and
  it is reached, then we are going to sleep.

o Since new limits don't play well with multi-keg zones, remove them. The
  idea of multi-keg zones was introduced exactly 10 years ago, and never
  have had a practical usage. In discussion with Jeff we came to a wild
  agreement that if we ever want to reintroduce the idea of a smart allocator
  that would be able to choose between two (or more) totally different
  backing stores, that choice should be made one level higher than UMA,
  e.g. in malloc(9) or in mget(), or whatever and choice should be controlled
  by the caller.

o Sleeping code is improved to account number of sleepers and wake them one
  by one, to avoid thundering herd problem.

o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
  KPI added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
  sure that struct uma_zone is perfectly aligned.

Reviewed by:	markj, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D17773
2019-01-15 00:02:06 +00:00
Gleb Smirnoff
9cc36b3dab Fix regression in r331368, that broke dumping of UMA startup pages
when WITNESS is present.

Discussed with:	markj
2019-01-07 23:17:09 +00:00
Konstantin Belousov
7af4985245 Add 'v' modifier to the ddb 'show pginfo' command to display vm_page
backing the provided kernel virtual address.

Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-30 15:58:18 +00:00
Mark Johnston
e31fc3ab13 Update the free page count when blacklisting pages.
Otherwise the free page count will not accurately reflect the physical
page allocator's state.  On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync.  On 12 the
bug would manifest as threads looping in vm_page_alloc().

PR:		231296
Reported by:	mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by:	alc, kib, mav
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18374
2018-11-29 16:31:01 +00:00
Mark Johnston
920239efde Fix some problems that manifest when NUMA domain 0 is empty.
- In uma_prealloc(), we need to check for an empty domain before the
  first allocation attempt, not after.  Fix this by switching
  uma_prealloc() to use a vm_domainset iterator, which addresses the
  secondary issue of using a signed domain identifier in round-robin
  iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
  excluding empty domains; otherwise we may frequently specify an empty
  domain when calling in to the page allocator, wasting CPU time.
  Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
  total page counts for now: some vm_phys segments are created before
  the SRAT is parsed and are thus always identified as being in domain 0
  even when they are not.  Then, when bootstrap pages are freed, they
  are added to a domain that we had previously thought was empty.  Until
  this is corrected, we simply exclude them from the per-domain page
  count.

Reported and tested by:	Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17704
2018-10-30 17:57:40 +00:00
Mark Johnston
4c29d2de67 Refactor domainset iterators for use by malloc(9) and UMA.
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc".  The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy.  Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.

In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations.  To that end, refactor the vm_dominset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.

Reviewed by:	jeff (previous version)
Tested by:	pho (part of a larger patch)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17417
2018-10-23 16:35:58 +00:00
Mark Johnston
463406ac4a Add more NUMA-specific low memory predicates.
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.

Reviewed by:	alc, jeff, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17278
2018-09-24 19:24:17 +00:00
Mark Johnston
7a364d458a Split some checks in vm_page_activate() to make it easier to read.
No functional change intended.

Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17028
2018-09-10 18:59:23 +00:00
Mark Johnston
5a7f993702 Relax an assertion in vm_pqbatch_process_page().
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page.  In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.

Reviewed by:	alc, kib
Reported and tested by:	pho
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17027
2018-09-08 21:49:43 +00:00
Mark Johnston
23984ce5cd Avoid resource deadlocks when one domain has exhausted its memory. Attempt
other allowed domains if the requested domain is below the minimum paging
threshold.  Block in fork only if all domains available to the forking
thread are below the severe threshold rather than any.

Submitted by:	jeff
Reported by:	mjg
Reviewed by:	alc, kib, markj
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D16191
2018-09-06 19:28:52 +00:00
Mark Johnston
21f01f4584 Remove vm_page_remque().
Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276.  As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.

Reviewed by:	kib
Tested by:	pho
Approved by:	re (marius)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17025
2018-09-06 16:17:45 +00:00
Mark Johnston
899fe184c7 Add a per-pagequeue pdpages counter.
Expose these counters under the vm.domain sysctl node.  The existing
vm.stats.vm.v_pdpages sysctl is preserved.

Reviewed by:	alc (previous version)
Differential Revision:	https://reviews.freebsd.org/D14666
2018-08-23 21:03:45 +00:00
Mark Johnston
99d92d732f Ensure that queue state is cleared when vm_page_dequeue() returns.
Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held.  When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held.  Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.

Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.

While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked().  Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.

Reported and tested by:	pho
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D15980
2018-08-23 20:34:22 +00:00
Andrew Turner
2bf9501287 Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
Konstantin Belousov
a66d7a8ddc Copyout(9) on 4/4 i386 needs correct vm_page_array[].
On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold()
on arbitrary userspace mapping.  If the mapping is backed by the
non-managed cdev pager or by the sg pager, on dense configs we might
access arbitrary element of vm_page_array[], in particular, not
corresponding to a page from the memory segment.  Initialize such pages
as fictitious with the corresponding physical address.

Reported by:	bde
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16085
2018-07-05 16:43:15 +00:00
Alan Cox
89ea39a727 Update the physical page selection strategy used by vm_page_import() so
that it does not cause rapid fragmentation of the free physical memory.

Reviewed by:	jeff, markj (an earlier version)
Differential Revision:	https://reviews.freebsd.org/D15976
2018-06-26 18:29:56 +00:00
Jonathan T. Looney
16e05b3275 Fix a typo in vm_domain_set(). When a domain crosses into the severe range,
we need to set the domain bit from the vm_severe_domains bitset (instead
of clearing it).

Reviewed by:	jeff, markj
Sponsored by:	Netflix, Inc.
2018-06-07 13:29:54 +00:00
Mark Johnston
a99ee60b9a Ensure that "m" is initialized in vm_page_alloc_freelist_domain().
While here, remove a superfluous comment.

Coverity CID:	1383559
MFC after:	3 days
2018-05-22 16:19:48 +00:00
Mark Johnston
ba2b3349e1 Fix a race in vm_page_pagequeue_lockptr().
The value of m->queue must be cached after comparing it with PQ_NONE,
since it may be concurrently changing.

Reported by:	glebius
Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D15462
2018-05-17 04:27:08 +00:00
Mark Johnston
1b5c869d64 Fix some races introduced in r332974.
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15280
2018-05-04 17:17:30 +00:00
Mark Johnston
5cd29d0f3c Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
Mark Johnston
64b3893010 Initialize marker pages in vm_page_domain_init().
They were previously initialized by the corresponding page daemon
threads, but for vmd_inacthead this may be too late if
vm_page_deactivate_noreuse() is called during boot.

Reported and tested by:	cperciva
Reviewed by:	alc, kib
MFC after:	1 week
2018-04-19 14:09:44 +00:00
Mark Johnston
9de8fcfddf Ensure that m and skip_m belong to the same object.
Pages allocated from a given reservation may belong to different
objects. It is therefore possible for vm_page_ps_test() to be called
with the base page's object unlocked. Check for this case before
asserting that the object lock is held.

Reported by:	jhb
Reviewed by:	kib
MFC after:	1 week
2018-04-17 18:49:17 +00:00
Konstantin Belousov
e55d32b7b3 Handle Skylake-X errata SKZ63.
SKZ63 Processor May Hang When Executing Code In an HLE Transaction
Region

Problem: Under certain conditions, if the processor acquires an HLE
(Hardware Lock Elision) lock via the XACQUIRE instruction in the Host
Physical Address range between 40000000H and 403FFFFFH, it may hang
with an internal timeout error (MCACOD 0400H) logged into
IA32_MCi_STATUS.

Move the pages from the range into the blacklist.  Add a tunable to
not waste 4M if local DoS is not the issue.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15001
2018-04-07 17:06:13 +00:00
Jeff Roberson
c33e3a642b Add a uma cache of free pages in the DEFAULT freepool. This gives us
per-cpu alloc and free of pages.  The cache is filled with as few trips
to the phys allocator as possible by the use of a new
vm_phys_alloc_npages() function which allocates as many as N pages.

This code was originally by markj with the import function rewritten by
me.

Reviewed by:	markj, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14905
2018-04-01 04:50:05 +00:00
Jeff Roberson
e5818a53db Implement several enhancements to NUMA policies.
Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.

Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user.  This
also eliminates the need for the complicated interrupt binding code.

Add a sysctl API for viewing and manipulating domainsets.  Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both.  This probably belongs in a dedicated subr file.

Attempt to improve the include situation.

Reviewed by:	kib
Discussed with:	jhb (cpuset parts)
Tested by:	pho (before review feedback)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14839
2018-03-29 02:54:50 +00:00
Jeff Roberson
2d3f4181de Fix two compliation problems on non-amd64 architectures. 2018-03-23 18:24:02 +00:00
Mark Johnston
4046851367 Correct a couple of assertion messages in vm_page_reclaim_run().
MFC after:	3 days
2018-03-23 14:38:56 +00:00
Jeff Roberson
5c930c894d Lock reservations with a dedicated lock in each reservation. Protect the
vmd_free_count with atomics.

This allows us to allocate and free from reservations without the free lock
except where a superpage is allocated from the physical layer, which is
roughly 1/512 of the operations on amd64.

Use the counter api to eliminate cache conention on counters.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14707
2018-03-22 19:21:11 +00:00
Jeff Roberson
9a4b4cd3bc Start witness much earlier in boot so that we can shrink the pend list and
make it more immune to further change.

Reviewed by:	markj, imp (Part of D14707)
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-22 19:11:43 +00:00
Mark Johnston
0eb50f9cd2 Have vm_page_{deactivate,launder}() requeue already-queued pages.
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by:	alc
Tested by:	pho
MFC after:	2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:40:56 +00:00
Mark Johnston
434862acb1 Have vm_page_replace() assert that the new page is not enqueued.
The new page does not belong to a VM object, but the page daemon does
not expect to encounter such pages.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 week
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:35:40 +00:00
Jeff Roberson
30fbfdda6c Eliminate pageout wakeup races. Take another step towards lockless
vmd_free_count manipulation.  Reduce the scope of the free lock by
using a pageout lock to synchronize sleep and wakeup.  Only trigger
the pageout daemon on transitions between states.  Drive all wakeup
operations directly as side-effects from freeing memory rather than
requiring an additional function call.

Reviewed by:	markj, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14612
2018-03-15 19:23:07 +00:00
Konstantin Belousov
741e1c9196 Revert the chunk from r330410 in vm_page_reclaim_run().
There, the pages freed might be managed but the page's lock is not
owned.  For KPI correctness, the page lock is requried around the call
to vm_page_free_prep(), which is asserted.  Reclaim loop already did
the work which could be done by vm_page_free_prep(), so the lock is
not needed and the only consequence of not owning it is the assert
trigger.

Instead of adding the locking to satisfy the assert, revert to the
code that calls vm_page_free_phys() directly.

Reported by:	pho
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-13 18:27:23 +00:00
Konstantin Belousov
2a8e8f7892 Remove redundant test from r330410.
If the input slist is non-empty, counter cannot be zero after freeing.

Noted by:	mjg
MFC after:	2 weeks
2018-03-04 21:15:31 +00:00
Konstantin Belousov
8c8ee2ee1c Unify bulk free operations in several pmaps.
Submitted by:	Yoshihiro Ota
Reviewed by:	markj
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13485
2018-03-04 20:53:20 +00:00
Mark Johnston
9140bff7ed Remove a bogus assertion from vm_page_launder().
After r328977, a wired page m may have m->queue != PQ_NONE.

Reviewed by:	kib
X-MFC with:	r328977
Differential Revision:	https://reviews.freebsd.org/D14485
2018-02-23 23:25:22 +00:00
Jeff Roberson
5f8cd1c0bf Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload.  This is a reimplementation of work done by myself and mlaier at
Isilon.

Reviewed by:	bsdimp
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14402
2018-02-23 22:51:51 +00:00
Konstantin Belousov
2c0f13aa59 vm_wait() rework.
Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass.  If there is no object
associated with the wait, use curthread' policy domainset.  The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.

Eliminate pagedaemon_wait().  vm_domain_clear() handles the same
operations.

Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Scetched and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
Differential revision:	https://reviews.freebsd.org/D14384
2018-02-20 10:13:13 +00:00
Jeff Roberson
e958ad4cf3 Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
Gleb Smirnoff
f7d3578564 Fix boot_pages exhaustion on machines with many domains and cores, where
size of UMA zone allocation is greater than page size. In this case zone
of zones can not use UMA_MD_SMALL_ALLOC, and we  need to postpone switch
off of this zone from startup_alloc() until full launch of VM.

o Always supply number of VM zones to uma_startup_count(). On machines
  with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
  a page. In the latter case account VM zones for number of allocations
  from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch off from
  itself any zone that is already capable of running real alloc.
  In worst case scenario we may leak a single page here. See comment
  in uma_startup_count().
o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some
  extra SYSINITs, e.g. vm_page_init() may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
  pages calculation, we are guaranteed to use all of the boot_pages
  in the early single threaded stage.

Reported & tested by:	mav
2018-02-09 04:45:39 +00:00
Gleb Smirnoff
5073a08328 Fix three miscalculations in amount of boot pages:
o Most of startup zones have struct uma_slab embedded into the slab,
  so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE,
  when calculating how many pages would certain kind of allocations
  require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
  need +1 when calculating amount of pages for kegs. [1]
o The zones of zones and zones of kegs have arbitrary alignment of 32,
  and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by:	pho [1], jtl [2]
2018-02-07 18:32:51 +00:00
Mark Johnston
1d3a1bcfac Dequeue wired pages lazily.
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by:	jeff
Discussed with:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D11943
2018-02-07 16:57:10 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Gleb Smirnoff
ae941b1b4e Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC.
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include 8 eight VM startup pages into uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for extra two allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT
  allowed several other SYSINITs to sneak in before it, thus bumping
  requirement for amount of boot pages.
2018-02-06 22:06:59 +00:00
Gleb Smirnoff
f4bef67c9c Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
  vm_page_startup() finishes. After this stage we don't yet enable buckets,
  but we can ask VM for pages. Rename stages to meaningful names while here.
  New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
  BOOT_RUNNING.
  Enabling page alloc earlier allows us to dramatically reduce number of
  boot pages required. What is more important number of zones becomes
  consistent across different machines, as no MD allocations are done before
  the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
  startup_alloc(), however that may change, so vm_page_startup() provides
  its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
  functions calculates sizes of zones zone and kegs zone, and calculates how
  many pages UMA will need to bootstrap.
  It counts not only of zone structures, but also of kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
  mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
  but also as args.size.

Reviewed by:		imp, gallatin (earlier version)
Differential Revision:	https://reviews.freebsd.org/D14054
2018-02-06 04:16:00 +00:00
Nathan Whitehorn
9a8196ce19 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
Jeff Roberson
3f289c3fcf Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
Mark Johnston
280d15cd0a Fix two problems with the page daemon control loop.
Both issues caused the page daemon to erroneously go to sleep when
applications are consuming free pages at a high rate, leaving the
application threads blocked in VM_WAIT.

1) After completing an inactive queue scan, concurrent allocations may
   have prevented the page daemon from meeting the v_free_min threshold.
   In this case, the page daemon was going to sleep even when the
   inactive queue contained plenty of clean pages.
2) pagedaemon_wakeup() may be called without the free queues lock held.
   This can lead to a lost wakeup if a call occurs after the page daemon
   clears vm_pageout_wanted but before going to sleep.

Fix 1) by ensuring that we start a new inactive queue scan immediately
if v_free_count < v_free_min after a prior scan.

Fix 2) by adding a new subroutine, pagedaemon_wait(), called from
vm_wait() and vm_waitpfault(). It wakes up the page daemon if either
vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps
on v_free_count.

Reported by:	jeff
Reviewed by:	alc
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D13424
2017-12-24 19:45:16 +00:00
Michael Zhilin
0db2102aaa [mips] [vm] restore translation of freelist to flind for page allocation
Commit r326346 moved domain iterators from physical layer to vm_page one,
but it also removed translation of freelist to flind for
vm_page_alloc_freelist() call. Before it expects VM_FREELIST_ parameter,
but after it expect freelist index.

On small WiFi boxes with few megabytes of RAM, there is only one freelist
VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file
sys/mips/include/vmparam.h). It results in freelist 1 with flind 0.

At first, this commit renames flind to freelist in vm_page_alloc_freelist
to avoid misunderstanding about input parameters. Then on physical layer it
restores translation for correct handling of freelist parameter.

Reported by:	landonf
Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D13351
2017-12-04 08:08:55 +00:00
Pedro F. Giffuni
796df753f4 SPDX: Consider code from Carnegie-Mellon University.
Interesting cases, most likely from CMU Mach sources.
2017-11-30 15:48:35 +00:00
Mark Johnston
1084894f80 Remove some comments that became incorrect with r325530. 2017-11-29 14:34:05 +00:00
Jeff Roberson
ef435ae7de Move domain iterators into the page layer where domain selection should take
place.  This makes the majority of the phys layer explicitly domain specific.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix & Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13014
2017-11-28 23:18:35 +00:00
Mark Johnston
d2b677cef6 Avoid unnecessary lookups when initializing the vm_page array.
This gives a marginal improvement in the vm_page_array initialization
time. Also garbage-collect the now-unused vm_phys_paddr_to_segind().

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13270
2017-11-27 17:46:38 +00:00
Mark Johnston
b20bf182e6 Move vm_phys_init_page() to vm_page.c.
Suggested by:	kib
Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13250
2017-11-26 19:17:55 +00:00
Mark Johnston
5070d56d41 Allow for fictitious physical pages in vm_page_scan_contig().
Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to
satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter
fictitous pages. For now, allow for this possibility; such pages will be
skipped later in the scan since they are wired.

Reported by:	avg
Reviewed by:	kib
MFC after:	1 week
2017-11-21 13:17:40 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Jeff Roberson
8d6fbbb867 Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated.  This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons.  This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by:	alc, kib, markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2017-11-08 02:39:37 +00:00
Alan Cox
3a757e5403 Micro-optimize the handling of fictitious pages in vm_page_free_prep().
A fictitious page is always wired, so there is no point in trying to
remove one from the page queues.

Completely remove one inaccurate comment from vm_page_free_prep() and
correct another.

Reviewed by:	kib, markj
MFC after:	1 week
2017-10-24 17:14:53 +00:00
Konstantin Belousov
422fe502b3 Check that the page which is freed as zeroed, indeed has all-zero content.
This catches some rare mysterious failures at the source.  The check
is only performed on architectures which implement direct map, and
only enabled with option DIAGNOSTIC, similar to other costly
consistency checks.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-10-21 17:28:12 +00:00
Konstantin Belousov
4313989360 In vm_page_free_phys_pglist(), do not take vm_page_queue_free_mtx if
there is nothing to do.

Suggested by:	mjg
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-20 08:25:49 +00:00
Mateusz Guzik
1dbf52e7d9 Reduce traffic on vm_cnt.v_free_count
The variable is modified with the highly contended page free queue lock.
It unnecessarily shares a cacheline with purely read-only fields and is
re-read after the lock is dropped in the page allocation code making the
hold time longer.

Pad the variable just like the others and store the value as found with
the lock held instead of re-reading.

Provides a modest 1%-ish speed up in concurrent page faults.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D12665
2017-10-13 21:54:34 +00:00
Alan Cox
43cc906f40 Change vm_page_try_to_free() to require a managed page. Essentially,
vm_page_try_to_free() is testing conditions, like clean versus dirty,
that only vary in managed pages.

Suggested by:	kib
Reviewed by:	markj
X-MFC after:	never
2017-09-24 23:35:01 +00:00
Alan Cox
494c6e43d3 Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all()
can be avoided when the page's containing object has a reference count of
zero.  (If the object has a reference count of zero, then none of its pages
can possibly be mapped.)

Address nearby style issues in vm_page_try_to_free(), and change its
return type to "bool".

Reviewed by:	kib, markj
MFC after:	1 week
2017-09-24 16:50:10 +00:00
Konstantin Belousov
e82e50e681 Remove inline specifier from vm_page_free_wakeup(), do not
micro-manage compiler.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-13 19:30:09 +00:00
Konstantin Belousov
540ac3b310 Split vm_page_free_toq() into two parts, preparation vm_page_free_prep()
and insertion into the phys allocator free queues vm_page_free_phys().
Also provide a wrapper vm_page_free_phys_pglist() for batched free.

Reviewed by:	alc, markj
Tested by:	mjg (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-13 19:11:52 +00:00
Mark Johnston
2934eb8a22 Fix a logic error in the item size calculation for internal UMA zones.
Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.

Reported and tested by:	ae, rakuco
Reviewed by:	avg
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12342
2017-09-13 15:44:54 +00:00
Konstantin Belousov
93c5d3a46a Add a vm_page_change_lock() helper, the common code to not relock page
lock if both old and new pages use the same underlying lock.  Convert
existing places to use the helper instead of inlining it.  Use the
optimization in vm_object_page_remove().

Suggested and reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-09 17:35:19 +00:00
Mark Johnston
f93f7cf199 Speed up vm_page_array initialization.
We currently initialize the vm_page array in three passes: one to zero
the array, one to initialize the "order" field of each page (necessary
when inserting them into the vm_phys buddy allocator one-by-one), and
one to initialize the remaining non-zero fields and individually insert
each page into the allocator.

Merge the three passes into one following a suggestion from alc:
initialize vm_page fields in a single pass, and use vm_phys_free_contig()
to efficiently insert physical memory segments into the buddy allocator.
This reduces the initialization time to a third or a quarter of what it
was before on most systems that I tested.

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D12248
2017-09-07 21:43:39 +00:00
Mateusz Guzik
fe933c1d88 Start annotating global _padalign locks with __exclusive_cache_line
While these locks are guarnteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceeded them.

For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separate by the lock) to be collapsed into 1.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.

MFC after:	1 week
2017-09-06 20:28:18 +00:00
Mark Johnston
33fff5d536 Add vm_page_alloc_after().
This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with largest index that is smaller than the requested
index. vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.

Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().

Suggested by:	alc
Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11984
2017-08-15 16:39:49 +00:00
Mark Johnston
9df950b35d Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.
This will allow its use in sendfile_swapin().

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11942
2017-08-11 16:29:22 +00:00
Mark Johnston
2c642ec1e7 Make vm_page_sunbusy() assert that the page is unlocked.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11946
2017-08-10 22:43:38 +00:00
Alan Cox
5471caf6f1 Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices.  Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by:	kib, markj
Tested by:	pho (an earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11926
2017-08-09 04:23:04 +00:00
Alan Cox
1d3b9818e7 In vm_page_ps_test(), always check that the base pages within the specified
superpage all belong to the same object.  To date, that check has not been
needed, but upcoming changes require it.  (See the Differential Revision.)

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 05:54:56 +00:00
Alan Cox
8830260128 Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().

Reviewed by:	kib, markj
MFC after:	1 week
2017-07-14 02:15:48 +00:00
John Baldwin
4bd7e351f1 Fix an off-by-one error in the VM page array on some systems.
r31386 changed how the size of the VM page array was calculated to be
less wasteful.  For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages.  However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array.  Handle this case by explicitly excluding the page from
phys_avail[].

Reviewed by:	alc
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D11000
2017-06-08 16:18:41 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Alan Cox
8a99f1cc59 Over the years, the code and comments in vm_page_startup() have diverged in
one respect.  When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory.  This revision
changes the code to match the comments.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9081
2017-02-04 05:23:10 +00:00
Mark Johnston
c2655a40a7 Avoid unnecessary page lookups in vm_object_madvise().
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.

While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9098
2017-01-15 03:50:08 +00:00
Gleb Smirnoff
bfc8c24c73 Move bogus_page declaration to vm_page.h and initialization to vm_page.c.
Reviewed by:	kib
2017-01-04 22:27:19 +00:00
Mark Johnston
b1fd102ee7 Add a page queue for holding dirty anonymous unswappable pages.
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by:	alc
2017-01-03 00:05:44 +00:00
Konstantin Belousov
0c8bd6a7d8 Assert that the pages found on the object queue by vm_page_next() and
vm_page_prev() have correct ownership.

In collaboration with:	alc
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2016-12-30 17:37:06 +00:00
Alan Cox
920da7e4d2 Relax the object type restrictions on vm_page_alloc_contig(). Specifically,
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.

Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc().  Also, eliminate the radix trie lookup performed with the
free page queues lock held.

Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8878
2016-12-28 18:32:13 +00:00
Alan Cox
3453bca864 Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system.  Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8752
2016-12-12 17:47:09 +00:00
Alan Cox
e94965d82e Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index.  However, with this
change, vm_radix_remove() no longer panics.  Instead, it returns NULL
if there is no vm_page_t at the specified index.  Otherwise, it
returns the vm_page_t.  The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations.  Instead of performing a lookup before every remove,
the pmap can simply perform the remove.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D8708
2016-12-08 04:29:29 +00:00
Alan Cox
ba67369628 Recursion on the free page queue mutex occurred when UMA needed to allocate
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache().  Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc().  This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.

Improve nearby comments.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8628
2016-11-27 01:42:53 +00:00
Alan Cox
bba39b9ae3 Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used.  More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8583
2016-11-22 18:13:46 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Alan Cox
ebcddc7217 Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specificially, dirty pages that have passed once through the inactive
queue.  A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them.  The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time.  In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low.  Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system.  Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages.  This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s).  A request is raised by setting the variable
vm_laundry_request and waking the laundry thread.  We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering").  When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering.  If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s.  Otherwise, the laundry thread goes back
to sleep without doing any work.  When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target.  In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced.  In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache().  The new meaning
of vm_cnt.v_reactivated now better reflects its name.  It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with:	markj
Reviewed by:	kib
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8302
2016-11-09 18:48:37 +00:00
Konstantin Belousov
1771e987ca Do not sleep in vm_wait() if pagedaemon did not yet started. Panic instead.
Requests which cannot be satisfied by allocators at boot time often
have unrealizable parameters.  Waiting for the pagedaemon' start would
hang the boot if done in the thread0 context and just never succeed if
executed from another thread.  In fact, for very early stages, sleep
attempt panics with obscure diagnostic about the scheduler state, and
explicit panic in vm_wait() makes the investigation much shorter by
cut off the examination of the thread and scheduler.

Theoretically, some subsystem might grab a resource to exhaustion, and
free it later in the boot process.  If this unlikely scenario does
appear for real, the way to diagnose the trouble can be revisited.

Reported by:	emaste
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8421
2016-11-04 12:58:50 +00:00
Konstantin Belousov
bd9546a21c Export vm_page_xunbusy_maybelocked().
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D8197
2016-10-17 08:14:23 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Konstantin Belousov
267ed8e2f7 When downgrading exclusively busied page to shared-busy state, wakeup
waiters.  Otherwise, owners of the shared-busy state are left blocked
and might get into a deadlock.

Note that the vm_page_busy_downgrade() function is not used in the
tree right now.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8195
2016-10-11 18:09:37 +00:00
Alan Cox
70cf3ced3c Make the page daemon's notion of what kind of pass is being performed
by vm_pageout_scan() local to vm_pageout_worker().  There is no reason
to store the pass in the NUMA domain structure.

Reviewed by:	kib
MFC after:	3 weeks
2016-10-05 17:32:06 +00:00
Mark Johnston
dbbaf04f1e Remove support for idle page zeroing.
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by:	alc, imp, kib
Differential Revision:	https://reviews.freebsd.org/D7714
2016-09-03 20:38:13 +00:00
Mark Johnston
915d1b71cd Restore swap pager readahead after r292373.
The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7677
2016-08-30 05:56:21 +00:00
Mark Johnston
842ee21e20 Strengthen assertions about the busy state of newly-allocated pages.
Reviewed by:	alc
MFC after:	1 week
2016-08-13 19:49:32 +00:00
Mark Johnston
897d0c6617 Use vm_page_undirty() instead of manually setting a page field.
Reviewed by:	alc
MFC after:	3 days
2016-07-29 21:05:37 +00:00
Mark Johnston
efe1ff4cf0 Update a comment in vm_page_advise() to match behaviour after r290529.
Reviewed by:	alc
MFC after:	3 days
2016-07-23 21:02:36 +00:00
Colin Percival
34caa842a4 Autotune the number of pages set aside for UMA startup based on the number
of CPUs present.  On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.

The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory:  UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg.  If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).

This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64).  Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger.  It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.

The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.

Tested on:	EC2 x1.32xlarge
Relnotes:	FreeBSD can now boot on 92+ CPU systems without requiring
		vm.boot_pages to be manually adjusted.
Reviewed by:	jeff, alc, adrian
Approved by:	re (kib)
2016-07-07 18:37:12 +00:00
Konstantin Belousov
35e8002c58 In vm_page_xunbusy_maybelocked(), add fast path for unbusy when no
waiters exist, same as for vm_page_xunbusy().  If previous value of
busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is
not needed.

Move common code from vm_page_xunbusy_maybelocked() and
vm_page_xunbusy_hard() to vm_page_xunbusy_locked().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-23 08:28:13 +00:00
Mark Johnston
0a1dc6e23c Reset the page busy lock state after failing to insert into the object.
Freeing a shared-busy page is not permitted.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D6670
2016-06-02 17:11:24 +00:00
Mark Johnston
e705296958 Don't preserve the page's object linkage in vm_page_insert_after().
Per the KASSERT at the beginning of the function, we expect that the page
does not belong to any object, so its object and pindex fields are
meaningless. Reset them in the rare case that vm_radix_insert() fails.

Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D6669
2016-06-02 16:58:47 +00:00
Konstantin Belousov
e5f0191f20 If the fast path unbusy in vm_page_replace() fails, slow path needs to
acquire the page lock, which recurses.  Avoid the recursion by reusing
the code from vm_page_remove() in a new helper
vm_page_xunbusy_maybelocked().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2016-06-01 20:39:00 +00:00