Commit Graph

432 Commits

Author SHA1 Message Date
Konstantin Belousov
a823302783 Allow to create swap zone larger than v_page_count / 2.
If user configured the maxswapzone tunable, just take the literal
value for the initial zone sizing attempt.  Before, it was only
possible to reduce the zone by the tunable.

While there, correct the message which was not correct when zone
creation rounded the size up.

Reported by:	jmg
Reviewed by:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18381
2018-12-01 16:50:12 +00:00
Alan Cox
541a117532 Use swp_pager_isondev() throughout. Submitted by: ota@j.email.ne.jp
Change swp_pager_isondev()'s return type to bool.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16712
2018-11-19 17:17:23 +00:00
Mark Johnston
150d384e5c Fix a use-after-free in swp_pager_meta_free().
This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934.  In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed.  After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17897
2018-11-07 23:28:11 +00:00
Matt Macy
e8bb589d56 eliminate locking surrounding ui_vmsize and swap reserve by using atomics
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.

Results in mmap speed up and a 70% increase in brk calls / second.

Reviewed by:	alc@, markj@, kib@
Approved by:	re (delphij@)
Differential Revision:	https://reviews.freebsd.org/D16273
2018-10-05 05:50:56 +00:00
Alan Cox
f5fbe90de4 Passing UMA_ZONE_NOFREE to uma_zcreate() for swpctrie_zone and swblk_zone is
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone.  (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)

Reviewed by:	kib, markj
Approved by:	re (gjb)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17296
2018-09-24 16:49:02 +00:00
Alan Cox
78f1deeffe Defer and aggregate swap_pager_meta_build frees.
Before swp_pager_meta_build replaces an old swapblk with an new one,
it frees the old one.  To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.

Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib, markj (an earlier version)
Tested by:	pho
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13707
2018-08-08 02:30:34 +00:00
Konstantin Belousov
b7b8a09658 Handle the race between fork/vm_object_split() and faults.
If fault started before vmspace_fork() locked the map, and then during
fork, vm_map_copy_entry()->vm_object_split() is executed, it is
possible that the fault instantiate the page into the original object
when the page was already copied into the new object (see
vm_map_split() for the orig/new objects terminology). This can happen
if split found a busy page (e.g. from the fault) and slept dropping
the objects lock, which allows the swap pager to instantiate
read-behind pages for the fault.  Then the restart of the scan can see
a page in the scanned range, where it was already copied to the upper
object.

Fix it by instantiating the read-ahead pages before
swap_pager_getpages() method drops the lock to allocate pbuf.  The
object scan would see the whole range prefilled with the busy pages
and not proceed the range.

Note that vm_fault rechecks the map generation count after the object
unlock, so that it restarts the handling if raced with split, and
re-lookups the right page from the upper object.

In collaboration with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-06-14 19:41:02 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Mark Johnston
3f060b60b1 Use the conventional name for an array of pages.
No functional change intended.

Discussed with:	kib
MFC after:	3 days
2018-02-16 15:38:22 +00:00
Jeff Roberson
e958ad4cf3 Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Alan Cox
4abca9bb05 Previously, swap_pager_copy() freed swap blocks one at at time, via
swp_pager_meta_ctl(), with no opportunity to recognize freeing of
consecutive blocks and free fewer block ranges.  To open that opportunity,
this change removes the SWM_FREE option from swp_pager_meta_ctl(), and
compels the caller to do the freeing when a valid block address is returned.
In swap_pager_copy(), these frees are aggregated, so that a sequence of them
can be done at one time.

The only other caller to swp_pager_meta_ctl() that passed SWM_FREE,
swp_pager_unswapped(), is also modified to handle its single free
explicitly.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13290
2017-12-31 04:01:47 +00:00
Alan Cox
230869e051 When the swap pager allocates space on disk, it requests contiguous
blocks in a single call to blist_alloc().  However, when it frees
that space, it previously called blist_free() on each block, one at a
time.  With this change, the swap pager identifies ranges of
contiguous blocks to be freed, and calls blist_free() once per
range.  In one extreme case, that is described in the review, the time
to perform an munmap(2) was reduced by 55%.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12397
2017-11-28 17:46:03 +00:00
Pedro F. Giffuni
df57947f08 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
Jeff Roberson
8d6fbbb867 Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated.  This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons.  This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by:	alc, kib, markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2017-11-08 02:39:37 +00:00
Edward Tomasz Napierala
be7d4ac586 Add OID for the vm.overcommit sysctl. This makes it possible to remove
one call to sysctl(2) from jemalloc startup code. (That also requires
changes to jemalloc, but I plan to push those to upstream first.)

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D12745
2017-10-22 10:35:29 +00:00
Konstantin Belousov
1fffcd755d Do not report reduction of swap zone if it was not.
After r324600 we see the actual reservation.

Reported by:	jkim
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-18 07:27:43 +00:00
Konstantin Belousov
53faf5a7d4 Evaluate the real size of the sblk_zone.
Submitted by:	ota@j.email.ne.jp
PR:	221356
Reviewed by:	alc, markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D12660
2017-10-13 16:23:05 +00:00
Alan Cox
37244a84fd Replace an unnecessary call to vm_page_activate() by an assertion that
the page is already wired or queued.  Prior to the elimination of PG_CACHED
pages, vm_page_grab() might have returned a valid, previously PG_CACHED
page, in which case enqueueing the page was necessary.  Now, that can't
happen.  Moreover, activating the page is a dubious choice, since the page
is not being accessed.

Reviewed by:	kib
MFC after:	1 week
2017-10-08 16:54:42 +00:00
Alan Cox
41e5a22698 When an I/O error occurs on page out, there is no need to dirty the page,
because it is already dirty.  Instead, assert that the page is dirty.

Reviewed by:	kib, markj
MFC after:	1 week
2017-10-01 17:04:26 +00:00
Alan Cox
d027ed2e7a To analyze the allocation of swap blocks by blist functions, add a method
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks.  The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers.  The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.

The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11906
2017-09-10 17:46:03 +00:00
Konstantin Belousov
85d88d8799 Do not leak empty swblk.
In swp_pager_meta_build(), if the requested operation results in
freeing the last swap pointer in the swblk, free the trie node.  Other
swap pager code does not expect to find completely empty swblk.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-06 16:18:53 +00:00
Konstantin Belousov
eed99cb81b In swp_pager_meta_build(), handle a race with other thread allocating
swapblk for our index while we dropped the object lock.

Noted by:	jeff
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-06 16:16:11 +00:00
Konstantin Belousov
35872e79b7 Adjust interface of swapon_check_swzone() to its actual usage.
The function return value is not used.  Its argument is always
swap_total/PAGE_SIZE, so make it not take any arguments.

Submitted by:	ota@j.email.ne.jp
PR:	221356
MFC after:	1 week
2017-08-30 10:17:00 +00:00
Konstantin Belousov
f08b30995a Make the swap_pager_full variable static.
r290920 removed the use of the variable from vm/vm_pageout.c.

Submitted by:	ota@j.email.ne.jp
PR:	221356
MFC after:	1 week
2017-08-30 09:44:05 +00:00
Alan Cox
ee620ea47d Update a couple vm_object lock assertions in the swap pager to reflect the
new use of the vm_object's lock to synchronize updates to a radix trie
mapping per-vm object page indices to on-disk swap blocks.

Fix a typo in a nearby comment.

Reviewed by:	kib, markj
X-MFC with:	r322913
Differential Revision:	https://reviews.freebsd.org/D12134
2017-08-28 17:02:25 +00:00
Konstantin Belousov
f425ab8e50 Replace global swhash in swap pager with per-object trie to track swap
blocks assigned to the object pages.

- The global swhash_mtx is removed, trie is synchronized by the
  corresponding object lock.
- The swp_pager_meta_free_all() function used during object
  termination is optimized by only looking at the trie instead of
  having to search whole hash for the swap blocks owned by the object.
- On swap_pager_swapoff(), instead of iterating over the swhash,
  global object list have to be inspected. There, we have to ensure
  that we do see valid trie content if we see that the object type is
  swap.
Sizing of the swblk zone is same as for swblock zone, each swblk maps
SWAP_META_PAGES pages.

Proposed by:	alc
Reviewed by:	alc, markj (previous version)
Tested by:	alc, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D11435
2017-08-25 23:13:21 +00:00
Konstantin Belousov
9680bb9877 Remove unused function swap_pager_isswapped().
Noted by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-19 17:28:46 +00:00
Alan Cox
e22415906d Increase the pageout cluster size to 32 pages.
Decouple the pageout cluster size from the size of the hash table entry
used by the swap pager for mapping (object, pindex) to a block on the
swap device(s), and keep the size of a hash table entry at its current
size.

Eliminate a pointless macro.

Reviewed by:	kib, markj (an earlier version)
MFC after:	4 weeks
Differential Revision:	https://reviews.freebsd.org/D11305
2017-06-24 17:10:33 +00:00
Alan Cox
3a5d839ebc Eliminate an unused macro.
MFC after:	3 days
2017-06-21 03:55:45 +00:00
Alan Cox
87b0ab69a9 Pages that are passed to swap_pager_putpages() should already be fully
dirty.  Assert that they are fully dirty rather than redundantly calling
vm_page_dirty() on them.

Reviewed by:	kib, markj
MFC after:	1 week
X-MFC after:	r319932
2017-06-17 03:05:25 +00:00
Alan Cox
761097c85e Starting in r118390, swaponsomething() began to reserve the blocks at the
beginning of a swap area for a disk label.  However, neither r118390 nor
r118544, which increased the reservation from one to two blocks, correctly
accounted for these blocks when updating the variable "swap_pager_avail".
This change corrects that error.

Reviewed by:	kib
MFC after:	5 days
2017-06-06 16:52:07 +00:00
Alan Cox
03bdd65f18 When the function blist_fill() was added to the kernel in r107913, the swap
pager used a different scheme for striping the allocation of swap space
across multiple devices.  And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size.  Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected.  Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme.  This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.

Reviewed by:	kib
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11043
2017-06-06 03:32:17 +00:00
Alan Cox
064650c180 Halve the memory being internally allocated by the blist allocator. In
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int.  (See r96849 and
r96851.)

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11028
2017-06-05 17:14:16 +00:00
Alan Cox
07c348ea7b After r118390, the variable "dmmax" was neither the correct strip size
nor the correct maximum block size.  Moreover, after r318995, it serves
no purpose except to provide information to user space through a read-
sysctl.

This change eliminates the variable "dmmax" but retains the sysctl.  It
also corrects the value returned by the sysctl.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 21:46:00 +00:00
Alan Cox
fe71561af2 In r118390, the swap pager's approach to striping swap allocation over
multiple devices was changed.  However, swapoff_one() was not fully and
correctly converted.  In particular, with r118390's introduction of a per-
device blist, the maximum swap block size, "dmmax", became irrelevant to
swapoff_one()'s operation.  Moreover, swapoff_one() was performing out-of-
range operations on the per-device blist that were silently ignored by
blist_fill().

This change corrects both of these problems with swapoff_one(), which will
allow us to potentially increase MAX_PAGEOUT_CLUSTER.  Previously,
swapoff_one() would panic inside of blist_fill() if you increased
MAX_PAGEOUT_CLUSTER.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 16:40:00 +00:00
Konstantin Belousov
6992112349 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Mark Johnston
b1fd102ee7 Add a page queue for holding dirty anonymous unswappable pages.
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by:	alc
2017-01-03 00:05:44 +00:00
Konstantin Belousov
2e56b64fa4 Fix argument type and microoptimize swp_pager_meta_free().
The count argument natural type if vm_pindex_t, but due to the loop
organization, it has to be signed type to detect the termination
condition.  Replace this logic by using distinguished counter for the
processed pages, and terminate loop when the counter exceeds the
argument.

Completely process one swblock for all relevant indexes instead of
doing relookup in hash when incrementing page index on the loop step.

Do not drop hash mutex around iterations.

Noted and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-24 09:57:31 +00:00
Konstantin Belousov
77d6fd97ef Improve vm_object_scan_all_shadowed() to also check swap backing objects.
As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index.  Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both backing object page queue and swap
allocations when looking for shadowed pages.

Testing shows that number of new succesful scans, enabled by this
addition, is small but non-zero.  When worked out, the change both
further reduces the depth of the shadow object chain, and frees unused
but allocated swap and memory.

Suggested and reviewed by:	alc
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-18 20:56:14 +00:00
Konstantin Belousov
71057cd207 In swp_pager_meta_free_all(), fix type of the index variable. Style.
Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-16 23:33:37 +00:00
Alan Cox
bba39b9ae3 Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used.  More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8583
2016-11-22 18:13:46 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Alan Cox
ebcddc7217 Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specificially, dirty pages that have passed once through the inactive
queue.  A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them.  The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time.  In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low.  Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system.  Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages.  This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHE pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s).  A request is raised by setting the variable
vm_laundry_request and waking the laundry thread.  We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering").  When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering.  If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s.  Otherwise, the laundry thread goes back
to sleep without doing any work.  When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target.  In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced.  In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache().  The new meaning
of vm_cnt.v_reactivated now better reflects its name.  It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with:	markj
Reviewed by:	kib
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8302
2016-11-09 18:48:37 +00:00
Mark Johnston
dd9cb6da0b Respect the caller's hints when performing swap readahead.
The pager getpages interface allows the caller to bound the number of
readahead and readbehind pages, and vm_fault_hold() makes use of this
feature. These bounds were ignored after r305056, causing the swap pager
to potentially page in more than the specified number of pages.

Reported and reviewed by:	alc
X-MFC with:	r305056
2016-09-04 00:25:49 +00:00
Konstantin Belousov
9815066425 Make swapoff reliable.
The swap_pager_swapoff() function uses trylock for the object lock
before pagein, which means that either i/o to md(4) over swap, or
intensive page faults over swap pager objects might prevent swapoff()
from making any progress. Then the retry < 100 check fails and machine
panics.

If trylock fails, acquire the object lock in the blockable way and
restart the hash bucket walk.  Keep retries logic for now.

Reported and tested by:	pho
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7688
2016-08-31 14:49:58 +00:00
Mark Johnston
915d1b71cd Restore swap pager readahead after r292373.
The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7677
2016-08-30 05:56:21 +00:00
Konstantin Belousov
0c657d22eb Explain why swapgeom_close_ev() is delegated.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-03 07:11:19 +00:00
Konstantin Belousov
88ad2d7b47 Do not delegate a work to geom event thread which can be done inline.
In particular, swapongeom_ev() needed event thread context when swap
pager configuration was performed under Giant and geom asserted that
Giant is not owned.  Now both of the reason went away.

On the other hand, note that swpageom_release() is called from the
bio_done context, and possible close cannot be performed inline.

Also fix some minor issues.  The swapgeom() function does not use the
td argument, remove it.  Recheck that the vnode passed is still VCHR
and not reclaimed after the lock.

Reviewed by:	mav
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-28 15:57:01 +00:00