The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033
paging.
Shadow objects are marked with a COLLAPSING flag while they are collapsing with
their backing object. This gives us an explicit test rather than overloading
paging-in-progress. While split is on-going we mark an object with SPLIT.
These two operations will modify the swap tree so they must be serialized
and swap_pager_getpages() can now directly detect these conditions and page
more conservatively.
Callers to vm_object_collapse() now will reliably wait for a collapse to finish
so that the backing chain is as short as possible before other decisions are
made that may inflate the object chain. For example, split, coalesce, etc.
It is now safe to run fault concurrently with collapse. It is safe to increase
or decrease paging in progress with no lock so long as there is another valid
ref on increase.
This change makes collapse more reliable as a secondary benefit. The primary
benefit is making it safe to drop the object lock much earlier in fault or
never acquire it at all.
This was tested with a new shadow chain test script that uncovered long
standing bugs and will be integrated with stress2.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22908
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.
Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885
an exclusive object lock.
Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.
Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.
Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.
Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
The handle value is stable for all shadow objects in the inheritance
chain. This allows to avoid descending the shadow chain to get to the
bottom of it in vm_map_entry_set_vnode_text(), and eliminate
corresponding object relocking which appeared to be contending.
Change vm_object_allocate_anon() and vm_object_shadow() to handle more
of the cred/charge initialization for the new shadow object, in
addition to set up the handle.
Reported by: jeff
Reviewed by: alc (previous version), jeff (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differrential revision: https://reviews.freebsd.org/D22541
reudundant complicated checks and additional locking required only for
anonymous memory. Introduce vm_object_allocate_anon() to create these
objects. DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.
Reviewed by: alc, dougm, kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22119
only. Rename it swp_pager_meta_lookup. Stop checking for obj->type
== swap there and assert it instead. Make the caller responsible for
the obj->type check.
Move the meta_ctl 'pop' functionality to swap_pager_unswapped, the
only place that uses it, and assume obj->type == swap there too.
Assisted by: ota_j.email.ne.jp
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22437
exploits the sparsity of allocated blocks in a range, without
issuing an "are you there?" query for every block in the range.
swap_pager_copy() is not so smart. Modify the implementation
of swap_pager_meta_free() slightly so that swap_pager_copy()
can use that smarter implementation too.
Based on an observation of: Yoshihiro Ota (ota_j.email.ne.jp)
Reviewed by: kib,alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22280
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.
Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.
Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.
Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.
The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21456
expression howmany(BBSIZE, PAGE_SIZE), where BBSIZE is the size of the
boot block area. That can be less than 2 if PAGE_SIZE is big.
swapon(8) has an option to trim (delete) all the blocks of a device at
startup. However, if the first of those blocks is a bsd label, then
trimming those blocks is destructive. Change swapon to leave the
first BBSIZE bytes untrimmed.
Update manual pages to reflect changes in how swapon and how it may be
used, espeically in association with savecore.
Reviewed by: alc
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21191
counter, and the final freeing of freed swap blocks, outside the
region where an object lock is held. Correct some style(9) and
spelling errors. Change a panic() to a KASSERT(). Change a boolean_t
to a bool.
Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D21093
comment. Rewrite that comment to improve its clarity.
Reported by: cem
Reviewed by: alc, cem
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20871
after the one where the possible block allocation begins, and allocate
a larger number of blocks than the current limit. This does not affect
the limit on minimum allocation size, which still cannot exceed
BLIST_MAX_ALLOC.
Use this change to modify swp_pager_getswapspace and its callers, so
that they can allocate more than BLIST_MAX_ALLOC blocks if they are
available.
Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20579
swap_pager_swapoff_object and swp_pager_force_pagein so that they can
page in multiple pages at a time to a swap device, rather than doing
one I/O operation for each page.
Tested by: pho
Submitted by: ota_j.email.ne.jp (Yoshihiro Ota)
Reviewed by: alc, markj, kib
Approved by: kib, markj (mentors)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20635
sw_flags is set to the function argument several lines later.
Reported by: danfe using PVS-studio
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Use it instead of accessing the wire_count field directly. No
functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated are as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks instead, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.
I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
requested, or none, and in the latter case it is up to them to pick a
smaller request to make - which they always do by halving the failed
request. This change to swp_pager_getswapspace leaves the task of
downsizing the request to the function and not its caller. It still
does so by halving the original request.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20228
i386 is the only architecture where uint64_t does not specify 8-bytes
alignment, which makes struct xswdev layout not compatible between
64bit and i386.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
If user configured the maxswapzone tunable, just take the literal
value for the initial zone sizing attempt. Before, it was only
possible to reduce the zone by the tunable.
While there, correct the message which was not correct when zone
creation rounded the size up.
Reported by: jmg
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18381
This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934. In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed. After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17897
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.
Results in mmap speed up and a 70% increase in brk calls / second.
Reviewed by: alc@, markj@, kib@
Approved by: re (delphij@)
Differential Revision: https://reviews.freebsd.org/D16273
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone. (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)
Reviewed by: kib, markj
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17296
Before swp_pager_meta_build replaces an old swapblk with an new one,
it frees the old one. To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.
Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib, markj (an earlier version)
Tested by: pho
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13707
If fault started before vmspace_fork() locked the map, and then during
fork, vm_map_copy_entry()->vm_object_split() is executed, it is
possible that the fault instantiate the page into the original object
when the page was already copied into the new object (see
vm_map_split() for the orig/new objects terminology). This can happen
if split found a busy page (e.g. from the fault) and slept dropping
the objects lock, which allows the swap pager to instantiate
read-behind pages for the fault. Then the restart of the scan can see
a page in the scanned range, where it was already copied to the upper
object.
Fix it by instantiating the read-ahead pages before
swap_pager_getpages() method drops the lock to allocate pbuf. The
object scan would see the whole range prefilled with the busy pages
and not proceed the range.
Note that vm_fault rechecks the map generation count after the object
unlock, so that it restarts the handling if raced with split, and
re-lookups the right page from the upper object.
In collaboration with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
swp_pager_meta_ctl(), with no opportunity to recognize freeing of
consecutive blocks and free fewer block ranges. To open that opportunity,
this change removes the SWM_FREE option from swp_pager_meta_ctl(), and
compels the caller to do the freeing when a valid block address is returned.
In swap_pager_copy(), these frees are aggregated, so that a sequence of them
can be done at one time.
The only other caller to swp_pager_meta_ctl() that passed SWM_FREE,
swp_pager_unswapped(), is also modified to handle its single free
explicitly.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13290
blocks in a single call to blist_alloc(). However, when it frees
that space, it previously called blist_free() on each block, one at a
time. With this change, the swap pager identifies ranges of
contiguous blocks to be freed, and calls blist_free() once per
range. In one extreme case, that is described in the review, the time
to perform an munmap(2) was reduced by 55%.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12397
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
one call to sysctl(2) from jemalloc startup code. (That also requires
changes to jemalloc, but I plan to push those to upstream first.)
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D12745