Commit Graph

4385 Commits

Author SHA1 Message Date
Ryan Libby
7508f15ff1 uma dbg: flexible size for slab debug bitset too
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative.  Now, make the debug bitset
size dynamic too.  The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22759
2019-12-13 09:31:59 +00:00
Mark Johnston
cbc080b4c4 Avoid relying on silent type casting in the native atomic_load_32.
Reported by:	np
2019-12-12 23:55:34 +00:00
Mark Johnston
6fbaf6859c Implement atomic state updates using the new vm_page_astate_t structure.
Introduce primitives vm_page_astate_load() and vm_page_astate_fcmpset()
to operate on the 32-bit per-page atomic state.  Modify
vm_page_pqstate_fcmpset() to use them.  No functional change intended.

Introduce PGA_QUEUE_OP_MASK, a subset of PGA_QUEUE_STATE_MASK that only
includes queue operation flags.  This will be used in subsequent
patches.

Reviewed by:	alc, jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22753
2019-12-12 21:13:20 +00:00
Doug Moore
037c0994bf Extract code common to _vm_map_clip_start and _vm_map_clip_end into a
function, vm_map_entry_clone, that can be invoked by each.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22760
2019-12-11 16:09:57 +00:00
Ryan Libby
6d204a6a0e uma: pretty print zone flags sysctl
Requested by:	jeff
Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22748
2019-12-11 06:50:55 +00:00
Mark Johnston
9be9ea420e Add a helper function to the swapout daemon's deactivation code.
vm_swapout_object_deactivate_pages() is renamed to
vm_swapout_object_deactivate(), and the loop body is moved into the new
vm_swapout_object_deactivate_page().  This makes the code a bit easier
to follow and is in preparation for some functional changes.

Reviewed by:	jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22651
2019-12-10 18:15:20 +00:00
Mark Johnston
5cff1f4dc3 Introduce vm_page_astate.
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state.  The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields.  No functional change intended.

Reviewed by:	alc, jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22650
2019-12-10 18:14:50 +00:00
Doug Moore
7887f00d2b Revert r355505. The code that it allowed to compile has been removed. 2019-12-09 05:09:46 +00:00
Doug Moore
8b75b1ad0d Define a vm_map method for user-space for advancing from a map entry
to its successor in cases where examining a map entry requires a
helper like kvm_read_all.  Use that method, with kvm_read_all, to fix
procstat_getfiles_kvm, which tries to find the successor now without
using such a helper.  This addresses a problem introduced by r355491.

Reviewed by: markj (previous version)
Discussed with: kib
Differential Revision: https://reviews.freebsd.org/D22728
2019-12-08 22:33:51 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Jeff Roberson
3b490537f4 Fix two problems with r355149. The sysctl name collision code assumed that
zones would never be freed.  In the case of tmpfs this was not true.  While
here test for the right bit to disable the keg related sysctls for zones
that don't have kegs.

Reported by:	pho
Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22655
2019-12-08 01:55:23 +00:00
Jeff Roberson
cff8481de4 It is safe to wire a page while the object is busy.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22636
2019-12-08 01:49:53 +00:00
Jeff Roberson
2306558c54 It is now safe to rename a page that is still on a queue. Allowing this
is necessary for a forthcoming patch.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22636
2019-12-08 01:49:03 +00:00
Jeff Roberson
d8ad7b7d6a Do not assert that the object lock is held in vm_object_set_writeable_dirty.
A valid reference is all that is required.  If we race with a deallocation
we will harmlessly misidentify the type of an already dead object.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22636
2019-12-08 01:47:29 +00:00
Jeff Roberson
fb1d575ceb Reduce duplication in grab functions by providing allocflags based inlines.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22635
2019-12-08 01:16:22 +00:00
Jeff Roberson
1e0701e1e5 Use a variant slab structure for offpage zones. This saves space in
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22609
2019-12-08 01:15:06 +00:00
Mark Johnston
c0829bb1d6 Add casts required by the 32-bit build after r355491. 2019-12-08 00:02:36 +00:00
Mark Johnston
3098cd73a3 Provide vm_map_entry traversal routines to userspace.
This is required for now to allow libprocstat to compile.

Discussed with:	dougm
2019-12-07 19:36:40 +00:00
Mateusz Guzik
91caa9b8c1 vm: fix sysctl vm.kstack_cache_size change report
Cache gets resized correctly, but sysctl reports the wrong number:
# sysctl vm.kstack_cache_size=512
vm.kstack_cache_size: 128 -> 128

patched:
vm.kstack_cache_size: 128 -> 512

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22717
Fixes:	r355002 "Revise the page cache size policy."
2019-12-07 17:28:41 +00:00
Doug Moore
c1ad5342a6 Remove the next and prev fields from vm_map_entry, to save a bit of
space.  Where the vm_map tree now has null pointers, store pointers to
next and previous entries in right and left fields, making the binary
tree threaded.  Have the predecessor and successor functions compute
what the prev and next fields previously stored.

Reviewed by: markj, kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D21964
2019-12-07 17:14:33 +00:00
Justin Hibbits
caef3e1280 powerpc/pmap: NUMA-ize vm_page_array on powerpc
Summary:
This matches r351198 from amd64.  This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similar to
existing.  Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains.  On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain.  After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short.  This means
some inner domains may have pages accounted in earlier domains.

On Book-E the page array is setup at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit.  This reduces the TLB0
overhead for touching the vm_page_array, which reduces up to one TLB
miss per array access.

Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.

Reviewed by:	luporl
Differential Revision:	https://reviews.freebsd.org/D21449
2019-12-07 03:34:03 +00:00
Mark Johnston
a6f21d15dd Fix fault_type handling in vm_map_lookup().
Suppose that the map entry is wired, so that we later assign
fault_type = entry->protection.  Suppose further that we jump back to
RetryLookup.  Then fault_type will no longer contain the original
fault protection mask, but instead that of the wired entry.

Submitted by:	Wuyang Chung <wuyang.chung1@gmail.com>
Reviewed by:	kib
MFC after:	3 days
Github PR:	https://github.com/freebsd/freebsd/pull/419
Differential Revision:	https://reviews.freebsd.org/D22683
2019-12-06 23:39:08 +00:00
Mark Johnston
ed2f945a39 Fix an off-by-one error in vm_map_pmap_enter().
If the starting pindex is equal to object->size, there is nothing to do.
This was harmless since the rest of vm_map_pmap_enter() has no effect
when psize == 0.

Submitted by:	Wuyang Chung <wuyang.chung1@gmail.com>
Reviewed by:	alc, dougm, kib
MFC after:	1 week
Github PR:	https://github.com/freebsd/freebsd/pull/417
Differential Revision:	https://reviews.freebsd.org/D22678
2019-12-04 19:46:48 +00:00
Andrew Turner
b75c4efcd2 Fix the signature for zone_import and zone_release
These are cast to uma_import and uma_release functions. Use the signature
for these in the zone functions.

This was found with an experimental Kernel CFI. It will complain if the
signature is different than what a function pointer expects. The
simplest way to fix these is to correct the signature.

Reviewed by:	rlibby
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22671
2019-12-04 18:40:05 +00:00
Jeff Roberson
9b78b1f433 Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by:	markj, rlibby (prior version)
Differential Revision:	https://reviews.freebsd.org/D22584
2019-12-02 22:44:34 +00:00
Jeff Roberson
0f9e06e18b Fix a few places that free a page from an object without busy held. This is
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.

Reviewed by:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22611
2019-12-02 22:42:05 +00:00
Konstantin Belousov
67388836f3 Store the bottom of the shadow chain in OBJ_ANON object->handle member.
The handle value is stable for all shadow objects in the inheritance
chain.  This allows to avoid descending the shadow chain to get to the
bottom of it in vm_map_entry_set_vnode_text(), and eliminate
corresponding object relocking which appeared to be contending.

Change vm_object_allocate_anon() and vm_object_shadow() to handle more
of the cred/charge initialization for the new shadow object, in
addition to set up the handle.

Reported by:	jeff
Reviewed by:	alc (previous version), jeff (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differrential revision:	https://reviews.freebsd.org/D22541
2019-12-01 20:43:04 +00:00
Jeff Roberson
886b90219a Restore swap space accounting for non-anonymous swap objects. This was
broken in r355082.  Reduce some locking in nearby related object type
checks.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22565
2019-11-29 19:57:49 +00:00
Jeff Roberson
f2410510db Avoid acquiring the object lock if color is already set. It can not be
unset until the object is recycled so this check is stable.  Now that we
can acquire the ref without a lock it is not necessary to group these
operations and we can avoid it entirely in many cases.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22565
2019-11-29 19:49:20 +00:00
Jeff Roberson
26c4e9831b Fix a perf regression from r355122. We can use a shared lock to drop the
last ref on vnodes.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22565
2019-11-29 19:47:40 +00:00
Jeff Roberson
6d6a03d7a8 Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22563
2019-11-29 03:14:10 +00:00
Doug Moore
85b7bedb15 Functions that call vm_map_splay_merge sometimes set data fields
(e.g. root->left = NULL) to affect the behavior of that function. This
change stops that data manipulation, and instead calls a pair of
functions, one for the left direction and the other for the right,
with the function called depending whether or not we currently null
the root child in that direction to control the behavior of
vm_map_splay_merge.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22589
2019-11-29 02:06:45 +00:00
Jeff Roberson
584061b480 Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab.  Remove some nearby
dead code.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22564
2019-11-28 07:49:25 +00:00
Ryan Libby
35ec24f362 uma: move sysctl vm.uma defn out from under INVARIANTS
Fix non-INVARIANTS builds after r355149.

Reported by:	Michael Butler <imb@protected-networks.net>
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22588
2019-11-28 04:15:16 +00:00
Jeff Roberson
20a4e15451 Implement a sysctl tree for uma zones to assist in debugging and provide
more statistcs than are exported via the ABI stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22554
2019-11-28 00:19:09 +00:00
Jeff Roberson
0a81b4395e Refactor uma_zfree_arg into several functions to make control flow more
clear and icache usage cleaner.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22491
2019-11-27 23:19:06 +00:00
Doug Moore
1867d2f2e9 Inline some splay helper functions to improve performance on a
micro-benchmark.

Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22544
2019-11-27 21:00:44 +00:00
Ryan Libby
ca293436d1 uma: trash memory when ctor/dtor supplied too
On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function.  Now, it only
requires that uminit and fini be NULL; now, the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.

Also do a little refactoring for readability of the resulting logic.

This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20722
2019-11-27 19:49:55 +00:00
Jeff Roberson
a67d540832 Use atomics in more cases for object references. We now can completely
omit the object lock if we are above a certain threshold.  Hold only a
single vnode reference when the vnode object has any ref > 0.  This
allows us to only lock the object and vnode on 0-1 and 1-0 transitions.

Differential Revision:	https://reviews.freebsd.org/D22452
2019-11-27 00:39:23 +00:00
Jeff Roberson
beb8beef81 Refactor uma_zalloc_arg(). It is a mess of gotos and code which doesn't
make sense after many partial refactors.  Attempt to make a smaller cache
footprint for the fast path.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22470
2019-11-26 22:17:02 +00:00
Ryan Libby
6a14746c01 vm_object_collapse_scan_wait: drop locks before reacquiring
Regression from r352174.  In the vm_page_rename() failure case we forgot
to unlock the vm object locks before sleeping and reacquiring them.

Reviewed by:	jeff
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22542
2019-11-25 07:38:31 +00:00
Jeff Roberson
4d987866e6 Move anonymous object copying for fork into its own routine and so that we
can avoid locking non-anonymous objects.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D22472
2019-11-25 07:13:05 +00:00
Doug Moore
2767c9f36a Where 'current' is used to index over vm_map entries, use
'entry'. Where 'entry' is used to identify the starting point for
iteration, use 'first_entry'. These are the naming conventions used in
most of the vm_map.c code.  Where VM_MAP_ENTRY_FOREACH can be used, do
so. Squeeze a few lines to fit in 80 columns.  Where lines are being
modified for these reasons, look to remove style(9) violations.

Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D22458
2019-11-25 02:19:47 +00:00
Konstantin Belousov
3236244936 Ignore object->handle for OBJ_ANON objects.
Note that the change in vm_object_collapse() is arguably a correctness
fix.  We must not collapse into content-identity carrying objects.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22467
2019-11-24 19:18:12 +00:00
Konstantin Belousov
b631c36f0d Record part of the owner struct thread pointer into busy_lock.
Record as much bits from curthread into busy_lock as fits.  Low bits
for struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0).  Upper bits are not very interesting
for assert, and in most practical situations recorded value should
allow to manually identify the owner with certainity.

Assert that unbusy is performed by the owner, except few places where
unbusy is done in io completion handler.  For this case, add
_unchecked variants of asserts and unbusy primitives.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22298
2019-11-24 19:12:23 +00:00
Mark Johnston
9c770a27ce Simplify vm_pageout_init_domain() and add a "big picture" comment.
Stop subtracting 1024/200 from vmd_page_count/200.  I cannot see how
such precise accounting can make a difference on modern systems.

Add some explanation of what the page daemon does and how it handles
memory shortages.

Reviewed by:	dougm
Discussed with:	jeff, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22396
2019-11-22 16:31:43 +00:00
Mark Johnston
8fc2550837 Reclaim memory from UMA if the page daemon is struggling.
Use the UMA reclaim thread to asynchronously drain all caches if
there is a severe shortage in a domain.  Otherwise we only trigger UMA
reclamation every 10s even when the system has completely run out of
memory.

Stop entirely draining the caches when one domain falls below its min
threshold.  In some workloads it is normal for one NUMA domain to end
up being nearly depleted by kernel memory allocations, for example for
the ZFS ARC.  The domainset iterators skip domains below the
vmd_min_free theshold on the first iteration, so we should allow that
mechanism to limit further depletion of the domain's free pages before
taking the extreme step of calling uma_reclaim(UMA_RECLAIM_DRAIN_CPU).

Discussed with:	jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22395
2019-11-22 16:31:30 +00:00
Mark Johnston
bf0d60af92 Update the checks in vm_page_zone_import().
- Remove the cnt == 1 check.  UMA passes cnt == 1 when it has disabled
  per-CPU caching.  In this case we might as well just allocate a single
  page and return it to the caller, since the caller is going to do
  exactly that anyway if the UMA cache allocation attempt fails.
- Don't replenish caches if the domain is severely short on free pages.
  With large buckets we may otherwise quickly exacerbate a situation
  where the page daemon is failing to keep up.
- Don't replenish caches if the calling thread belongs to the page
  daemon, which should avoid creating extra memory pressure when it is
  trying to free memory.  Virtually all such allocations while occur in
  the context of laundering, where the laundry thread must allocate
  slabs for various swap and I/O-related UMA zones.

Reviewed by:	kib
Discussed with:	alc, jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22394
2019-11-22 16:31:10 +00:00
Mark Johnston
003cf08ba9 Revise the page cache size policy.
In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU.  This was to mitigate some
issues reported with the system not able to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache.  This change re-enables those caches.

The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones.  Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested.  Remove its return value since it has no use.

Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages.  The limit can be overridden by the
vm.pgcache_zone_max tunable as before.

Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET.  This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.

Reviewed by:	gallatin, jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22393
2019-11-22 16:30:47 +00:00
Mark Johnston
b378d29687 Fix locking in vm_reserv_reclaim_contig().
We were not properly handling the case where the trylock of the
reservaton fails, in which case we could leak reservation lock.

Introduce a marker reservation to implement precise scanning in
vm_reserv_reclaim_contig().  Before, a race could result in early
termination of the scan in rare situations.  Use the marker's lock to
serialize scans of the partpop queue so that a global marker structure
can be used.  Modify vm_reserv_reclaim_inactive() to handle the presence
of a marker while minimizing the hold time of domain-global locks.

Reviewed by:	alc, jeff, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22392
2019-11-22 16:28:52 +00:00
Andrew Turner
09a65f9ff5 As with r354905 use uint16_t to store aflags on the stack and as function
arguments as the aflags size in vm_page_t has increased.

Sponsored by:	DARPA, AFRL
2019-11-20 18:00:43 +00:00
Andrew Turner
ad216bc10d Use atomic_load_16 to load aflags as it's a uint16_t after r354820.
Sponsored by:	DARPA, AFRL
2019-11-20 17:49:58 +00:00
Doug Moore
83704cc236 Instead of looking up a predecessor or successor to the current map
entry, when that entry has been seen already, keep the
already-looked-up value in a variable and use that instead of looking
it up again.

Approved by: alc, markj (earlier version), kib (earlier version)
Differential Revision: https://reviews.freebsd.org/D22348
2019-11-20 16:06:48 +00:00
Jeff Roberson
71353f7a2f When we set OFFPAGE to limit fragmentation we should also set VTOSLAB
so that we avoid the hashtables.  The hashtable is now only required if
a zone is created with OFFPAGE specified initially, not internally.  This
flag signals to UMA that it can't touch the allocated memory and so
can't store a slab pointer in the containing page.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22453
2019-11-20 01:57:33 +00:00
Jeff Roberson
51b867e56b Only keep anonymous objects on shadow lists. This eliminates locking of
globally visible objects when they are part of a backing chain.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22423
2019-11-20 00:31:14 +00:00
Jeff Roberson
7f935055d3 Remove unnecessary object locking from the vnode pager. Recent changes to
busy/valid/dirty locking make these acquires redundant.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22186
2019-11-19 23:30:09 +00:00
Jeff Roberson
639676877b Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates
reudundant complicated checks and additional locking required only for
anonymous memory.  Introduce vm_object_allocate_anon() to create these
objects.  DEFAULT and SWAP objects now have the correct settings for
non-anonymous consumers and so individual consumers need not modify the
default flags to create super-pages and avoid ONEMAPPING/NOSPLIT.

Reviewed by:	alc, dougm, kib, markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22119
2019-11-19 23:19:43 +00:00
Doug Moore
8ecbf14b74 Drop the extra argument from swp_pager_meta_ctl and have it do lookup
only.  Rename it swp_pager_meta_lookup.  Stop checking for obj->type
== swap there and assert it instead.  Make the caller responsible for
the obj->type check.

Move the meta_ctl 'pop' functionality to swap_pager_unswapped, the
only place that uses it, and assume obj->type == swap there too.

Assisted by: ota_j.email.ne.jp
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22437
2019-11-19 08:06:31 +00:00
Mark Johnston
fe6d5344c2 Group per-domain reservation data in the same structure.
We currently have the per-domain partially populated reservation queues
and the per-domain queue locks.  Define a new per-domain padded
structure to contain both of them.  This puts the queue fields and lock
in the same cache line and avoids the false sharing within the old queue
array.

Also fix field packing in the reservation structure.  In many places we
assume that a domain index fits in 8 bits, so we can do the same there
as well.  This reduces the size of the structure by 8 bytes.

Update some comments while here.  No functional change intended.

Reviewed by:	dougm, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22391
2019-11-18 18:25:51 +00:00
Mark Johnston
3a2ba9974d Widen the vm_page aflags field to 16 bits.
We are now out of aflags bits, whereas the "flags" field only makes use
of five of its sixteen bits, so narrow "flags" to eight bits.  I have no
intention of adding a new aflag in the near future, but would like to
combine the aflags, queue and act_count fields into a single atomically
updated word.  This will allow vm_page_pqstate_cmpset() to become much
simpler and is a step towards eliminating the use of the page lock array
in updating per-page queue state.

The change modifies the layout of struct vm_page, so bump
__FreeBSD_version.

Reviewed by:	alc, dougm, jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22397
2019-11-18 18:22:41 +00:00
Doug Moore
abdab7b633 Add a helper function for testing a swap block and freeing it if empty.
Submitted by: ota_j.email.ne.jp
Approved by: alc, kib, dougm
Differential Revision: https://reviews.freebsd.org/D22402
2019-11-17 18:38:37 +00:00
Konstantin Belousov
156e865494 Add elf image flag to disable stack gap.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22379
2019-11-17 14:54:07 +00:00
Doug Moore
bdb90e7613 The loop in vm_map_protect that verifies that all transition map
entries are stabilized, repeatedly verifies the same entry. Check each
entry in turn.

Reviewed by: kib (code only), alc
Tested by: pho
MFC after: 7 days
Differential Revision: https://reviews.freebsd.org/D22405
2019-11-17 06:50:36 +00:00
Doug Moore
7cdcf86360 Define wrapper functions vm_map_entry_{succ,pred} to act as wrappers
around entry->{next,prev} when those are used for ordered list
traversal, and use those wrapper functions everywhere. Where the next
field is used for maintaining a stack of deferred operations, #define
defer_next to make that different usage clearer, and then use the
'right' pointer instead of 'next' for that purpose.

Approved by: markj
Tested by: pho (as part of a larger patch)
Differential Revision: https://reviews.freebsd.org/D22347
2019-11-13 15:56:07 +00:00
Doug Moore
467057fcd9 swap_pager_meta_free() frees allocated blocks in a way that
exploits the sparsity of allocated blocks in a range, without
issuing an "are you there?" query for every block in the range.
swap_pager_copy() is not so smart.  Modify the implementation
of swap_pager_meta_free() slightly so that swap_pager_copy()
can use that smarter implementation too.

Based on an observation of: Yoshihiro Ota (ota_j.email.ne.jp)
Reviewed by: kib,alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22280
2019-11-11 16:59:49 +00:00
Konstantin Belousov
08034d1006 Include cache zones into zone_foreach() where appropriate.
The r354367 is reverted since it is subsumed by this, more complete, approach.

Suggested by:	markj
Reviewed by:	alc. glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22242
2019-11-10 09:25:19 +00:00
Doug Moore
461587dc9b For vm_map, #defining DIAGNOSTIC to turn on full assertion-based
consistency checking slows performance dramatically. This change
reduces the number of assertions checked by completely walking the
vm_map tree only when the write-lock is released, and only then if the
number of modifications to the tree since the last walk exceeds the
number of tree nodes.

Reviewed by: alc, kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22163
2019-11-09 17:08:27 +00:00
Mark Johnston
c95f8ed885 Drop Giant before sleeping on a busy page.
Before the page busy code was converted to make direct use of
sleepqueues, this was handled by _sleep().

Reported by:	glebius
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
2019-11-07 18:26:29 +00:00
Mark Johnston
be801aaaef Fix a race in release_page().
Since r354156 we may call release_page() without the page's object lock
held, specifically following the page copy during a CoW fault.
release_page() must therefore unbusy the page only after scheduling the
requeue, to avoid racing with a free of the page.  Previously, the
object lock prevented this race from occurring.

Add some assertions that were helpful in tracking this down.

Reported by:	pho, syzkaller
Tested by:	pho
Reviewed by:	alc, jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22234
2019-11-06 16:59:16 +00:00
Konstantin Belousov
432fc36da1 Switch cache zones from early counters to real implementation.
Early counter mock can be only used on BSP for amd64, when APs try to
update it that causes random memory corruption.

N.B.  This is a temporary patch to plug the corruption for now, while
a proper solution for handling cache zones in zone_foreach() is being
developed.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
2019-11-05 21:38:48 +00:00
Konstantin Belousov
afa7e88a18 vm_page_wire_mapped: explain why failure does not affect correctness.
Reviewed by:	markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22196
2019-10-30 17:33:17 +00:00
Jeff Roberson
67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Jeff Roberson
51df53213c Use atomics and a shared object lock to protect the object reference count.
Certain consumers still need to guarantee a stable reference so we can not
switch entirely to atomics yet.  Exclusive lock holders can still modify
and examine the refcount without using the ref api.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21598
2019-10-29 20:58:46 +00:00
Jeff Roberson
4b3e066539 Drop the object lock earlier in fault and don't relock it after pmap_enter().
Recent changes in object and page locking have enabled more lock pushdown.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22036
2019-10-29 20:46:25 +00:00
Gleb Smirnoff
42a621624d Add couple more assertions to vm_pager_assert_in(). The bogus page is
not allowed at ends of the request, and all non-bogus pages must be
consecutive.

Reviewed by:	kib
2019-10-25 16:59:54 +00:00
Andrew Gallatin
0dc59d7632 Add a tunable to set the pgcache zone's maxcache
When it is set to 0 (the default), a heavy Netflix-style web workload
suffers from heavy lock contention on the vm page free queue called from
vm_page_zone_{import,release}() as the buckets are frequently drained.
When setting the maxcache, this contention goes away.

We should eventually try to autotune this, as well as make this
zone eligable for uma_reclaim().

Reviewed by:	alc, markj
Not Objected to by: jeff
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22112
2019-10-24 18:39:05 +00:00
Mark Johnston
be2c561003 Modify release_page() to handle a missing fault page.
r353890 introduced a case where we may call release_page() with
fs.m == NULL, since the fault handler may now lock the vnode prior
to allocating a page for a page-in.

Reported by:	jhb
Reviewed by:	kib
MFC with:	r353890
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22120
2019-10-23 20:39:21 +00:00
Mark Johnston
2f81c92e55 Check for bogus_page in vnode_pager_generic_getpages_done().
We now assert that a page is busy when updating its validity-tracking
state, but bogus_page is not busied during a getpages operation.

Reported by:	syzkaller
Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22124
2019-10-23 18:00:22 +00:00
Mark Johnston
e6f1a58082 Verify identity after checking for WAITFAIL in vm_page_busy_acquire().
A caller that does not guarantee that a page's identity won't change
while sleeping for a busy lock must specify either NOWAIT or WAITFAIL.

Reported by:	syzkaller
Reviewed by:	alc, kib
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22124
2019-10-23 17:58:19 +00:00
Konstantin Belousov
16b0c09225 Assert that vm_fault_lock_vnode() returns locked saved vnode.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22113
2019-10-23 07:36:26 +00:00
Konstantin Belousov
5b87ecc643 Assert that vnode_pager_setsize() is called with the vnode exclusively locked
except for filesystems that set the MNTK_VMSETSIZE_BUG,  Set the flag for ZFS.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D21883
2019-10-22 16:21:24 +00:00
Konstantin Belousov
208b81bb05 Add VV_VMSIZEVNLOCK flag.
The flag specifies that vm_fault() handler should check the vnode'
vm_object size under the vnode lock.  It is converted into the object'
OBJ_SIZEVNLOCK flag in vnode_pager_alloc().

Tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D21883
2019-10-22 16:09:25 +00:00
Konstantin Belousov
0ddd3082a4 vm_fault(): extract code to lock the vnode into a helper vn_fault_lock_vnode().
Tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D21883
2019-10-22 15:59:16 +00:00
Mark Johnston
1de9724e55 Avoid reloading bucket pointers in uma_vm_zone_stats().
The correctness of per-CPU cache accounting in that function is
dependent on reading per-CPU pointers exactly once.  Ensure that
the compiler does not emit multiple loads of those pointers.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22081
2019-10-22 14:20:06 +00:00
Mark Johnston
d307bdcc2c Further constrain the use of per-CPU caches for free pages.
In low memory conditions a significant number of pages may end up stuck
in the caches, and currently these caches cannot be reaped, leading to
spurious memory allocation failures and OOM kills.  So:

- Take into account the fact that we may cache up to two full buckets
  of pages per CPU, not just one.
- Increase the amount of RAM required per CPU to enable the caches.

This is a temporary measure until the page cache management policy is
improved.

PR:		241048
Reported and tested by:	Kevin Oberman <rkoberman@gmail.com>
Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22040
2019-10-18 17:36:42 +00:00
Mark Johnston
f822c9e287 Apply mapping protections to preloaded kernel modules on amd64.
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21860
2019-10-18 13:56:45 +00:00
Konstantin Belousov
303fa05a1f swapon_check_swzone(): use already calculated static variables.
Submitted by:	ota@j.email.ne.jp
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22065
2019-10-17 13:49:47 +00:00
Mark Johnston
01cef4caa7 Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore().  Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21823
2019-10-16 22:03:27 +00:00
Mark Johnston
d0c9294b81 Correct the range boundaries used by kern_mincore().
Reported by:	alc
Sponsored by:	Netflix
2019-10-16 21:47:58 +00:00
Jeff Roberson
fff5403f84 (5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the
object lock in vm_page_set_validclean().

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21595
2019-10-15 03:48:22 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Jeff Roberson
205be21d99 (3/6) Add a shared object busy synchronization mechanism that blocks new page
busy acquires while held.

This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held.  This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.

Reviewed by:    kib
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21592
2019-10-15 03:41:36 +00:00
Jeff Roberson
8da1c09853 (2/6) Don't release xbusy in vm_page_remove(), defer to vm_page_free_prep().
This persists busy state across operations like rename and replace.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:  https://reviews.freebsd.org/D21549
2019-10-15 03:38:02 +00:00
Jeff Roberson
63e9755548 (1/6) Replace busy checks with acquires where it is trival to do so.
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21548
2019-10-15 03:35:11 +00:00
Doug Moore
32731f2eb1 Correct a transcription error that broke GENERIC introduced in r353496. 2019-10-14 17:51:57 +00:00
Doug Moore
721899b1f1 Move the definition of _vm_map_assert_consistent so that it can use
vm_map_free_{left,right} rather than re-implementing them.  Use the
VM_MAP_FOREACH macro where applicable.  Fix some indentation.

Suggested by: kib (in a comment on D21964)
Tested by: pho (as part of D21964)
Differential Revision: https://reviews.freebsd.org/D22011
2019-10-14 17:15:42 +00:00
Leandro Lupori
0ecc478b74 [PPC64] Initial kernel minidump implementation
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).

Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D21551
2019-10-14 13:04:04 +00:00
Konstantin Belousov
c31cec4552 Restore nofaulting operations after r352807
The TDP_NOFAULTING flag should be checked in vm_fault(), not in
vm_fault_trap().  Otherwise kernel accesses to userspace, like
vn_io_fault(), enter vm locking when it should not.

Reported and tested by:	pho
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D21992
2019-10-13 06:56:45 +00:00
Conrad Meyer
0223790f8d Fix braino in r353429
cy@ points out that I got parameter order backwards between definition and
invocation of the helper function.  He is totally correct.  The earlier
version of this patch predated the XFree column so this is one I introduced,
rather than the original author.

Submitted by:	cy
Reported by:	cy
X-MFC-With:	r353429
2019-10-11 06:02:03 +00:00
Conrad Meyer
46d70077be ddb: Add CSV option, sorting to 'show (malloc|uma)'
Add /i option for machine-parseable CSV output.  This allows ready copy/
pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first).  This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by:	Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by:	Dell EMC Isilon
2019-10-11 01:31:31 +00:00
Doug Moore
2288078c5e Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21882
2019-10-08 07:14:21 +00:00
Mark Johnston
4090e2170d Assert that the PGA_{WRITEABLE,EXECUTABLE} flags do not leak.
Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21783
2019-10-07 23:31:17 +00:00
Mateusz Guzik
7b1fbc424a vm: stop trylocking page queues in vm_page_pqbatch_submit
About 11 minutes of poudriere -s -j 104 and probing on return value of
trylocks reveals that over 10% of attempts fail, which in turn means
there are more atomics performed than necessary.

Trylocking was there to try preventing migration, but it's not very likely
to happen if the lock is uncontested.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21925
2019-10-07 23:19:09 +00:00
Konstantin Belousov
df08823d07 Improve MD page fault handlers.
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap().  MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.

This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.

Change the delivered signal for accesses to mapped area after the
backing object was truncated.  According to POSIX description for
mmap(2):
   The system shall always zero-fill any partial page at the end of an
   object. Further, the system shall never write out any modified
   portions of the last page of an object which are beyond its
   end. References within the address range starting at pa and
   continuing for len bytes to whole pages following the end of an
   object shall result in delivery of a SIGBUS signal.

   An implementation may generate SIGBUS signals when a reference
   would cause an error in the mapped object, such as out-of-space
   condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.

For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered.  Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.

vm_fault_hold() is renamed to vm_fault().  Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos.  Removed
unneeded truncations of the fault addresses reported by hardware.

PR:	211924
Reviewed by:	alc
Discussed with:	jilles, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21566
2019-09-27 18:43:36 +00:00
Mark Johnston
7cc833c598 Fix a race in vm_page_swapqueue().
vm_page_swapqueue() atomically transitions a page between queues.  To do
so, it must hold the page queue lock for the old queue.  However, once
the queue index has been updated, the queue lock no longer protects the
page's queue state.  Thus, we must speculatively remove the page from
the old queue before committing the queue state update, and roll back if
the update fails.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21791
2019-09-27 16:46:08 +00:00
Mark Johnston
87e93ea6c3 Fix object locking in vm_object_unwire() after r352174.
Now, vm_page_busy_sleep() expects the page's object to be locked.
vm_object_unwire() does some unusual lazy locking of the object chain
and keeps objects locked until a busy page is encountered or the loop
terminates.  When a busy page is encountered, rather than unlocking all
but the "bottom-level" object, we must instead skip the object to which
"tm" belongs.

Reported and tested by:	pho
Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21790
2019-09-27 16:41:34 +00:00
Mark Johnston
2b93f779d2 Add some counters for per-VM page events.
For now, just count batched page queue state operations.
vm.stats.page.queue_ops counts the number of batch entries that
successfully completed, while queue_nops counts entries that had no
effect, which occurs when the queue operation had been completed before
the batch entry was processed.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21782
2019-09-25 17:08:35 +00:00
Mark Johnston
b119329d81 Complete the removal of the "wire_count" field from struct vm_page.
Convert all remaining references to that field to "ref_count" and update
comments accordingly.  No functional change intended.

Reviewed by:	alc, kib
Sponsored by:	Intel, Netflix
Differential Revision:	https://reviews.freebsd.org/D21768
2019-09-25 16:11:35 +00:00
Allan Jude
4a9c211af5 sys/vm/vm_glue.c: Incorrect function name in panic string
Use __func__ to avoid this issue in the future.

Submitted by:	Wuyang Chung <wuyang.chung1@gmail.com>
Reviewed by:	markj, emaste
Obtained from:	https://github.com/freebsd/freebsd/pull/410
2019-09-19 07:28:24 +00:00
Doug Moore
1399b98ebd Remove dead code from vm_map_unlink_entry made dead by r351476, and also
a no-longer-used enumerant.

Reviewed by: alc
Approved by: markj (mentor, implicit)
Tested by: pho (as part of a larger change)
Differential Revision: https://reviews.freebsd.org/D21668
2019-09-17 02:53:59 +00:00
Mateusz Guzik
a8c8e44bf0 vfs: manage mnt_ref with atomics
New primitive is introduced to denote sections can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspendion or other events.

mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.

Reviewed by:	kib, jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21425
2019-09-16 21:31:02 +00:00
Mark Johnston
38547d5980 Assert that the refcount value is not VPRC_BLOCKED in vm_page_drop().
VPRC_BLOCKED is a refcount flag used to indicate that a thread is
tearing down mappings of a page.  When set, it causes attempts to wire a
page via a pmap lookup to fail.  It should never represent the last
reference to a page, so assert this.

Suggested by:	kib
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:16:48 +00:00
Mark Johnston
923da43e7c Fix a race in vm_page_dequeue_deferred_free() after r352110.
This function loaded the page's queue index before setting PGA_DEQUEUE.
In this window the page daemon may have deactivated the page, updating
its queue index.  Make the operation atomic using vm_page_pqstate_cmpset();
the page daemon will not modify the page once it observes that PGA_DEQUEUE
is set.

Reported and tested by:	pho
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:12:49 +00:00
Mark Johnston
47aef898ea Fix a page leak in vm_page_reclaim_run().
After r352110 the attempt to remove mappings of the page being replaced
may fail if the page is wired.  In this case we must free the replacement
page.

Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:09:31 +00:00
Mark Johnston
e8bcf6966b Revert r352406, which contained changes I didn't intend to commit. 2019-09-16 15:04:45 +00:00
Mark Johnston
41fd4b9422 Fix a couple of nits in r352110.
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:03:12 +00:00
Hans Petter Selasky
11b57401e6 Use REFCOUNT_COUNT() to obtain refcount where appropriate.
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.

Differential Revision:	https://reviews.freebsd.org/D21620
Reviewed by:	kib@, markj@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-09-12 16:26:59 +00:00
Jeff Roberson
c75757481f Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
 - vm_page_grab_valid() will grab and page in if necessary.
 - vm_page_busy_acquire() automates some busy acquire loops.

Discussed with:	alc, kib, markj
Tested by:	pho (part of larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21546
2019-09-10 19:08:01 +00:00
Jeff Roberson
4cdea4a853 Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state.  The object lock is still used as an interlock
to ensure that the identity stays valid.  Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by:	alc, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21255
2019-09-10 18:27:45 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Konstantin Belousov
0e79619e1e vm_object_deallocate(): Remove no longer needed code.
We track text mappings explicitly, there is no removal of the text
refs on the object deallocate any more, so tmpfs objects should not be
treated specially. Doing so causes excess deref.

Reported and tested by:	gallatin
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 16:01:45 +00:00
Konstantin Belousov
4f49b435c8 vm_object_coalesce(): avoid extending any nosplit objects, not only
that which back tmpfs nodes.

Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 15:58:48 +00:00
Konstantin Belousov
bf5661f4a1 madvise(MADV_FREE): Quick fix to time rewind.
Don't free pages in a shadowing object.  While this degrades MADV_FREE
to a no-op (and we could, instead, choose to fall back to
MADV_DONTNEED, at the cost of changing pmap_madvise), this is
presently considered a temporary fix. We may prefer to risk a little
fragmentation of the map by creating a zero/OBJT_DEFAULT entry over
top of the existing object and, simultaneously, revert to the existing
marking any pages in the former shadowing object in the advised region
as reclaimable.  At least one consumer of MADV_FREE (snmalloc) may use
mmap() to construct zeroed pages "eventually" here anyway, so the
fragmentation may be coming anyway.

Submitted by:	Nathaniel Filardo <nwf20@cl.cam.ac.uk>
PR:	240061
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21517
2019-09-04 20:28:16 +00:00
Kyle Evans
fe7bcbaf50 vm pager: writemapping accounting for OBJT_SWAP
Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.

Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.

The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21456
2019-09-03 20:31:48 +00:00
Konstantin Belousov
fe69291ff4 Add procctl(PROC_STACKGAP_CTL)
It allows a process to request that stack gap was not applied to its
stacks, retroactively.  Also it is possible to control the gaps in the
process after exec.

PR:	239894
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21352
2019-09-03 18:56:25 +00:00
Mark Johnston
7cdeaf3309 Add preliminary support for atomic updates of per-page queue state.
Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page.  Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.

Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon.  In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks.  However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required.  vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.

Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset().  Use it in the active queue scan, which
does not use the object lock.  Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21257
2019-09-03 14:29:58 +00:00
Mark Johnston
9d75f0dc75 Map the vm_page array into KVA on amd64.
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose.  One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.

To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21491
2019-09-03 13:18:51 +00:00
Mark Johnston
08cfa56ea3 Extend uma_reclaim() to permit different reclamation targets.
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure.  This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets.  Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure.  In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim().  When trimming, only
items in excess of the estimate are reclaimed.  Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them.  As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by:	jeff
Tested by:	pho (an earlier version)
MFC after:	2 months
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D16667
2019-09-01 22:22:43 +00:00
Konstantin Belousov
6470c8d3db Rework v_object lifecycle for vnodes.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode.  Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM().  This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation.  Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:50:25 +00:00
Mateusz Guzik
2614bd9699 vm: only lock tmpfs vnode shared in vm_object_deallocate
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21455
2019-08-28 19:28:27 +00:00
Mark Johnston
b5d239cb97 Wire pages in vm_page_grab() when appropriate.
uiomove_object_page() and exec_map_first_page() would previously wire a
page after having grabbed it.  Ask vm_page_grab() to perform the wiring
instead: this removes some redundant code, and is cheaper in the case
where the requested page is not resident since the page allocator can be
asked to initialize the page as wired, whereas a separate vm_page_wire()
call requires the page lock.

In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of
vm_page_unwire(PQ_NONE).  The latter ensures that the page is dequeued
before returning, but this is unnecessary since vm_page_free() will
trigger a batched dequeue of the page.

Reviewed by:	alc, kib
Tested by:	pho (part of a larger patch)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21440
2019-08-28 16:08:06 +00:00
Mark Johnston
a70e17eeca Fix a few nits in vm_pqbatch_process_page().
- Don't bother masking off non-queue state flags when loading the
  page's atomic state, since it is only required for one of the
  function's assertions.  Update the assertion instead.
- Remove an incorrect comment regarding synchronization with the
  page daemon.  The page daemon only ever checks for PGA_ENQUEUED
  with the page queue lock held.
- When clearing requeue flags, only clear the flags that have been
  acted upon.

Reviewed by:	kib (previous version)
Discussed with:	alc
Tested by:	pho (part of a larger patch)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21368
2019-08-26 20:20:10 +00:00
Mark Johnston
b48d4efe75 Handle UMA_ANYDOMAIN in kstack_import().
The kernel thread stack zone performs first-touch allocations by
default, and must handle the case where the local memory domain
is empty.  For most UMA zones this is handled in the keg layer,
but cache zones currently must implement a policy for this case.
Simply use a round-robin policy if UMA_ANYDOMAIN is passed.

Reported and tested by:	bcran
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
2019-08-25 21:14:46 +00:00
Konstantin Belousov
783a68aa33 Move OBJT_VNODE specific code from vm_object_terminate() to
vnode_destroy_vobject().

Reviewed by:	alc, jeff (previous version), markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21357
2019-08-25 13:26:06 +00:00
Doug Moore
83ea714f4f vm_map_simplify_entry considers merging an entry with its two
neighbors, and is used in a way so that if entries a and b cannot be
merged, we consider them twice, first not-merging a with its successor
b, and then not-merging b with its predecessor a. This change replaces
vm_map_simplify_entry with vm_map_try_merge_entries, which compares
two adjacent entries only, and uses it to avoid duplicated
merge-checks.

Tested by: pho
Reviewed by: alc
Approved by: markj (implicit)
Differential Revision: https://reviews.freebsd.org/D20814
2019-08-25 07:06:51 +00:00
Konstantin Belousov
a7751d328a Make stack grow use the same gap as stack create.
Store stack_guard_page * PAGE_SIZE into the gap->next_read field at
the time of the stack creation.  This makes the used guard size
consistent between stack creation and stack grow time.

Suggested by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21384
2019-08-24 14:29:13 +00:00
Mateusz Guzik
5b596b9fa5 Remove the obsolete pcpu_zone_ptr zone.
It was only used by flowtable (removed in r321618).

Sponsored by:	The FreeBSD Foundation
2019-08-24 00:01:19 +00:00
Mark Johnston
f93670b7b9 Stop clearing page flags in vm_page_pqbatch_submit().
All existing callers guarantee that the page does not have a
pre-existing dequeue pending.  Thus, if the page is dequeued before
pqbatch_submit() acquires the page queue lock, we do not need to do
anything since vm_page_dequeue_complete() takes care of clearing all
page queue state flags for us.

With this change, vm_page_pqbatch_submit() has the nice property that it
does not directly modify any fields in the page structure.

Reviewed by:	alc, kib
Tested by:	pho (part of a larger change)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21372
2019-08-23 19:53:11 +00:00
Mark Johnston
386eba08bd Make vm_pqbatch_submit_page() externally visible.
It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure.  Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent.  No functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21369
2019-08-23 19:49:29 +00:00
Mark Johnston
8b90607f20 Simplify vm_page_dequeue() and fix an assertion.
- Add a vm_pagequeue_remove() function to physically remove a page
  from its queue and update the queue length.
- Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue()
  return NULL for dequeued pages.
- Avoid unnecessarily reloading the queue index if vm_page_dequeue()
  loses a race with a concurrent queue operation.
- Correct an always-true assertion: vm_page_dequeue() may be called
  from the page allocator with the page unlocked.  The assertion
  m->order == VM_NFREEORDER simply tests whether the page has been
  removed from the vm_phys free lists; instead, check whether the
  page belongs to an object.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21341
2019-08-21 16:11:12 +00:00
Mark Johnston
acad79e66f Unconditionally enable debug.vm_lowmem.
It is useful for testing purposes to be able to drain UMA caches, so
do not limit the sysctl to DIAGNOSTIC kernels.

MFC after:	1 week
Sponsored by:	Netflix
2019-08-21 16:01:17 +00:00
Mark Johnston
930b195263 Don't requeue active pages in vm_swapout_object_deactivate_pages().
As of r332974 the page daemon does not requeue pages during a scan
of the active queue, so there is not much value in doing so here
either.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21343
2019-08-21 15:52:10 +00:00
Jeff Roberson
cf27e0d125 Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by:	markj
Tested by:	pho (as part of a larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21311
2019-08-19 23:09:38 +00:00
Jeff Roberson
4153054a7c Permit vm_pager_has_page() to run with a shared lock. Introduce
VM_OBJECT_DROP/VM_OBJECT_PICKUP to handle functions that are called with
uncertain lock state.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21310
2019-08-19 22:25:28 +00:00
Jeff Roberson
3e5e1b5135 Allocate amd64's page array using pages and page directory pages from the
NUMA domain that the pages describe.  Patch original from gallatin.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21252
2019-08-18 23:07:56 +00:00
Konstantin Belousov
bb9e2184f0 Change locking requirements for VOP_UNSET_TEXT().
Require the vnode to be locked for the VOP_UNSET_TEXT() call.  This
will be used by the following bug fix for a tmpfs issue.

Tested by:	sbruno, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-18 20:24:52 +00:00
Jeff Roberson
3921068f1e Remove unnecessary debugging from r351181 that caused powerpc build to fail.
Tested by:	make universe TARGETS=powerpc
2019-08-18 08:07:31 +00:00
Jeff Roberson
be3f5f298b vm_phys_avail_find is only used on NUMA kernels. Fix a build error. 2019-08-18 07:43:15 +00:00
Jeff Roberson
b7565d44df Encapsulate phys_avail manipulation in a set of simple routines. Add a
NUMA aware boot time memory allocator that will be used to allocate early
domain correct structures.  Code partially submitted by gallatin.

Reviewed by:	gallatin, kib
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21251
2019-08-18 07:06:31 +00:00
Aleksandr Rybalko
6b821a7455 Check paddr for overflow.
Fix panic on initialize of "vm reserv" per-superpage lock in case when RAM ends at upper boundary of address space.
Observed on ARM32 board BPI-R2 (2GB RAM 0x80000000-0xffffffff).

PR:		235362
Reviewed by:	kib, markj, alc
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D21272
2019-08-16 19:27:05 +00:00
Konstantin Belousov
245139c69d Fix OOM handling of some corner cases.
In addition to pagedaemon initiating OOM, also do it from the
vm_fault() internals.  Namely, if the thread waits for a free page to
satisfy page fault some preconfigured amount of time, trigger OOM.
These triggers are rate-limited, due to a usual case of several
threads of the same multi-threaded process to enter fault handler
simultaneously.  The faults from pagedaemon threads participate in the
calculation of OOM rate, but are not under the limit.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13671
2019-08-16 09:43:49 +00:00
Jeff Roberson
2194393787 Move phys_avail definition into MI code. It is consumed in the MI layer and
doing so adds more flexibility with less redundant code.

Reviewed by:	jhb, markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21250
2019-08-16 00:45:14 +00:00
Doug Moore
504f5e294e swap_pager.c reserves 2 blocks for a bsd label. Change that 2 to the
expression howmany(BBSIZE, PAGE_SIZE), where BBSIZE is the size of the
boot block area.  That can be less than 2 if PAGE_SIZE is big.

swapon(8) has an option to trim (delete) all the blocks of a device at
startup.  However, if the first of those blocks is a bsd label, then
trimming those blocks is destructive.  Change swapon to leave the
first BBSIZE bytes untrimmed.

Update manual pages to reflect changes in how swapon and how it may be
used, espeically in association with savecore.

Reviewed by: alc
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21191
2019-08-15 02:30:44 +00:00
Konstantin Belousov
10ae16c7fe Fix stack grow for init.
During early stages of kern_exec(), including strings copyout,
p_textvp for init is NULL.  This prevented stack grow from working for
init execution.

Without stack gap enabled, initial stack segment size is enough for
strings passed by kernel to init.  With the gap enabled, the used
address might fall out of the initial segment, which kills init.

Exclude initproc from the check for contexts which should not cause
stack grow in the target map.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-08 16:48:19 +00:00
Jeff Roberson
0b26119b21 Cache kernel stacks in UMA. This gives us NUMA support, better concurrency,
and more statistics.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20931
2019-08-06 23:15:34 +00:00
Jeff Roberson
eda1b01647 Implement a MINBUCKET zone flag so we can use minimal caching on zones that
may be expensive to cache.

Reviewed by:	markj, kib
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D20930
2019-08-06 23:04:59 +00:00
Jeff Roberson
c168508655 Add two new kernel options to control memory locality on NUMA hardware.
- UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that
   was freed on a different domain from where it was allocated.  This is
   only used for UMA_ZONE_NUMA (first-touch) zones.
 - UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
   zones.  This tries to maintain locality for kernel memory.

Reviewed by:	gallatin, alc, kib
Tested by:	pho, gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20929
2019-08-06 21:50:34 +00:00
Mark Johnston
98549e2dc6 Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page.  Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O.  This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20986
2019-07-29 22:01:28 +00:00
Doug Moore
23612f0df3 In swap_pager_putpages, move the initialization of a free-blocks
counter, and the final freeing of freed swap blocks, outside the
region where an object lock is held.  Correct some style(9) and
spelling errors.  Change a panic() to a KASSERT().  Change a boolean_t
to a bool.

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D21093
2019-07-28 19:32:23 +00:00
Mark Johnston
b16e57a6c9 Rename vm_page_{import,release}() to vm_page_zone_{import,release}().
I would like to use the name vm_page_release() for a different purpose,
and vm_page_{import,release}() are local to vm_page.c.

Reviewed by:	kib
MFC after:	1 week
2019-07-20 18:25:41 +00:00
Doug Moore
312df2c1dd Define vm_map_entry_in_transition to handle an in-transition map
entry, combining code currently in vm_map_unwire and
vm_map_wire_locked into a single function, called by each of them for
entries in transition.

Discussed with: kib, markj
Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20833
2019-07-19 20:47:35 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Mark Johnston
46736e306c Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set.
Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case.  Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.

Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20859
2019-07-08 19:02:40 +00:00
Mark Johnston
d9a73522e3 Add a per-CPU page cache per VM free pool.
Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.

Reviewed by:	dougm, kib
Discussed with:	alc, jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20858
2019-07-08 18:56:30 +00:00
Doug Moore
7b9bcad939 A style-related change, r349791, made unclear the meaning of a
comment. Rewrite that comment to improve its clarity.

Reported by: cem
Reviewed by: alc, cem
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20871
2019-07-07 06:57:04 +00:00
Doug Moore
0cab71bcee Fix style(9) violations involving division by PAGE_SIZE.
Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20847
2019-07-06 15:55:16 +00:00
Doug Moore
31c82722c1 Change blist_next_leaf_alloc so that it can examine more than one leaf
after the one where the possible block allocation begins, and allocate
a larger number of blocks than the current limit. This does not affect
the limit on minimum allocation size, which still cannot exceed
BLIST_MAX_ALLOC.

Use this change to modify swp_pager_getswapspace and its callers, so
that they can allocate more than BLIST_MAX_ALLOC blocks if they are
available.

Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20579
2019-07-06 06:15:03 +00:00
Doug Moore
56948d177e Based on work posted at https://reviews.freebsd.org/D13484, change
swap_pager_swapoff_object and swp_pager_force_pagein so that they can
page in multiple pages at a time to a swap device, rather than doing
one I/O operation for each page.

Tested by: pho
Submitted by: ota_j.email.ne.jp (Yoshihiro Ota)
Reviewed by: alc, markj, kib
Approved by: kib, markj (mentors)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20635
2019-07-05 16:49:34 +00:00
Doug Moore
d2860f22a4 Move an assignment, drop a label, and change gotos to break statements
in vm_map_unwire. The code generated on amd86 is unchanged.

Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20850
2019-07-04 19:25:30 +00:00
Doug Moore
b71f9b0de6 Replace a 'goto' with an 'else' in vm_map_wire_locked.
Reviewed by: alc
Approved by: markj (mentor)
Differential Revision:	https://reviews.freebsd.org/D20855
2019-07-04 19:17:55 +00:00
Doug Moore
9a0cdf9440 Change boolean_t variables in vm_map_unwire and vm_map_wire_locked to
bool. Drop result variable. Add holes_ok bool to replace repeated
masking of flags parameter.

Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20846
2019-07-04 19:12:13 +00:00
Doug Moore
723413be0c Drop a temp variable from vm_map_insert, with no effect on the
resulting amd64 machine code.

Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20849
2019-07-04 18:28:49 +00:00
Doug Moore
38e220e8df Eliminate a goto and a label in vm_map_wire_locked by inserting an 'else'.
Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20845
2019-07-03 22:41:54 +00:00
Ed Maste
b93a053ca2 correct pmap_ts_referenced return type
pmap_ts_referenced returns a count, not a boolean, and is supposed to
have int as the return type not boolean_t.

This worked previously because boolean_t is an int typedef.

Discussed with:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-07-03 19:59:56 +00:00
Mark Johnston
d70f0ab38d Cache the next queue element when traversing a page queue.
When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element
invalidates its queue linkage pointers.  vm_pageout_collect_batch()
was relying on these pointers remaining valid after a removal, so
modify it to fetch the next queued page before dequeuing the current
page.

Submitted by:	Don Morris <dgmorris@earthlink.net>
Reviewed by:	cem, vangyzen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20842
2019-07-03 18:46:39 +00:00
Mark Johnston
9f74cdbf78 Mark pages allocated from the per-CPU cache.
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase.  In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break.  Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.

The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.

Reviewed by:	alc
Discussed with:	jeff
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20763
2019-07-02 19:51:40 +00:00
Konstantin Belousov
5dc7e31a09 Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement.  Provide procctl(2) verbs to set and query
implicit PROTMAX handling.  The knobs mimic the same per-image flag
and per-process controls for ASLR.

Reviewed by:	emaste, markj (previous version)
Discussed with:	brooks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:07:17 +00:00
Konstantin Belousov
3730695151 Use traditional 'p' local to designate td->td_proc in kern_mmap.
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:01:14 +00:00
Doug Moore
5201cbabf5 Remove a call to vm_map_simplify_entry from _vm_map_clip_start.
Recent changes to vm_map_protect have made it unnecessary.

Reviewed by: alc
Approved by: kib (mentor)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20633
2019-06-30 02:08:13 +00:00
Doug Moore
a72dce340d If vm_map_protect fails with KERN_RESOURCE_SHORTAGE, be sure to
simplify modified entries before returning.

Reviewed by: alc, markj (earlier version), kib (earlier version)
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20753
2019-06-28 02:14:54 +00:00
Mark Johnston
0fd977b3fa Add a return value to vm_page_remove().
Use it to indicate whether the page may be safely freed following
its removal from the object.  Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20758
2019-06-26 17:37:51 +00:00
Doug Moore
d1d3f7e1d1 Revert r349393, which leads to an assertion failure on bootup, in vm_map_stack_locked.
Reported by: ler@lerctr.org
Approved by: kib, markj (mentors, implicit)
2019-06-26 03:12:57 +00:00
Doug Moore
52499d1739 Eliminate some uses of the prev and next fields of vm_map_entry_t.
Since the only caller to vm_map_splay is vm_map_lookup_entry, move the
implementation of vm_map_splay into vm_map_lookup_helper, called by
vm_map_lookup_entry.

vm_map_lookup_entry returns the greatest entry less than or equal to a
given address, but in many cases the caller wants the least entry
greater than or equal to the address and uses the next pointer to get
to it. Provide an alternative interface to lookup,
vm_map_lookup_entry_ge, to provide the latter behavior, and let
callers use one or the other rather than having them use the next
pointer after a lookup miss to get what they really want.

In vm_map_growstack, the caller wants an entry that includes a given
address, and either the preceding or next entry depending on the value
of eflags in the first entry. Incorporate that behavior into
vm_map_lookup_helper, the function that implements all of these
lookups.

Eliminate some temporary variables used with vm_map_lookup_entry, but
inessential.

Reviewed by: markj (earlier version)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20664
2019-06-25 20:25:16 +00:00
Doug Moore
18cd8bb800 vm_map_protect may return an INVALID_ARGUMENT or PROTECTION_FAILURE
error response after clipping the first map entry in the region to be
reserved. This creates a pair of matching entries that should have
been "simplified" back into one, or never created. This change defers
the clipping of that entry until those two vm_map_protect failure
cases have been ruled out.

Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20711
2019-06-25 07:44:37 +00:00
Brooks Davis
74a1b66cf4 Extend mmap/mprotect API to specify the max page protections.
A new macro PROT_MAX() alters a protection value so it can be OR'd with
a regular protection value to specify the maximum permissions.  If
present, these flags specify the maximum permissions.

While these flags are non-portable, they can be used in portable code
with simple ifdefs to expand PROT_MAX() to 0.

This change allows (e.g.) a region that must be writable during run-time
linking or JIT code generation to be made permanently read+execute after
writes are complete.  This complements W^X protections allowing more
precise control by the programmer.

This change alters mprotect argument checking and returns an error when
unhandled protection flags are set.  This differs from POSIX (in that
POSIX only specifies an error), but is the documented behavior on Linux
and more closely matches historical mmap behavior.

In addition to explicit setting of the maximum permissions, an
experimental sysctl vm.imply_prot_max causes mmap to assume that the
initial permissions requested should be the maximum when the sysctl is
set to 1.  PROT_NONE mappings are excluded from this for compatibility
with rtld and other consumers that use such mappings to reserve
address space before mapping contents into part of the reservation.  A
final version this is expected to provide per-binary and per-process
opt-in/out options and this sysctl will go away in its current form.
As such it is undocumented.

Reviewed by:	emaste, kib (prior version), markj
Additional suggestions from:	alc
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18880
2019-06-20 18:24:16 +00:00
Mark Johnston
ee1f168540 Group vm_page_activate()'s definition with other related functions.
No functional change intended.

MFC after:	3 days
2019-06-19 21:36:00 +00:00
Doug Moore
4766eba1df Critical comments were lost in r349203. This patch seeks to restore
the lost information in new comments.

Reported by: alc
Reviewed by: alc
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20632
2019-06-15 04:30:13 +00:00
Doug Moore
771315283b Avoid using the prev field of vm_map_entry_t in two functions that
iterate over consecutive vm_map entries, and that can easily just
'remember' the prev value instead of looking it up.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20628
2019-06-14 03:15:54 +00:00
Doug Moore
af1d6d6a11 Create a function for creating objects to back map entries, and one
for giving cred to a map entry backed by an object, and use them
instead of the code duplicated inline now.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20370
2019-06-13 20:09:07 +00:00
Doug Moore
e65d58a0fe To test to see if a free space is big enough compare the required
length to the difference of the two offsets that define the gap, to
avoid overflow, rather that adding the length to an offset and
comparing that to another offset.

This addresses an overflow issue reported by Peter Holm on i386.

Reported by: pho
Tested by: pho
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20594
2019-06-11 22:41:39 +00:00
Doug Moore
f8c8b2e8a0 r348879 introduced a wrong-way comparison that broke mmap.
This change rights that comparison.

Reported by: pho
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20595
2019-06-10 22:06:40 +00:00
Doug Moore
5a0879da80 The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.

Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, to find that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value, or the current max_free
value is zero, because in either case we know that the value of max_free for
the other child is the value we already have. So, the changes to
vm_entry_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.

The corresponding change to vm_splay_entry_merge makes it simpler, since it's
just reversing pointers and updating running maxima.

In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.

Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
Doug Moore
77555b849d Change the check for 'size' wrapping around to zero in kern_mmap to account
for both the lower and upper bound modifications. Change the error returned
to ENOMEM. Rename the parameter size to len and make size a local variable
that stores the value of len after it has been modified.

This addresses concerns expressed by Bruce Evans after r348843.

Reported by: brde@optusnet.com.au
Reviewed by: kib, markj (mentors)
MFC after: 3 days
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20592
2019-06-10 21:26:14 +00:00
John Baldwin
0b96ca3310 Remove an overly-aggressive assertion.
While it is true that the new vmspace passed to vmspace_switch_aio
will always have a valid reference due to the AIO job or the extra
reference on the original vmspace in the worker thread, it is not true
that the old vmspace being switched away from will have more than one
reference.

Specifically, when a process with queued AIO jobs exits, the exit hook
in aio_proc_rundown will only ensure that all of the AIO jobs have
completed or been cancelled.  However, the last AIO job might have
completed and woken up the exiting process before the worker thread
servicing that job has switched back to its original vmspace.  In that
case, the process might finish exiting dropping its reference to the
vmspace before the worker thread resulting in the worker thread
dropping the last reference.

Reported by:	np
Reviewed by:	alc, markj, np, imp
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D20542
2019-06-10 19:01:54 +00:00
Doug Moore
97220a279f There are times when a len==0 parameter to mmap is okay. But on a
32-bit machine, a len parameter just a few bytes short of 4G, rounded
up to a page boundary and hitting zero then, is not okay. Return
failure in that case.

Reported by: pho
Reviewed by: alc, kib (mentor)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20580
2019-06-10 03:07:10 +00:00
Konstantin Belousov
452a2db863 Style MAP_ENTRY_ and MAP_ definitions.
Spell all bits in the hex constants.
Since all lines are modified, consistently use <tab> after #define.

Reviewed by:	alc (previous version), dougm
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D20560
2019-06-08 20:28:04 +00:00
Doug Moore
7c022327ab Simple code refactoring originally in D13484.
Extract swp_pager_force_dirty() and swp_pager_force_launder() out of
swp_pager_force_pagein().

Extract swap_pager_swapoff_object() out of swap_pager_swapoff().

Submitted by: ota_j.email.ne.jp
Reviewed by: alc, dougm
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20545
2019-06-08 17:49:17 +00:00
Mark Johnston
88ea538a98 Replace uses of vm_page_unwire(m, PQ_NONE) with vm_page_unwire_noq(m).
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone.  But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place.  No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20470
2019-06-07 18:23:29 +00:00
Alexander Motin
3b2f2cb8e9 Allow UMA hash tables to expand faster then 2x in 20 seconds.
ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables.
With initial hash table size of only 32 elements it takes ~20 expansions
or ~400 seconds to adapt to handling 220GB ZFS ARC.  During that time not
only the hash table is highly inefficient, but also each of those expan-
sions takes significant time with the lock held, blocking operation.

On my test system with 256GB of RAM and ZFS pool of 28 HDDs this change
reduces time needed to first time read 240GB from ~300-400s, during which
system is quite busy and unresponsive, to only ~150s with light CPU load
and just 5 sub-second CPU spikes to expand the hash table.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-06 23:57:28 +00:00
Doug Moore
f96e8a0bab The means of finding ranges of free pages was changed for
vm_reserv_break in r348484, and there was found to improve performance
minutely and reduce code size. This change applies a similar change to
vm_reserv_reclaim_config, expecting similar benefits. This change also
allows quick rejection of page ranges that are unsuitable on account
of alignment or boundary issues, where those issues are processed a
page at a time in the current implementation.  For contrived test
cases, this can make finding a reservation satisfying a major
alignment requirement around 30 times faster.

Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20274
2019-06-06 16:28:34 +00:00