4329 Commits

Author SHA1 Message Date
Mark Johnston
5cff1f4dc3 Introduce vm_page_astate.
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state.  The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields.  No functional change intended.

Reviewed by:	alc, jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22650
2019-12-10 18:14:50 +00:00
Doug Moore
7887f00d2b Revert r355505. The code that it allowed to compile has been removed. 2019-12-09 05:09:46 +00:00
Doug Moore
8b75b1ad0d Define a vm_map method for user-space for advancing from a map entry
to its successor in cases where examining a map entry requires a
helper like kvm_read_all.  Use that method, with kvm_read_all, to fix
procstat_getfiles_kvm, which tries to find the successor now without
using such a helper.  This addresses a problem introduced by r355491.

Reviewed by: markj (previous version)
Discussed with: kib
Differential Revision: https://reviews.freebsd.org/D22728
2019-12-08 22:33:51 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Jeff Roberson
3b490537f4 Fix two problems with r355149. The sysctl name collision code assumed that
zones would never be freed.  In the case of tmpfs this was not true.  While
here test for the right bit to disable the keg related sysctls for zones
that don't have kegs.

Reported by:	pho
Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22655
2019-12-08 01:55:23 +00:00
Jeff Roberson
cff8481de4 It is safe to wire a page while the object is busy.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22636
2019-12-08 01:49:53 +00:00
Jeff Roberson
2306558c54 It is now safe to rename a page that is still on a queue. Allowing this
is necessary for a forthcoming patch.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22636
2019-12-08 01:49:03 +00:00
Jeff Roberson
d8ad7b7d6a Do not assert that the object lock is held in vm_object_set_writeable_dirty.
A valid reference is all that is required.  If we race with a deallocation
we will harmlessly misidentify the type of an already dead object.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22636
2019-12-08 01:47:29 +00:00
Jeff Roberson
fb1d575ceb Reduce duplication in grab functions by providing allocflags based inlines.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22635
2019-12-08 01:16:22 +00:00
Jeff Roberson
1e0701e1e5 Use a variant slab structure for offpage zones. This saves space in
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22609
2019-12-08 01:15:06 +00:00
Mark Johnston
c0829bb1d6 Add casts required by the 32-bit build after r355491. 2019-12-08 00:02:36 +00:00
Mark Johnston
3098cd73a3 Provide vm_map_entry traversal routines to userspace.
This is required for now to allow libprocstat to compile.

Discussed with:	dougm
2019-12-07 19:36:40 +00:00
Mateusz Guzik
91caa9b8c1 vm: fix sysctl vm.kstack_cache_size change report
Cache gets resized correctly, but sysctl reports the wrong number:
# sysctl vm.kstack_cache_size=512
vm.kstack_cache_size: 128 -> 128

patched:
vm.kstack_cache_size: 128 -> 512

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22717
Fixes:	r355002 "Revise the page cache size policy."
2019-12-07 17:28:41 +00:00
Doug Moore
c1ad5342a6 Remove the next and prev fields from vm_map_entry, to save a bit of
space.  Where the vm_map tree now has null pointers, store pointers to
next and previous entries in right and left fields, making the binary
tree threaded.  Have the predecessor and successor functions compute
what the prev and next fields previously stored.

Reviewed by: markj, kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D21964
2019-12-07 17:14:33 +00:00
Justin Hibbits
caef3e1280 powerpc/pmap: NUMA-ize vm_page_array on powerpc
Summary:
This matches r351198 from amd64.  This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similar to
existing.  Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains.  On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain.  After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short.  This means
some inner domains may have pages accounted in earlier domains.

On Book-E the page array is setup at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit.  This reduces the TLB0
overhead for touching the vm_page_array, which reduces up to one TLB
miss per array access.

Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.

Reviewed by:	luporl
Differential Revision:	https://reviews.freebsd.org/D21449
2019-12-07 03:34:03 +00:00
Mark Johnston
a6f21d15dd Fix fault_type handling in vm_map_lookup().
Suppose that the map entry is wired, so that we later assign
fault_type = entry->protection.  Suppose further that we jump back to
RetryLookup.  Then fault_type will no longer contain the original
fault protection mask, but instead that of the wired entry.

Submitted by:	Wuyang Chung <wuyang.chung1@gmail.com>
Reviewed by:	kib
MFC after:	3 days
Github PR:	https://github.com/freebsd/freebsd/pull/419
Differential Revision:	https://reviews.freebsd.org/D22683
2019-12-06 23:39:08 +00:00
Mark Johnston
ed2f945a39 Fix an off-by-one error in vm_map_pmap_enter().
If the starting pindex is equal to object->size, there is nothing to do.
This was harmless since the rest of vm_map_pmap_enter() has no effect
when psize == 0.

Submitted by:	Wuyang Chung <wuyang.chung1@gmail.com>
Reviewed by:	alc, dougm, kib
MFC after:	1 week
Github PR:	https://github.com/freebsd/freebsd/pull/417
Differential Revision:	https://reviews.freebsd.org/D22678
2019-12-04 19:46:48 +00:00
Andrew Turner
b75c4efcd2 Fix the signature for zone_import and zone_release
These are cast to uma_import and uma_release functions. Use the signature
for these in the zone functions.

This was found with an experimental Kernel CFI. It will complain if the
signature is different than what a function pointer expects. The
simplest way to fix these is to correct the signature.

Reviewed by:	rlibby
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22671
2019-12-04 18:40:05 +00:00
Jeff Roberson
9b78b1f433 Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by:	markj, rlibby (prior version)
Differential Revision:	https://reviews.freebsd.org/D22584
2019-12-02 22:44:34 +00:00
Jeff Roberson
0f9e06e18b Fix a few places that free a page from an object without busy held. This is
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.

Reviewed by:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22611
2019-12-02 22:42:05 +00:00
Konstantin Belousov
67388836f3 Store the bottom of the shadow chain in OBJ_ANON object->handle member.
The handle value is stable for all shadow objects in the inheritance
chain.  This allows to avoid descending the shadow chain to get to the
bottom of it in vm_map_entry_set_vnode_text(), and eliminate
corresponding object relocking which appeared to be contending.

Change vm_object_allocate_anon() and vm_object_shadow() to handle more
of the cred/charge initialization for the new shadow object, in
addition to set up the handle.

Reported by:	jeff
Reviewed by:	alc (previous version), jeff (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differrential revision:	https://reviews.freebsd.org/D22541
2019-12-01 20:43:04 +00:00
Jeff Roberson
886b90219a Restore swap space accounting for non-anonymous swap objects. This was
broken in r355082.  Reduce some locking in nearby related object type
checks.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22565
2019-11-29 19:57:49 +00:00
Jeff Roberson
f2410510db Avoid acquiring the object lock if color is already set. It can not be
unset until the object is recycled so this check is stable.  Now that we
can acquire the ref without a lock it is not necessary to group these
operations and we can avoid it entirely in many cases.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22565
2019-11-29 19:49:20 +00:00
Jeff Roberson
26c4e9831b Fix a perf regression from r355122. We can use a shared lock to drop the
last ref on vnodes.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22565
2019-11-29 19:47:40 +00:00
Jeff Roberson
6d6a03d7a8 Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22563
2019-11-29 03:14:10 +00:00
Doug Moore
85b7bedb15 Functions that call vm_map_splay_merge sometimes set data fields
(e.g. root->left = NULL) to affect the behavior of that function. This
change stops that data manipulation, and instead calls a pair of
functions, one for the left direction and the other for the right,
with the function called depending whether or not we currently null
the root child in that direction to control the behavior of
vm_map_splay_merge.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22589
2019-11-29 02:06:45 +00:00
Jeff Roberson
584061b480 Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab.  Remove some nearby
dead code.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22564
2019-11-28 07:49:25 +00:00
Ryan Libby
35ec24f362 uma: move sysctl vm.uma defn out from under INVARIANTS
Fix non-INVARIANTS builds after r355149.

Reported by:	Michael Butler <imb@protected-networks.net>
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22588
2019-11-28 04:15:16 +00:00
Jeff Roberson
20a4e15451 Implement a sysctl tree for uma zones to assist in debugging and provide
more statistcs than are exported via the ABI stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22554
2019-11-28 00:19:09 +00:00
Jeff Roberson
0a81b4395e Refactor uma_zfree_arg into several functions to make control flow more
clear and icache usage cleaner.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22491
2019-11-27 23:19:06 +00:00
Doug Moore
1867d2f2e9 Inline some splay helper functions to improve performance on a
micro-benchmark.

Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22544
2019-11-27 21:00:44 +00:00
Ryan Libby
ca293436d1 uma: trash memory when ctor/dtor supplied too
On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function.  Now, it only
requires that uminit and fini be NULL; now, the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.

Also do a little refactoring for readability of the resulting logic.

This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20722
2019-11-27 19:49:55 +00:00
Jeff Roberson
a67d540832 Use atomics in more cases for object references. We now can completely
omit the object lock if we are above a certain threshold.  Hold only a
single vnode reference when the vnode object has any ref > 0.  This
allows us to only lock the object and vnode on 0-1 and 1-0 transitions.

Differential Revision:	https://reviews.freebsd.org/D22452
2019-11-27 00:39:23 +00:00
Jeff Roberson
beb8beef81 Refactor uma_zalloc_arg(). It is a mess of gotos and code which doesn't
make sense after many partial refactors.  Attempt to make a smaller cache
footprint for the fast path.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22470
2019-11-26 22:17:02 +00:00
Ryan Libby
6a14746c01 vm_object_collapse_scan_wait: drop locks before reacquiring
Regression from r352174.  In the vm_page_rename() failure case we forgot
to unlock the vm object locks before sleeping and reacquiring them.

Reviewed by:	jeff
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22542
2019-11-25 07:38:31 +00:00
Jeff Roberson
4d987866e6 Move anonymous object copying for fork into its own routine and so that we
can avoid locking non-anonymous objects.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D22472
2019-11-25 07:13:05 +00:00
Doug Moore
2767c9f36a Where 'current' is used to index over vm_map entries, use
'entry'. Where 'entry' is used to identify the starting point for
iteration, use 'first_entry'. These are the naming conventions used in
most of the vm_map.c code.  Where VM_MAP_ENTRY_FOREACH can be used, do
so. Squeeze a few lines to fit in 80 columns.  Where lines are being
modified for these reasons, look to remove style(9) violations.

Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D22458
2019-11-25 02:19:47 +00:00
Konstantin Belousov
3236244936 Ignore object->handle for OBJ_ANON objects.
Note that the change in vm_object_collapse() is arguably a correctness
fix.  We must not collapse into content-identity carrying objects.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22467
2019-11-24 19:18:12 +00:00
Konstantin Belousov
b631c36f0d Record part of the owner struct thread pointer into busy_lock.
Record as much bits from curthread into busy_lock as fits.  Low bits
for struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except
statically allocated thread0).  Upper bits are not very interesting
for assert, and in most practical situations recorded value should
allow to manually identify the owner with certainity.

Assert that unbusy is performed by the owner, except few places where
unbusy is done in io completion handler.  For this case, add
_unchecked variants of asserts and unbusy primitives.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22298
2019-11-24 19:12:23 +00:00
Mark Johnston
9c770a27ce Simplify vm_pageout_init_domain() and add a "big picture" comment.
Stop subtracting 1024/200 from vmd_page_count/200.  I cannot see how
such precise accounting can make a difference on modern systems.

Add some explanation of what the page daemon does and how it handles
memory shortages.

Reviewed by:	dougm
Discussed with:	jeff, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22396
2019-11-22 16:31:43 +00:00
Mark Johnston
8fc2550837 Reclaim memory from UMA if the page daemon is struggling.
Use the UMA reclaim thread to asynchronously drain all caches if
there is a severe shortage in a domain.  Otherwise we only trigger UMA
reclamation every 10s even when the system has completely run out of
memory.

Stop entirely draining the caches when one domain falls below its min
threshold.  In some workloads it is normal for one NUMA domain to end
up being nearly depleted by kernel memory allocations, for example for
the ZFS ARC.  The domainset iterators skip domains below the
vmd_min_free theshold on the first iteration, so we should allow that
mechanism to limit further depletion of the domain's free pages before
taking the extreme step of calling uma_reclaim(UMA_RECLAIM_DRAIN_CPU).

Discussed with:	jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22395
2019-11-22 16:31:30 +00:00
Mark Johnston
bf0d60af92 Update the checks in vm_page_zone_import().
- Remove the cnt == 1 check.  UMA passes cnt == 1 when it has disabled
  per-CPU caching.  In this case we might as well just allocate a single
  page and return it to the caller, since the caller is going to do
  exactly that anyway if the UMA cache allocation attempt fails.
- Don't replenish caches if the domain is severely short on free pages.
  With large buckets we may otherwise quickly exacerbate a situation
  where the page daemon is failing to keep up.
- Don't replenish caches if the calling thread belongs to the page
  daemon, which should avoid creating extra memory pressure when it is
  trying to free memory.  Virtually all such allocations while occur in
  the context of laundering, where the laundry thread must allocate
  slabs for various swap and I/O-related UMA zones.

Reviewed by:	kib
Discussed with:	alc, jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22394
2019-11-22 16:31:10 +00:00
Mark Johnston
003cf08ba9 Revise the page cache size policy.
In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU.  This was to mitigate some
issues reported with the system not able to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache.  This change re-enables those caches.

The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones.  Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested.  Remove its return value since it has no use.

Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages.  The limit can be overridden by the
vm.pgcache_zone_max tunable as before.

Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET.  This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.

Reviewed by:	gallatin, jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22393
2019-11-22 16:30:47 +00:00
Mark Johnston
b378d29687 Fix locking in vm_reserv_reclaim_contig().
We were not properly handling the case where the trylock of the
reservaton fails, in which case we could leak reservation lock.

Introduce a marker reservation to implement precise scanning in
vm_reserv_reclaim_contig().  Before, a race could result in early
termination of the scan in rare situations.  Use the marker's lock to
serialize scans of the partpop queue so that a global marker structure
can be used.  Modify vm_reserv_reclaim_inactive() to handle the presence
of a marker while minimizing the hold time of domain-global locks.

Reviewed by:	alc, jeff, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22392
2019-11-22 16:28:52 +00:00
Andrew Turner
09a65f9ff5 As with r354905 use uint16_t to store aflags on the stack and as function
arguments as the aflags size in vm_page_t has increased.

Sponsored by:	DARPA, AFRL
2019-11-20 18:00:43 +00:00
Andrew Turner
ad216bc10d Use atomic_load_16 to load aflags as it's a uint16_t after r354820.
Sponsored by:	DARPA, AFRL
2019-11-20 17:49:58 +00:00
Doug Moore
83704cc236 Instead of looking up a predecessor or successor to the current map
entry, when that entry has been seen already, keep the
already-looked-up value in a variable and use that instead of looking
it up again.

Approved by: alc, markj (earlier version), kib (earlier version)
Differential Revision: https://reviews.freebsd.org/D22348
2019-11-20 16:06:48 +00:00
Jeff Roberson
71353f7a2f When we set OFFPAGE to limit fragmentation we should also set VTOSLAB
so that we avoid the hashtables.  The hashtable is now only required if
a zone is created with OFFPAGE specified initially, not internally.  This
flag signals to UMA that it can't touch the allocated memory and so
can't store a slab pointer in the containing page.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22453
2019-11-20 01:57:33 +00:00
Jeff Roberson
51b867e56b Only keep anonymous objects on shadow lists. This eliminates locking of
globally visible objects when they are part of a backing chain.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22423
2019-11-20 00:31:14 +00:00
Jeff Roberson
7f935055d3 Remove unnecessary object locking from the vnode pager. Recent changes to
busy/valid/dirty locking make these acquires redundant.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22186
2019-11-19 23:30:09 +00:00