Commit Graph

4377 Commits

Ryan Libby
27ca37acb7 uma: grow slabs to enforce minimum memory efficiency
Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1
page size + epsilon).  In order to achieve a minimum memory efficiency,
select a slab size with a potentially larger number of pages if it
yields a lower portion of waste.

This may mean using page_alloc instead of uma_small_alloc, which could
be more costly.

Discussed with:	jeff, mckusick
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23239
2020-02-04 22:40:34 +00:00
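
Below, a minimal userland sketch of the waste calculation this commit describes; PAGE_SIZE, the page cap, and the "1/2 page size + epsilon" item size are illustrative stand-ins, not UMA's internals:

    #include <stdio.h>

    #define PAGE_SIZE 4096
    #define MAX_PAGES 8                     /* illustrative cap on slab size */

    /* Fraction of a slab wasted for a given item size and page count. */
    static double
    slab_waste(size_t item_size, int pages)
    {
        size_t slab = (size_t)pages * PAGE_SIZE;
        size_t items = slab / item_size;

        return (double)(slab - items * item_size) / slab;
    }

    int
    main(void)
    {
        size_t item = PAGE_SIZE / 2 + 64;   /* "1/2 page size + epsilon" */

        for (int pages = 1; pages <= MAX_PAGES; pages++)
            printf("%d page(s): %4.1f%% waste\n",
                pages, slab_waste(item, pages) * 100.0);
        return (0);
    }

Growing the slab from one page (about 48% waste for this item size) to two or more pages is exactly the trade the commit makes against the possibly costlier page_alloc path.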
Ryan Libby
ec0d828071 uma: add UMA_ZONE_CONTIG, and a default contig_alloc
For now, copy the mbuf allocator.

Reviewed by:	jeff, markj (previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23237
2020-02-04 22:40:11 +00:00
Ryan Libby
5ba16cf3d7 uma: pcpu_page_free needs to startup_free pages from startup_alloc
After r357392, it is apparent that we do have some early-boot PCPU
zones.  Make it so we can safely free pages from them if they are
actually used during early boot.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23496
2020-02-04 22:39:58 +00:00
Jeff Roberson
ee9e43f8dd Add an explicit busy state for free pages. This improves behavior with
potential bugs that access freed pages as well as providing a path
towards lockless page lookup.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23444
2020-02-04 20:33:01 +00:00
Jeff Roberson
e84130a0c0 Use literal bucket sizes for smaller buckets rather than the rounding
system.  Small bucket sizes already pack well even if they are an odd
number of words.  This prevents any potential new instances of the
problem fixed in r357463 as well as making the system easier to
understand.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23494
2020-02-04 20:28:06 +00:00
Konstantin Belousov
8d34a3bf7d Enable vm_object_mightbedirty() and vm_object_page_clean() for swap
objects backing tmpfs vnodes data.

The clean scan is limited to only remove write permissions from the
mapped pages of the objects.  This fixes the issue that tmpfs vnode
mtime is not updated from writes to the mmaped area after the initial
page-in.

Noted by:	mjg
Reviewed by:	markj
Discussed with:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23432
2020-02-04 19:03:37 +00:00
Jeff Roberson
dc3915c8c6 Use STAILQ instead of TAILQ for bucket lists. We only need FIFO behavior
and this is more space efficient.

Stop queueing recently used buckets to the head of the list.  If the bucket
goes to a different processor the cache coherency will be more expensive.
We already try to encourage cache-hot behavior in the per-cpu layer.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D23493
2020-02-04 02:41:24 +00:00
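
The space argument is concrete: an STAILQ_ENTRY is one pointer where a TAILQ_ENTRY is two, and FIFO needs only insert-at-tail and remove-at-head. A self-contained sketch using <sys/queue.h> as found on FreeBSD (names here are illustrative):

    #include <sys/queue.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct bucket {
        STAILQ_ENTRY(bucket) link;          /* one pointer, vs. two for TAILQ */
        int id;
    };

    STAILQ_HEAD(bucketlist, bucket);

    int
    main(void)
    {
        struct bucketlist list = STAILQ_HEAD_INITIALIZER(list);
        struct bucket *b;

        for (int i = 0; i < 3; i++) {
            b = malloc(sizeof(*b));
            b->id = i;
            STAILQ_INSERT_TAIL(&list, b, link);     /* enqueue */
        }
        while ((b = STAILQ_FIRST(&list)) != NULL) { /* dequeue, FIFO order */
            STAILQ_REMOVE_HEAD(&list, link);
            printf("bucket %d\n", b->id);
            free(b);
        }
        return (0);
    }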
Mark Johnston
36cb95c736 Disable the smallest UMA bucket size on 32-bit platforms.
With r357314, sizeof(struct uma_bucket) grew to 16 bytes on 32-bit
platforms, so BUCKET_SIZE(4) is 0.  This resulted in the creation of a
bucket zone for buckets with zero capacity.  A more general fix is
planned, but for now this bandaid allows 32-bit platforms to boot again.

PR:		243837
Discussed with:	jeff
Reported by:	pho, Jenkins via lwhsu
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-02-03 19:29:02 +00:00
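
The arithmetic behind the bandaid, sketched; this is a reconstruction of the capacity calculation, not UMA's actual BUCKET_SIZE macro:

    #include <stdio.h>

    /* Slots left in a bucket sized for n pointers once the header is
     * carved out. */
    static int
    bucket_capacity(size_t ptrsize, size_t hdrsize, int n)
    {
        return (int)((ptrsize * n - hdrsize) / ptrsize);
    }

    int
    main(void)
    {
        /* 32-bit: 4-byte pointers, 16-byte header: no room for items. */
        printf("32-bit BUCKET_SIZE(4) ~ %d\n", bucket_capacity(4, 16, 4));
        /* 64-bit: 8-byte pointers, same header: two items fit. */
        printf("64-bit BUCKET_SIZE(4) ~ %d\n", bucket_capacity(8, 16, 4));
        return (0);
    }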
Warner Losh
58aa35d429 Remove sparc64 kernel support
Remove all sparc64 specific files
Remove all sparc64 ifdefs
Remove indirect sparc64 ifdefs
2020-02-03 17:35:11 +00:00
Mateusz Guzik
f1fa1ba3d0 Fix up various vnode-related asserts which did not dump the used vnode 2020-02-03 14:25:32 +00:00
Jeff Roberson
f96d4157a7 Fix a bug in r356776 where the page allocator was not properly restored to
the percpu page allocator after it had been temporarily overridden by
startup_alloc.

Reported by:	pho, bdragon
2020-02-01 23:46:30 +00:00
Mark Johnston
f0a273c00f Remove a couple of lingering usages of the page lock.
Update vm_page_scan_contig() and vm_page_reclaim_run() to stop using
vm_page_change_lock().  It has no use after r356157.  Remove
vm_page_change_lock() now that it has no users.

Remove an unnecessary check for wirings in vm_page_scan_contig(), which
was previously checking twice.  The check is racy until
vm_page_reclaim_run() ensures that the page is unmapped, so one check is
sufficient.

Reviewed by:	jeff, kib (previous versions)
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D23279
2020-02-01 18:23:51 +00:00
Mateusz Guzik
643656cfaf vfs: replace VOP_MARKATIME with VOP_MMAPPED
The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23422
2020-02-01 06:46:55 +00:00
Jeff Roberson
9e47b34110 Fix LINT build with MEMGUARD. 2020-01-31 02:03:22 +00:00
Jeff Roberson
d4665eaa66 Implement a safe memory reclamation feature that is tightly coupled with UMA.
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm.  This has 3x the performance of epoch in a write-heavy
workload with less than half of the read side cost.  The memory overhead
is significantly lessened by limiting the free-to-use latency.  A synthetic
test uses 1/20th of the memory vs Epoch.  There is significant further
discussion in the comments and code review.

This code should be considered experimental.  I will write a man page after
it has settled.  After further validation the VM will begin using this
feature to permit lockless page lookups.

Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures.  I will commit a stress testing
tool in a follow-up.

Reviewed by:	mmacy, markj, rlibby, hselasky
Discussed with:	sbahara
Differential Revision:	https://reviews.freebsd.org/D22586
2020-01-31 00:49:51 +00:00
Konstantin Belousov
b70f6e1513 Restore OOM logic on page fault after r357026.
Right now OOM is initiated unconditionally on the page allocation
failure, after the wait.

Reported by:	Mark Millard <marklmi@yahoo.com>
Reviewed by:	cy, markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D23409
2020-01-29 12:02:47 +00:00
Konstantin Belousov
cd0047f3a9 Handle a race of collapse with a retrying fault.
Both vm_object_scan_all_shadowed() and vm_object_collapse_scan() might
observe an invalid page left in the default backing object by the
fault handler that retried.  Check for the condition and refuse to collapse.

Reported and tested by:	pho
Reviewed by:	jeff
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D23331
2020-01-24 19:42:53 +00:00
Doug Moore
c7b23459b2 Most uses of vm_map_clip_start follow a call to vm_map_lookup. Define
an inline function vm_map_lookup_clip_start that invokes them both and
use it in places that invoke both. Drop a couple of local variables
made unnecessary by this function.

Reviewed by:	markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22987
2020-01-24 07:48:11 +00:00
Mark Johnston
e6bd3a812d vm_map_submap(): Avoid unnecessary clipping.
A submap can only be created from an entry spanning the entire request
range.  In particular, if vm_map_lookup_entry() returns false or the
returned entry does not contain "end", the request can fail before any
clipping is done.

Since the only use of submaps in FreeBSD is for the static pipe and
execve argument KVA maps, this has no functional effect.

Github PR:	https://github.com/freebsd/freebsd/pull/420
Submitted by:	Wuyang Chung <wuyang.chung1@gmail.com> (original)
Reviewed by:	dougm, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23299
2020-01-23 16:45:10 +00:00
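
A toy model of the short-circuit, with invented types: a submap can replace only a single entry covering the whole of [start, end), so the lookup result alone decides whether clipping could ever help.

    #include <stdbool.h>
    #include <stdio.h>

    struct map_entry { unsigned long start, end; };

    /* Clip only when the request can possibly succeed. */
    static bool
    submap_possible(bool found, const struct map_entry *e,
        unsigned long start, unsigned long end)
    {
        return (found && e->start <= start && e->end >= end);
    }

    int
    main(void)
    {
        struct map_entry e = { 0x1000, 0x5000 };

        printf("%d\n", submap_possible(true, &e, 0x2000, 0x4000)); /* 1 */
        printf("%d\n", submap_possible(true, &e, 0x2000, 0x6000)); /* 0: fail, no clip */
        return (0);
    }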
Jeff Roberson
fb4d37eac1 (fault 9/9) Move zero fill into a dedicated function to make the object lock
state more clear.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23326
2020-01-23 05:23:37 +00:00
Jeff Roberson
be9d4fd6b4 (fault 8/9) Restructure some code to reduce duplication and simplify flow
control.

Reviewed by:	dougm, kib, markj
Differential Revision:	https://reviews.freebsd.org/D23321
2020-01-23 05:22:02 +00:00
Jeff Roberson
df794f5caf (fault 7/9) Move fault population and allocation into a dedicated function
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23320
2020-01-23 05:19:39 +00:00
Jeff Roberson
5909dafea9 (fault 6/9) Move getpages and associated logic into a dedicated function.
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23311
2020-01-23 05:18:00 +00:00
Jeff Roberson
91eb2e908f (fault 5/9) Move the backing_object traversal into a dedicated function.
Reviewed by:	dougm, kib, markj
Differential Revision:	https://reviews.freebsd.org/D23310
2020-01-23 05:14:41 +00:00
Jeff Roberson
5936b6a8f1 (fault 4/9) Move copy-on-write into a dedicated function.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23304
2020-01-23 05:11:01 +00:00
Jeff Roberson
fcb0475833 (fault 3/9) Move map relookup into a dedicated function.
Add a new VM return code, KERN_RESTART, which means: deallocate and
restart the fault.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23303
2020-01-23 05:07:01 +00:00
Jeff Roberson
c308a3a6c9 (fault 2/9) Move map lookup into a dedicated function.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23302
2020-01-23 05:05:39 +00:00
Jeff Roberson
2c2f4413cc (fault 1/9) Move a handful of stack variables into the faultstate.
This additionally fixes a potential bug/pessimization where we could fail to
reload the original fault_type on restart.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23301
2020-01-23 05:03:34 +00:00
Ryan Libby
8d1c459ae5 uma: fix zone domain overlaying pcpu cache with disabled cpus
UMA zone structures have two arrays at the end which are sized according
to the machine: an array of CPU count length, and an array of NUMA
domain count length.  The CPU counting was wrong in the case where some
CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the
second array to be overlaid with the first.

Reported by:	olivier
Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23318
2020-01-23 04:56:38 +00:00
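
Back-of-the-envelope for the overlay, with assumed structure sizes (the layout below is illustrative, not struct uma_zone):

    #include <stdio.h>

    /* Two variable-length arrays follow the zone's fixed fields.  The
     * per-CPU array must be sized by the highest CPU id (mp_maxid + 1),
     * not the count of enabled CPUs (mp_ncpus), or the domain array
     * that follows it lands inside the CPU array's tail. */
    int
    main(void)
    {
        int mp_ncpus = 6, mp_maxid = 7;     /* two CPUs disabled */
        size_t fixed = 256, cache_sz = 64;  /* assumed sizes */
        size_t wrong = fixed + cache_sz * mp_ncpus;
        size_t right = fixed + cache_sz * (mp_maxid + 1);

        printf("domain array offset: %zu (wrong) vs %zu (right); "
            "%zu bytes overlaid\n", wrong, right, right - wrong);
        return (0);
    }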
Ryan Libby
7e2406774e uma: report leaks more accurately
Previously UMA had some false negatives in the leak report at keg
destruction time, where it only reported leaks if there were free items
in the slab layer (rather than allocated items), which notably would not
be true for single-item slabs (large items).  Now, report a leak if
there are any allocated pages, and calculate and report the number of
allocated items rather than free items.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23275
2020-01-23 04:56:34 +00:00
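
A toy illustration of the false negative for single-item slabs (the numbers are invented):

    #include <stdio.h>

    /* A one-item-per-slab keg never holds free items, so testing "free
     * items > 0" reports nothing even with pages still allocated;
     * counting allocated items does not miss. */
    int
    main(void)
    {
        int pages = 3, items_per_slab = 1, free_items = 0;
        int allocated = pages * items_per_slab - free_items;

        printf("old check: %d leaked item(s) reported\n", free_items);
        printf("new check: %d leaked item(s) from %d page(s)\n",
            allocated, pages);
        return (0);
    }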
Jeff Roberson
91e31c3c08 Consistently use busy and vm_page_valid() rather than touching page bits
directly.  This improves API compliance, asserts, etc.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23283
2020-01-23 04:54:49 +00:00
Jeff Roberson
530cc6a25d Some architectures with DMAP still consume boot kva. Simplify the test for
claiming kva in uma_startup2() to handle this.

Reported by:	bdragon
2020-01-23 03:37:35 +00:00
Jeff Roberson
5949b1ca8c Move readahead and dropbehind fault functionality into a helper routine for
clarity.

Reviewed by:	dougm, kib, markj
Differential Revision:	https://reviews.freebsd.org/D23282
2020-01-21 00:12:57 +00:00
Jeff Roberson
1e40fe41c5 Reduce object locking in vm_fault. Once we have an exclusively busied page we
no longer need an object lock.  This reduces the longest hold times and
eliminates some trylock code blocks.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23034
2020-01-20 22:49:52 +00:00
Jeff Roberson
d6e13f3b4d Don't hold the object lock while calling getpages.
The vnode pager does not want the object lock held.  Moving this out allows
further object lock scope reduction in callers.  While here add some missing
paging in progress calls and an assert.  The object handle is now protected
explicitly with pip.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23033
2020-01-19 23:47:32 +00:00
Jeff Roberson
9c83ff2d86 It has not been possible to recursively terminate a vnode object for some time
now.  Eliminate the dead code that supports it.

Approved by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22908
2020-01-19 18:36:03 +00:00
Jeff Roberson
98087a066f Make collapse synchronization more explicit and allow it to complete during
paging.

Shadow objects are marked with a COLLAPSING flag while they are collapsing with
their backing object.  This gives us an explicit test rather than overloading
paging-in-progress.  While a split is ongoing we mark an object with SPLIT.
These two operations will modify the swap tree so they must be serialized
and swap_pager_getpages() can now directly detect these conditions and page
more conservatively.

Callers to vm_object_collapse() now will reliably wait for a collapse to finish
so that the backing chain is as short as possible before other decisions are
made that may inflate the object chain.  For example, split, coalesce, etc.
It is now safe to run fault concurrently with collapse.  It is safe to increase
or decrease paging in progress with no lock so long as there is another valid
ref on increase.

This change makes collapse more reliable as a secondary benefit.  The primary
benefit is making it safe to drop the object lock much earlier in fault or
never acquire it at all.

This was tested with a new shadow chain test script that uncovered long
standing bugs and will be integrated with stress2.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22908
2020-01-19 18:30:23 +00:00
Andrew Gallatin
2052680238 pcpu_page_alloc: guard against empty NUMA domains
Some systems, such as higher-end Threadripper, may have
NUMA domains with no physical memory.  Don't allocate
from these domains.

This fixes a "panic: vm_wait in early boot" on my 2990WX desktop

Reviewed by:	jeff
Sponsored by:	Netflix
2020-01-18 18:25:37 +00:00
Jeff Roberson
5844774900 Fix a long standing bug that was made worse in r355765. When we are cowing a
page that was previously mapped read-only it exists in pmap until pmap_enter()
returns.  However, we held no reference to the original page after the copy
was complete.  This allowed vm_object_scan_all_shadowed() to collapse an
object that still had pages mapped.  To resolve this, add another page pointer
to the faultstate so we can keep the page xbusy until we're done with
pmap_enter().  Handle busy pages in scan_all_shadowed.  This is already done
in vm_object_collapse_scan().

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23155
2020-01-17 03:44:04 +00:00
Jeff Roberson
a81c400e75 Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed.  After the KVA allocator has started up we allocate the KVA that
we consumed during boot.  This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.

Parts of this patch were written and tested by markj.

Reviewed by:	glebius, markj
Differential Revision:	https://reviews.freebsd.org/D23102
2020-01-16 05:01:21 +00:00
Alexander Motin
ace409ce9c Restore loop break in vm_pageout_lowmem().
r355004 removed the return statement from this loop with the intention
of also calling uma_reclaim_wakeup().  But in the case of
vm.lowmem_period=0 it causes an infinite loop.

Reviewed by:	markj
Sponsored by:	iXsystems, Inc.
2020-01-14 03:27:57 +00:00
Ryan Libby
9b8db4d0a0 uma: split slabzone into two sizes
By allowing more items per slab, we can improve memory efficiency for
small allocs.  If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory.  So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab).  The
practical effect should be reduced memory usage for counter(9).

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23149
2020-01-14 02:14:15 +00:00
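
Illustrative arithmetic for the split (512 items per slab is from the commit; the 256-item contrast case is assumed):

    #include <stdio.h>

    #define PAGE_SIZE 4096

    int
    main(void)
    {
        long item = 8, fit = PAGE_SIZE / item;      /* 512 items fit */
        long caps[] = { 256, 512 };                 /* bitmap capacities */

        for (int i = 0; i < 2; i++) {
            long used = caps[i] < fit ? caps[i] : fit;

            printf("bitmap capacity %3ld: %3.0f%% of the page usable\n",
                caps[i], 100.0 * used * item / PAGE_SIZE);
        }
        return (0);
    }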
Ryan Libby
e63a1c2f52 uma: fixup some ktr messages
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23148
2020-01-14 02:13:46 +00:00
Mateusz Guzik
a314aba874 vm: add missing CLTFLAG_MPSAFE annotations
This covers all vm/* files.
2020-01-12 05:08:57 +00:00
Gleb Smirnoff
9328cbc047 Always multiple vm.pgcache_zone_max to number of CPUs, and rename it
respectively.  The tunable controls how big is the size of per-cpu
vm page cache.  Previously the value was split for all CPUs in system,
so configuring same value on machines with different count of CPUs
yielded in different cache size available to a particular CPU.

Reviewed by:	markj
Obtained from:	Netflix
2020-01-10 19:32:08 +00:00
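
The difference, sketched with hypothetical tunable values: the old system-wide budget shrinks per CPU as the machine grows, while the renamed per-CPU tunable stays constant.

    #include <stdio.h>

    int
    main(void)
    {
        long total_max = 98304;     /* hypothetical old system-wide value */
        long per_cpu_max = 1536;    /* hypothetical new per-CPU value */

        for (int ncpu = 4; ncpu <= 64; ncpu *= 4)
            printf("%2d CPUs: old %5ld pages/CPU, new %5ld pages/CPU\n",
                ncpu, total_max / ncpu, per_cpu_max);
        return (0);
    }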
Mark Johnston
860bb7a04c UMA: Don't destroy zones after the system shutdown process starts.
Some kernel subsystems, notably ZFS, will destroy UMA zones from a
shutdown eventhandler.  This causes the zone to be drained.  For slabs
that are mapped into KVA this can be very expensive and so it needlessly
delays the shutdown process.

Add a new state to the "booted" variable, BOOT_SHUTDOWN.  Once
kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
into a no-op, provided that the zone does not have a custom finalization
routine.

PR:		242427
Reviewed by:	jeff, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23066
2020-01-09 19:17:42 +00:00
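
A hedged model of the guard; BOOT_SHUTDOWN is the state named by the commit, everything else below is invented for illustration:

    #include <stddef.h>
    #include <stdio.h>

    enum boot_state { BOOT_COLD, BOOT_RUNNING, BOOT_SHUTDOWN };
    static enum boot_state booted = BOOT_RUNNING;

    struct zone { const char *name; void (*fini)(struct zone *); };

    static void
    zone_destroy(struct zone *z)
    {
        /* Skip the expensive drain once shutdown has begun, unless the
         * zone has a finalization routine that must still run. */
        if (booted >= BOOT_SHUTDOWN && z->fini == NULL) {
            printf("%s: destroy skipped during shutdown\n", z->name);
            return;
        }
        printf("%s: drained and freed\n", z->name);
    }

    int
    main(void)
    {
        struct zone z = { "example", NULL };

        zone_destroy(&z);           /* drained */
        booted = BOOT_SHUTDOWN;
        zone_destroy(&z);           /* skipped */
        return (0);
    }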
Ryan Libby
4a8b575c6b uma: unify layout paths and improve efficiency
Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
 - using the padding of the final item to store the slab header,
 - not going OFFPAGE if we have a choice unless it improves efficiency.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23048
2020-01-09 02:03:17 +00:00
Ryan Libby
54c5ae804f uma: reorganize flags
 - Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
 - Move flag VTOSLAB from public to private.
 - Introduce public NOTPAGE flag and make HASH private.
 - Introduce public NOTOUCH flag and make OFFPAGE private.
 - Update man page.

The net effect of this should be to make the contract with clients more
clear.  Clients should choose constraints, UMA will figure out how to
implement them.  This also breaks the confusing double meaning of
OFFPAGE.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23016
2020-01-09 02:03:03 +00:00
Jeff Roberson
79c9f9429a Fix uma boot pages calculations on NUMA machines that also don't have
MD_UMA_SMALL_ALLOC.  This is unusual but not impossible.  Fix the alignemnt
of zones while here.  This was already correct because uz_cpu strongly
aligned the zone structure but the specified alignment did not match
reality and involved redundant defines.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D23046
2020-01-06 02:51:19 +00:00
Jeff Roberson
bfb6b7a121 The fix in r356353 was insufficient. Not every architecture returns 0 for
EARLY_COUNTER.  Only amd64 seems to.

Suggested by:	markj
Reported by:	lwhsu
Reviewed by:	markj
PR:		243117
2020-01-05 22:54:25 +00:00
Kyle Evans
2180f6c6f1 kern_mmap: restore character deleted in transit
Pointy hat to:	kevans
X-MFC-With:	r356359
2020-01-04 23:51:44 +00:00
Kyle Evans
18348a2369 kern_mmap: add a variant that allows caller to inspect fp
Linux mmap rejects mmap() on a write-only file with EACCES.
linux_mmap_common currently does a fun dance to grab the fp associated with
the passed in fd, validates it, then drops the reference and calls into
kern_mmap(). Doing so is perhaps both fragile and premature; there's still
plenty of chance for the request to get rejected with a more appropriate
error, and it's prone to a race where the file we ultimately mmap has
changed after it drops its reference.

This change alleviates the need to do this by providing a kern_mmap variant
that allows the caller to inspect the fp just before calling into the fileop
layer. The callback takes flags, prot, and maxprot as one could imagine
scenarios where any of these, in conjunction with the file itself, may
influence a caller's decision.

The file type check in the linux compat layer has been removed; EINVAL is
seemingly not an appropriate response to the file not being a vnode or
device. The fileop layer will reject the operation with ENODEV if it's not
supported, which more closely matches the common linux description of
mmap(2) return values.

If we discover that we're allowing an mmap() on a file type that Linux
normally wouldn't, we should restrict those explicitly.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22977
2020-01-04 23:39:58 +00:00
Jeff Roberson
31c251a046 Fix an assertion introduced in r356348. On architectures without
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert.  Resolve this by rewriting the COLD asserts to
look at the per-cpu allocation counts for evidence of api activity.

Discussed with:	rlibby
Reviewed by:	markj
Reported by:	lwhsu
2020-01-04 19:29:25 +00:00
Jeff Roberson
dfe13344f5 UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN.  The system will now select a default depending
on kernel configuration.  API users need only specify one if they want to
override the default.

Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA.  XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.

Reviewed by:	markj
Discussed with:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22831
2020-01-04 18:48:13 +00:00
Jeff Roberson
91d947bfbe Sort cross-domain frees into per-domain buckets before inserting these
onto their respective bucket lists.  This is a several-orders-of-magnitude
reduction in contention on the keg lock under heavy free traffic, while
requiring only an additional bucket's worth of memory per domain.

Discussed with:		markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22830
2020-01-04 07:56:28 +00:00
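
A userland sketch of the batching idea: stage frees in a per-domain bucket and take the contended lock only once per full bucket. Sizes and names below are illustrative.

    #include <stdio.h>

    #define NDOMAINS   2
    #define BUCKET_CAP 4

    struct bucket { void *items[BUCKET_CAP]; int cnt; };
    static struct bucket staging[NDOMAINS];

    static void
    flush(int dom)
    {
        /* In the real allocator this is the one point that takes the
         * per-domain lock and inserts the full bucket on its list. */
        printf("domain %d: flushed %d item(s)\n", dom, staging[dom].cnt);
        staging[dom].cnt = 0;
    }

    static void
    free_item(void *item, int dom)
    {
        staging[dom].items[staging[dom].cnt++] = item;
        if (staging[dom].cnt == BUCKET_CAP)
            flush(dom);
    }

    int
    main(void)
    {
        int data[10];

        for (int i = 0; i < 10; i++)
            free_item(&data[i], i % NDOMAINS);
        for (int d = 0; d < NDOMAINS; d++)
            if (staging[d].cnt > 0)
                flush(d);
        return (0);
    }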
Jeff Roberson
8b987a7769 Use per-domain keg locks. This provides both a lock and separate space
accounting for each NUMA domain.  Independent keg domain locks are important
with cross-domain frees.  Hashed zones are non-NUMA and use a single keg
lock to protect the hash table.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22829
2020-01-04 03:30:08 +00:00
Jeff Roberson
727c691857 Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer.  Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22828
2020-01-04 03:15:34 +00:00
Jeff Roberson
4bd61e19a2 Use atomics for the zone limit and sleeper count. This relies on the
sleepq to serialize sleepers.  This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups.  It resolves a missing wakeup
in one case but otherwise should be bug-for-bug compatible.  In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.

Discussed with:	markj
Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22827
2020-01-04 03:04:46 +00:00
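
A minimal model of taking items against the limit with a compare-and-set loop rather than a lock; this uses C11 stdatomic in userland, not the kernel's atomic(9), and omits the sleep/wakeup side the commit retains:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static _Atomic unsigned zone_items;
    static unsigned zone_max = 100;

    static bool
    zone_alloc_limit(unsigned n)
    {
        unsigned old = atomic_load(&zone_items);

        do {
            if (old + n > zone_max)
                return false;       /* caller sleeps or fails */
        } while (!atomic_compare_exchange_weak(&zone_items, &old, old + n));
        return true;
    }

    int
    main(void)
    {
        printf("%d\n", zone_alloc_limit(90));   /* 1 */
        printf("%d\n", zone_alloc_limit(20));   /* 0: over the limit */
        return (0);
    }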
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Mark Johnston
f7607c300b Clear queue operation flags when migrating a page to another queue.
The page daemon loops may move pages back to the active queue if
references are detected.  In this case we must take care to clear
existing queue operation flags.  In particular, PGA_REQUEUE_HEAD may be
set, and that flag is only valid if the page belongs to the inactive
queue.

Also fix a bug in the active queue scan where we were updating "old"
instead of "new".  This would only have been hit in rare cases where the
page moved out of the active queue after the beginning of the scan.

Reported by:	Bob Prohaska, Idwer Vollering
Tested by:	Idwer Vollering
Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D23001
2020-01-02 19:26:04 +00:00
Doug Moore
668a8aa83b The map-entry clipping functions modify start and end entries of an
entry in the vm_map, making invariants related to the max_free entry
field invalid. Move the clipping work into vm_map_entry_link, so that
linking is okay when the new entry clips a current entry, and the
vm_map doesn't have to be briefly corrupted. Change assertions and
conditions in SPLAY_{LEFT,RIGHT}_STEP since the max_free invariants
can now be trusted in all cases.

Tested by:	pho
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D22897
2019-12-31 22:20:54 +00:00
Mark Johnston
758b2c02bb Restore a vm_page_wired() check in vm_page_mvqueue() after r356156.
We now set PGA_DEQUEUE on a managed page when it is wired after
allocation, and vm_page_mvqueue() ignores pages with this flag set,
ensuring that they do not end up in the page queues.  However, this is
not sufficient for managed fictitious pages or pages managed by the
TTM.  In particular, the TTM makes use of the plinks.q queue linkage
fields for its own purposes.

PR:	242961
Reported and tested by:	Greg V <greg@unrelenting.technology>
2019-12-29 20:01:03 +00:00
Mark Johnston
9b888dd9bd Clear queue op flags in vm_page_mvqueue().
This fixes a regression in r356155, introduced at the last minute.  In
particular, we must clear PGA_REQUEUE_HEAD before inserting into any
queue besides PQ_INACTIVE since that operation is implemented only for
PQ_INACTIVE.

Reported by:	pho, Jenkins via lwhsu
2019-12-29 15:39:43 +00:00
Mark Johnston
727150ff03 Remove some unused functions.
The previous series of patches orphaned some vm_page functions, so
remove them.

Reviewed by:	dougm, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22886
2019-12-28 19:04:29 +00:00
Mark Johnston
dc71caa037 Update the vm_page.h block comment to reflect recent changes.
Explain the new locking rules for per-page queue state updates.

Reviewed by:	jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22884
2019-12-28 19:04:15 +00:00
Mark Johnston
9f5632e6c8 Remove page locking for queue operations.
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page.  It is also no longer needed in
the page queue scans.  This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22885
2019-12-28 19:04:00 +00:00
Mark Johnston
b7f30bff2f Generalize lazy dequeue logic for wired pages.
Some recent work aims to remove the use of the page lock for
synchronizing updates to page queue state.  This change adds a mechanism
to preserve the existing behaviour of lazily dequeuing wired pages,
which was previously synchronized using the page lock.

Handle this by setting PGA_DEQUEUE when a managed page's wire count
transitions from 0 to 1.  When the page daemon encounters a page with a
flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that
page, but in so doing it does not modify the page itself and thus racing
with a concurrent free of the page is harmless.  The flag is advisory;
the page daemon still checks for wirings after acquiring the object and
page xbusy locks.

vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition.
It must do this before dropping the reference to avoid a use-after-free
but also handles races with concurrent wirings to ensure that
PGA_DEQUEUE is not left unset on a wired page.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22882
2019-12-28 19:03:46 +00:00
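
A non-atomic toy model of the transitions described above; the real code uses atomics and rechecks wirings under the object and page busy locks:

    #include <stdio.h>

    #define PGA_DEQUEUE 0x1

    struct page { int wire_count; unsigned aflags; };

    static void
    page_wire(struct page *p)
    {
        if (p->wire_count++ == 0)
            p->aflags |= PGA_DEQUEUE;   /* advisory: page daemon skips it */
    }

    static void
    page_unwire(struct page *p)
    {
        if (--p->wire_count == 0)
            p->aflags &= ~PGA_DEQUEUE;  /* may be queued again */
    }

    int
    main(void)
    {
        struct page p = { 0, 0 };

        page_wire(&p);
        printf("wired:   aflags=%#x\n", p.aflags);
        page_unwire(&p);
        printf("unwired: aflags=%#x\n", p.aflags);
        return (0);
    }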
Mark Johnston
f3f38e2580 Start implementing queue state updates using fcmpset loops.
This is in preparation for eliminating the use of the vm_page lock for
protecting queue state operations.

Introduce the vm_page_pqstate_commit_*() functions.  These functions act
as helpers around vm_page_astate_fcmpset() and are specialized for
specific types of operations.  vm_page_pqstate_commit() wraps these
functions.

Convert a number of routines to use these new helpers.  Use
vm_page_release_toq() in vm_page_unwire() and vm_page_release() to
atomically release a wiring reference and release the page into a queue.
This has the side effect that vm_page_unwire() will leave the page in
the active queue if it is already present there.

Convert the page queue scans to use the new helpers.  Simplify
vm_pageout_reinsert_inactive(), which requeues pages that were found to
be busy during an inactive queue scan, to avoid duplicating the work of
vm_pqbatch_process_page().  In particular, if PGA_REQUEUE or
PGA_REQUEUE_HEAD is set, let that be handled during batch processing.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22770
Differential Revision:	https://reviews.freebsd.org/D22771
Differential Revision:	https://reviews.freebsd.org/D22772
Differential Revision:	https://reviews.freebsd.org/D22773
Differential Revision:	https://reviews.freebsd.org/D22776
2019-12-28 19:03:32 +00:00
Mark Johnston
3c01c56b0e Don't update per-page activation counts in the swapout code.
This avoids duplicating the work of the page daemon's active queue scan.
Moreover, this duplication was inconsistent:
- PGA_REFERENCED is not counted in act_count unless pmap_ts_referenced()
  returned 0, but the page daemon always counts PGA_REFERENCED towards
  the activation count.
- The swapout daemon always activates a referenced page, but the page
  daemon only does so when the containing object is mapped at least
  once.

The main purpose of swapout_deactivate_pages() is to shrink the number
of pages mapped into a given pmap.  To do this without unmapping active
pages, use the non-destructive pmap_is_referenced() instead of the
destructive pmap_ts_referenced() and deactivate pages accordingly.
This simplifies some future changes to the locking protocol for page
queue state.

Reviewed by:	kib
Discussed with:	jeff
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22674
2019-12-28 19:03:17 +00:00
Konstantin Belousov
df8db6ddb9 vm_object_shadow(): fix object reference leak.
In r355270 by me, vm_object_shadow() was changed to handle the
reference counting for the shared case, but the extra reference that
was done in vmspace_fork() for the shared/need_copy case was not
removed.

Submitted by:	jeff
2019-12-28 16:40:44 +00:00
Mark Johnston
5541eb27d6 Remove some stale comments from the page allocator.
Since r352110 the page lock is not required to wire pages in any
context.
2019-12-27 23:19:21 +00:00
Jeff Roberson
ff5ce8a7a5 Fix a pair of bugs introduced in r356002. When we reclaim physical pages we
allocate them with VM_ALLOC_NOOBJ which means they are not busy.  For now
move the busy assert for the new page in vm_page_replace into the public
api and out of the private api used by contig reclaim.  Fix another issue
where we would leak busy if the page could not be removed from pmap.

Reported by:	pho
Discussed with:	markj
2019-12-27 01:50:16 +00:00
Jeff Roberson
cc7ce83ae0 Further reduce the cacheline footprint of fast allocations by duplicating
the zone size and flags fields in the per-cpu caches.  This allows fast
allocations to proceed touching only the single per-cpu cacheline and
simplifies the common case when no ctor/dtor is specified.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22826
2019-12-25 20:57:24 +00:00
Jeff Roberson
376b1ba394 Optimize fast path allocations by storing bucket headers in the per-cpu
cache area.  This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22825
2019-12-25 20:50:53 +00:00
Jeff Roberson
3639ac42e5 Fix a bug with NUMA domains introduced in r339686.  When M_NOWAIT is
specified there was no loop termination condition in keg_fetch_slab().

Reported by:	pho
Reviewed by:	markj
2019-12-25 19:26:35 +00:00
Jeff Roberson
7e1b379e1e Don't unnecessarily relock the vm object after sleeps. This results in a
surprising amount of object contention on loop restarts in fault.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22821
2019-12-24 18:38:06 +00:00
Doug Moore
b649c2ac34 Fix typo using RB_INITIALIZER.
The macro RB_INITIALIZER ignores its argument, but is documented to
require "&head" as argument to initialize "head".  So using
"_vm_phys_fictitious_tree" as the argument to initialize
"vm_phys_fictitious_tree" is an inconsequential error, corrected here.

Discussed with: alc
2019-12-22 21:53:05 +00:00
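
The convention in a small usage sketch, built on <sys/tree.h> (FreeBSD; available via libbsd elsewhere):

    #include <stdio.h>
    #include <sys/tree.h>

    struct node {
        RB_ENTRY(node) entry;
        int key;
    };

    static int
    node_cmp(struct node *a, struct node *b)
    {
        return (a->key - b->key);
    }

    RB_HEAD(node_tree, node);
    /* RB_INITIALIZER ignores its argument; "&head" is only the documented
     * convention, which is why the typo was inconsequential. */
    static struct node_tree head = RB_INITIALIZER(&head);
    RB_GENERATE_STATIC(node_tree, node, entry, node_cmp)

    int
    main(void)
    {
        struct node a = { .key = 1 }, b = { .key = 2 }, *np;

        RB_INSERT(node_tree, &head, &a);
        RB_INSERT(node_tree, &head, &b);
        RB_FOREACH(np, node_tree, &head)
            printf("%d\n", np->key);
        return (0);
    }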
Jeff Roberson
419f0b1f95 Fix a bug introduced in r356002. Prior versions of this patchset had
vm_page_remove() rather than !vm_page_wired() as the condition for free.
When this changed back to wired the busy lock was leaked.

Reported by:	pho
Reviewed by:	markj
2019-12-22 20:35:50 +00:00
Jeff Roberson
3cf3b4e641 Make page busy state deterministic on free. Pages must be xbusy when
removed from objects including calls to free.  Pages must not be xbusy
when freed and not on an object.  Strengthen assertions to match these
expectations.  In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.

Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state.  This removes redundant and
potentially problematic code that has proliferated.

Discussed with:	markj
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D22822
2019-12-22 06:56:44 +00:00
Jeff Roberson
bef91632da Move vm_fault busy logic into its own function for clarity and re-use by
later changes.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22820
2019-12-22 04:21:16 +00:00
Mark Johnston
d07c571806 Fix VPO_UNMANAGED handling in vm_page_reclaim_run() after r353540.
When allocating a replacement page we must clear VPO_UNMANAGED since we
only ever reclaim pages from managed objects.  vm_page_replace() does
not handle this for us.

Sprinkle some assertions to help catch this sort of issue.

Reported by:	pho
Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22868
2019-12-21 19:04:05 +00:00
Mark Johnston
c2f22e9790 Fix the aflag shift on big-endian platforms after r355672.
The structure offset is zero regardless of endianness.

Reported by:	brooks
Pointy hat:	markj
2019-12-18 01:56:38 +00:00
Jeff Roberson
61a74c5ccd schedlock 1/4
Eliminate recursion from most thread_lock consumers.  Return from
sched_add() without the thread_lock held.  This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks.  This will eventually allow for lockless remote adds.

Discussed with:	kib
Reviewed by:	jhb
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22626
2019-12-15 21:11:15 +00:00
Jeff Roberson
4bf95d00ce Previously we did not support invalid pages in default objects. This means
that if fault fails to progress and needs to restart the loop it must free
the page it is working on and allocate again on restart.  Resolve the few
places that need to be modified to support this condition and simply
deactivate the page.  Presently, we only permit this when fault restarts
for busy contention.  This has an added benefit of removing some object
trylocking in this case.

While here consolidate some page cleanup logic into fault_page_free() and
fault_page_release() to reduce redundant code and automate some teardown.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D22653
2019-12-15 04:08:24 +00:00
Jeff Roberson
a808177864 Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.

Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy.  This may be
done inconsistently and requires the object lock which is not always
convenient.

Instead, track when swap space is present.  The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.

Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.

Discussed with:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22654
2019-12-15 03:15:06 +00:00
Jeff Roberson
d966c7615f Slightly optimize locking in vm_map_copy_swap_entry(). Anonymous objects
require the object lock to synchronize collapse.  Other swap objects such
as tmpfs do not.

Reported by:	mjg
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22747
2019-12-15 02:02:27 +00:00
Jeff Roberson
af00971419 Handle pagein clustering in vm_page_grab_valid() so that it can be used by
exec_map_first_page().  This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).

Discussed with:	alc
Approved by:	kib
Differential Revision:	https://reviews.freebsd.org/D22731
2019-12-15 02:00:32 +00:00
Ryan Libby
815db20425 uma dbg: flexible size for slab debug bitset too
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative.  Now, make the debug bitset
size dynamic too.  The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by:	jeff (previous version), markj (previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22759
2019-12-14 05:21:56 +00:00
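
The sizing arithmetic, with illustrative item counts:

    #include <stdio.h>

    #define NBBY 8
    #define howmany(x, y) (((x) + ((y) - 1)) / (y))

    /* A conservative bitmap reserves words for the largest possible item
     * count; sizing from the slab's actual item count trims the header. */
    int
    main(void)
    {
        int max_items = 512, actual_items = 32;
        int word_bits = (int)sizeof(long) * NBBY;

        printf("conservative: %d word(s)\n", howmany(max_items, word_bits));
        printf("dynamic:      %d word(s)\n", howmany(actual_items, word_bits));
        return (0);
    }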
Mark Johnston
325c4ced0d Restore the reservation of boot pages for bucket zones after r355707.
uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(),
but before that assignment, startup_alloc() will use pages from the
reserved pool, so the bucket zones themselves are still allocated using
startup pages.

Reviewed by:	rlibby
Reported by:	Jenkins via lwhsu
Differential Revision:	https://reviews.freebsd.org/D22797
2019-12-13 18:28:01 +00:00
Ryan Libby
d82c8ffb16 Revert r355706 & r355710
The quick fix didn't work.  I'll sort it out tomorrow.

Revert r355710: "libmemstat: unbreak build"
Revert r355706: "uma dbg: flexible size for slab debug bitset too"
2019-12-13 11:21:28 +00:00
Ryan Libby
f7af501519 uma: report slab efficiency
Reviewed by:	jeff
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22766
2019-12-13 09:32:09 +00:00
Ryan Libby
3182660a85 uma: delay bucket_init() until we might actually enable buckets
This helps with a bootstrapping problem in upcoming work.

We don't first enable buckets until uma_startup2(), so we can delay
bucket creation until then.  The other two paths to bucket_enable() are
both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM)
and one in uma_timeout() (first activated in uma_startup3()).  Note that
although some bucket functions are accessible before uma_startup2()
(e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone.

Discussed with:	jeff
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22765
2019-12-13 09:32:03 +00:00
Ryan Libby
7508f15ff1 uma dbg: flexible size for slab debug bitset too
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative.  Now, make the debug bitset
size dynamic too.  The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22759
2019-12-13 09:31:59 +00:00
Mark Johnston
cbc080b4c4 Avoid relying on silent type casting in the native atomic_load_32.
Reported by:	np
2019-12-12 23:55:34 +00:00
Mark Johnston
6fbaf6859c Implement atomic state updates using the new vm_page_astate_t structure.
Introduce primitives vm_page_astate_load() and vm_page_astate_fcmpset()
to operate on the 32-bit per-page atomic state.  Modify
vm_page_pqstate_fcmpset() to use them.  No functional change intended.

Introduce PGA_QUEUE_OP_MASK, a subset of PGA_QUEUE_STATE_MASK that only
includes queue operation flags.  This will be used in subsequent
patches.

Reviewed by:	alc, jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22753
2019-12-12 21:13:20 +00:00
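
A userland approximation of the fcmpset pattern on a 32-bit state word, using C11 atomics; the field layout and names below are hypothetical, not vm_page's real definition:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef union {
        struct { uint8_t flags, queue, act_count, _pad; } f;
        uint32_t word;
    } astate_t;

    static _Atomic uint32_t page_astate;

    /* On failure the current value lands back in *old, ready for retry. */
    static bool
    astate_fcmpset(astate_t *old, astate_t new)
    {
        return atomic_compare_exchange_weak(&page_astate, &old->word,
            new.word);
    }

    int
    main(void)
    {
        astate_t old, new;

        old.word = atomic_load(&page_astate);
        do {
            new = old;
            new.f.queue = 1;        /* e.g. move to the active queue */
            new.f.flags = 0;        /* and clear pending queue ops */
        } while (!astate_fcmpset(&old, new));
        printf("queue=%d flags=%d\n", new.f.queue, new.f.flags);
        return (0);
    }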
Doug Moore
037c0994bf Extract code common to _vm_map_clip_start and _vm_map_clip_end into a
function, vm_map_entry_clone, that can be invoked by each.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22760
2019-12-11 16:09:57 +00:00
Ryan Libby
6d204a6a0e uma: pretty print zone flags sysctl
Requested by:	jeff
Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22748
2019-12-11 06:50:55 +00:00
Mark Johnston
9be9ea420e Add a helper function to the swapout daemon's deactivation code.
vm_swapout_object_deactivate_pages() is renamed to
vm_swapout_object_deactivate(), and the loop body is moved into the new
vm_swapout_object_deactivate_page().  This makes the code a bit easier
to follow and is in preparation for some functional changes.

Reviewed by:	jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22651
2019-12-10 18:15:20 +00:00
Mark Johnston
5cff1f4dc3 Introduce vm_page_astate.
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state.  The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields.  No functional change intended.

Reviewed by:	alc, jeff, kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22650
2019-12-10 18:14:50 +00:00
Doug Moore
7887f00d2b Revert r355505. The code that it allowed to compile has been removed. 2019-12-09 05:09:46 +00:00