4424 Commits

Author SHA1 Message Date
Mateusz Guzik
a92a971bbb vfs: remove the thread argument from vget
It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)
2020-08-16 17:18:54 +00:00
Conrad Meyer
ea7b737a6f vm_pageout: Correct threshold calculation on single-CPU systems
Reported by:	Michael Butler
X-MFC-With:	r364129
2020-08-14 18:48:48 +00:00
Conrad Meyer
b7883452d4 Back out unrelated change
Reported by:	kib, markj
X-MFC-With:	r364129
2020-08-12 00:21:30 +00:00
Conrad Meyer
0292c54bdb Add support for multithreading the inactive queue pageout within a domain.
In very high throughput workloads, the inactive scan can become overwhelmed
as you have many cores producing pages and a single core freeing.  Since
Mark's introduction of batched pagequeue operations, we can now run multiple
inactive threads working on independent batches.

To avoid confusing the pid and other control algorithms, I (Jeff) do this in
a mpi-like fan out and collect model that is driven from the primary page
daemon.  It decides whether the shortfall can be overcome with a single
thread and if not dispatches multiple threads and waits for their results.

The heuristic is based on timing the pageout activity and averaging a
pages-per-second variable which is exponentially decayed. This is visible in
sysctl and may be interesting for other purposes.

I (Jeff) have verified that this does indeed double our paging throughput
when used with two threads. With four we tend to run into other contention
problems.  For now I would like to commit this infrastructure with only a
single thread enabled.

The number of worker threads per domain can be controlled with the
'vm.pageout_threads_per_domain' tunable.

Submitted by:	jeff (earlier version)
Discussed with:	markj
Tested by:	pho
Sponsored by:	probably Netflix (based on contemporary commits)
Differential Revision:	https://reviews.freebsd.org/D21629
2020-08-11 20:37:45 +00:00
Mark Johnston
af32cefd7c Check the UMA zone's full bucket cache before short-circuiting an alloc.
The global "bucketdisable" flag indicates that we are in a low memory
situation and should avoid allocating buckets.  However, in the
allocation path we were checking it before the full bucket cache and
bailing even if the cache is non-empty.  Defer the check so that we have
a shot at allocating from the cache.

This came up because M_NOWAIT allocations from the buf trie node zone
must always succeed.  In one scenario, all of the preallocated trie
nodes were in the bucket list, and a new slab allocation could not
succeed due to a memory shortage.  The short-circuiting caused an
allocation failure which triggered a panic.

Reported by:	pho
Reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25980
2020-08-10 20:34:45 +00:00
Brooks Davis
9f9cc3f989 Preserve ASLR vm_map flags across fork
In the most common case (fork+execve) this doesn't matter, but further
attempts to apply entropy would fail in (e.g.) a pre-fork server.

Reported by:	Alfredo Mazzinghi
Reviewed by:	kib, markj
Obtained from:	CheriBSD
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D25966
2020-08-06 16:20:20 +00:00
Mark Johnston
efec381dd1 Remove most lingering references to the page lock in comments.
Finish updating comments to reflect new locking protocols introduced
over the past year.  In particular, vm_page_lock is now effectively
unused.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25868
2020-08-04 14:59:43 +00:00
Mark Johnston
96ad26eefb Remove free_domain() and uma_zfree_domain().
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplyify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937
2020-08-04 13:58:36 +00:00
Mark Johnston
958d8f527c Remove the volatile qualifier from busy_lock.
Use atomic(9) to load the lock state.  Some places were doing this
already, so it was inconsistent.  In initialization code, the lock state
is still initialized with plain stores.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25861
2020-07-29 19:38:49 +00:00
Mark Johnston
f72e5be58a vm_page_xbusy_claim(): Use atomics to update busy lock state.
vm_page_xbusy_claim() could clobber the waiter bit.  For its original
use, kernel memory pages, this was not a problem since nothing would
ever block on the busy lock for such pages.  r363607 introduced a new
use where this could in principle be a problem.

Fix the problem by using atomic_cmpset to update the lock owner.  Since
this macro is defined only for INVARIANTS kernels the extra overhead
doesn't seem prohibitive.

Reported by:	vangyzen
Reviewed by:	alc, kib, vangyzen
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25859
2020-07-28 19:50:39 +00:00
Mark Johnston
782ebde52e vm_page_free_invalid(): Relax the xbusy assertion.
vm_page_assert_xbusied() asserts that the busying thread is the current
thread.  For some uses of vm_page_free_invalid() (e.g., error handling
in vnode_pager_generic_getpages_done()), this condition might not hold.

Reported by:	Jenkins via trasz
Reviewed by:	chs, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25828
2020-07-27 14:25:10 +00:00
Doug Moore
00fd73d2da Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64.  Remove the code that reduced max swap size down to that cap.

Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX.  Call them both BLIST_RADIX.

Make improvments to the blist self-test code to silence compiler
warnings and to test larger blists.

Reported by:	jmallett
Reviewed by:	alc
Discussed with:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25736
2020-07-25 18:29:10 +00:00
Mateusz Guzik
ee74412269 vm: fix swap reservation leak and clean up surrounding code
The code did not subtract from the global counter if per-uid reservation
failed.

Cleanup highlights:
- load overcommit once
- move per-uid manipulation to dedicated routines
- don't fetch wire count if requested size is below the limit
- convert return type from int to bool
- ifdef the routines with _KERNEL to keep vm.h compilable by userspace

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D25787
2020-07-24 13:23:32 +00:00
Mateusz Guzik
126a2470b9 vm: annotate swap_reserved with __exclusive_cache_line
The counter keeps being updated all the time and variables read afterwards
share the cacheline. Note this still fundamentally does not scale and needs
to be replaced, in the meantime gets a bandaid.

brk1_processes -t 52 ops/s:
before: 8598298
after:  9098080
2020-07-23 08:42:16 +00:00
Chuck Silvers
1bd12a3bb2 Fix vnode_pager handling of read ahead/behind pages when a disk read fails.
Rather than marking the read ahead/behind pages valid even though they were
not initialized, free them using the new function vm_page_free_invalid().

Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:10:35 +00:00
Chuck Silvers
4dfa06e114 Add a new function vm_page_free_invalid() for freeing invalid pages
that might be wired.  If the page is wired then it cannot be freed now,
but the thread that eventually unwires it will free it at that point.

Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:09:36 +00:00
Chuck Silvers
c3dbadc1fd Revert my change from r361855 in favor of a better fix.
Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25430
2020-07-17 23:08:01 +00:00
Mark Johnston
a7752896f0 Add vm_map_valid_range_KBI().
This is required for standalone module builds.

Reported by:	hselasky
Reviewed by:	dougm, hselasky, kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D25650
2020-07-13 16:39:27 +00:00
Scott Long
ffc568ba8b Revert r362998, r326999 while a better compatibility strategy is devised. 2020-07-09 22:38:36 +00:00
Scott Long
b302c2e5c9 Migrate the feature of excluding RAM pages to use "excludelist"
as its nomenclature.

MFC after:	1 week
2020-07-07 20:33:11 +00:00
Conrad Meyer
8a64110e43 vm: Add missing WITNESS warnings for M_WAITOK allocation
vm_map_clip_{end,start} and lookup_clip_start allocate memory M_WAITOK
for !system_map vm_maps.  Add WITNESS warning annotation for !system_map
callers who may be holding non-sleepable locks.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D25283
2020-06-29 16:54:00 +00:00
Mark Johnston
8c277118d8 Fix UMA's first-touch policy on systems with empty domains.
Suppose a thread is running on a CPU in a NUMA domain with no physical
RAM.  When an item is freed to a first-touch zone, it ends up in the
cross-domain bucket.  When the bucket is full, it gets placed in another
domain's bucket queue.  However, when allocating an item, UMA will
always go to the keg upon a per-CPU cache miss because the empty
domain's bucket queue will always be empty.  This means that a non-empty
domain's bucket queues can grow very rapidly on such systems.  For
example, it can easily cause mbuf allocation failures when the zone
limit is reached.

Change cache_alloc() to follow a round-robin policy when running on an
empty domain.

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25355
2020-06-28 21:35:04 +00:00
Konstantin Belousov
ee06cffcd2 vm_page_free_prep(): correct description of the required page and object state.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25482
2020-06-27 02:31:39 +00:00
Mark Johnston
84242cf68a Call swap_pager_freespace() from vm_object_page_remove().
All vm_object_page_remove() callers, except
linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when
removing a range of pages from an object.  The LinuxKPI case appears to
be an unintentional omission that could result in leaked swap blocks, so
unconditionally free swap space in vm_object_page_remove() to protect
against similar bugs in the future.

Reviewed by:	alc, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25329
2020-06-25 15:21:21 +00:00
Jeff Roberson
c8b0a88b8d Clarify some language. Favor primary where both master and primary were
used in conjunction with secondary.
2020-06-20 20:21:04 +00:00
Edward Tomasz Napierala
52c81be11a Add linux_madvise(2) instead of having Linux apps call the native
FreeBSD madvise(2) directly.  While some of the flag values match,
most don't.

PR:		kern/230160
Reported by:	markj
Reviewed by:	markj
Discussed with:	brooks, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25272
2020-06-20 18:29:22 +00:00
Mark Johnston
cdd02f43b9 Revert r362360.
This commit was simply wrong since two different objects are locked.

Reported by:	lwhsu, pho
Pointy hat:	markj
2020-06-19 11:04:49 +00:00
Mark Johnston
f034074034 Restore a check unintentionally dropped in r362361.
MFC with:	r362361
2020-06-19 04:18:20 +00:00
Mark Johnston
0f1e6ec591 Add a helper function for validating VA ranges.
Functions which take untrusted user ranges must validate against the
bounds of the map, and also check for wraparound.  Instead of having the
same logic duplicated in a number of places, add a function to check.

Reviewed by:	dougm, kib
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D25328
2020-06-19 03:32:04 +00:00
Mark Johnston
61b006887e Fix a double object unlock in vm_object_backing_collapse_wait().
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25327
2020-06-19 03:31:46 +00:00
Conrad Meyer
a116b5d3e4 vm: Drop vm_map_clip_{start,end} macro wrappers
No functional change.

Reviewed by:	dougm, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D25282
2020-06-16 22:53:56 +00:00
Eric van Gyzen
8cc8c5864a Honor db_pager_quit in some vm_object ddb commands
These can be rather verbose.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2020-06-12 21:53:08 +00:00
Mateusz Guzik
7ce3a31286 vm: rework swap_pager_status to execute in constant time
The lock-protected iteration is trivially avoidable.

This removes a serialisation point from Linux binaries (which end up calling
here from the sysinfo syscall).
2020-06-09 14:16:18 +00:00
Chuck Silvers
bd7d64f548 Don't mark pages as valid if reading the contents from disk fails.
Instead, just skip marking pages valid if the read fails.  Future
attempts to access such pages will notice that they are not marked valid
and try to read them from disk again.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25138
2020-06-06 00:47:59 +00:00
Ed Maste
4d13f78444 Correct terminology in vm.imply_prot_max sysctl description
As with r361769 (man page), PROT_* are properly called protections, not
permissions.

MFC after:	1 week
MFC with:	r361769
Sponsored by:	The FreeBSD Foundation
2020-06-04 01:49:29 +00:00
Mateusz Guzik
1c58c09f5a uma: hide item_domain under ifdef NUMA
Fixes build warnings on mips.
2020-05-29 08:30:35 +00:00
Mark Johnston
81302f1d77 Fix boot on systems where NUMA domain 0 is unpopulated.
- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
  ensure that segments registered during hammer_time() are placed in the
  right domain.  Otherwise, since the SRAT is not parsed at that point,
  we just add them to domain 0, which may be incorrect and results in a
  domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
  If domain 0 is unpopulated, the allocation will simply fail, resulting
  in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
  affinity table, and change vm_phys_early_alloc() to handle wildcard
  domains.  This is necessary on amd64, where the page array is dense
  and pmap_page_array_startup() may allocate page table pages for
  non-existent page frames.

Reported and tested by:	Rafael Kitover <rkitover@gmail.com>
Reviewed by:	cem (earlier version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25001
2020-05-28 19:41:00 +00:00
Konstantin Belousov
fe0dcc402f Simplify the condition to enable superpage mappings in vm_fault_soft_fast().
The list of arches list there matches the list of arches where
default VM_NRESERVLEVEL > 0.  Before sparc64 removal, that was the
only arch that defined VM_NRESERVLEVEL > 0 to help with cache coloring,
but did not implemented superpages.  Now it can be simplified.

Submitted by:	alc
Reviewed by:	markj
2020-05-27 21:44:26 +00:00
Justin Hibbits
d4ed51f329 Properly sort ifdef archs in vm_fault_soft_fast superpage guards.
Sort broken in r360887.
2020-05-27 01:35:46 +00:00
Mark Johnston
dc2b320563 Allocate UMA per-CPU counters earlier.
Otherwise anything counted before SI_SUB_VM_CONF is discarded.  However,
it is useful to be able to see stats from allocations done early during
boot.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24756
2020-05-14 16:06:54 +00:00
Kyle Evans
c79cee7136 kernel: provide panicky version of __unreachable
__builtin_unreachable doesn't raise any compile-time warnings/errors on its
own, so problems with its usage can't be easily detected. While it would be
nice for this situation to change and compilers to at least add a warning
for trivial cases where local state means the instruction can't be reached,
this isn't the case at the moment and likely will not happen.

This commit adds an __assert_unreachable, whose intent is incredibly clear:
it asserts that this instruction is unreachable. On INVARIANTS builds, it's
a panic(), and on non-INVARIANTS it expands to  __unreachable().

Existing users of __unreachable() are converted to __assert_unreachable,
to improve debuggability if this assumption is violated.

Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D23793
2020-05-13 18:07:37 +00:00
Justin Hibbits
65bbba25d2 powerpc64: Implement Radix MMU for POWER9 CPUs
Summary:
POWER9 supports two MMU formats: traditional hashed page tables, and Radix
page tables, similar to what's presesnt on most other architectures.  The
PowerISA also specifies a process table -- a table of page table pointers--
which on the POWER9 is only available with the Radix MMU, so we can take
advantage of it with the Radix MMU driver.

Written by Matt Macy.

Differential Revision: https://reviews.freebsd.org/D19516
2020-05-11 02:33:37 +00:00
Mark Johnston
a9ea09e548 Re-check for wirings after busying the page in vm_page_release_locked().
A concurrent unlocked lookup can wire the page after
vm_page_release_locked() releases the last wiring, in which case
vm_page_release_locked() must not free the page.  Once the xbusy lock is
acquired, that, the object lock and the fact that the page is unmapped
ensure that the wire count cannot increase, so re-check for new wirings
after the page is xbusied.

Update the comment above vm_page_wired() to reflect the new
synchronization rules.

Reported by:	glebius
Reviewed by:	alc, jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24592
2020-04-28 13:51:41 +00:00
Mark Johnston
f13fa9df05 Use a single VM object for kernel stacks.
Previously we allocated a separate VM object for each kernel stack.
However, fully constructed kernel stacks are cached by UMA, so there is
no harm in using a single global object for all stacks.  This reduces
memory consumption and makes it easier to define a memory allocation
policy for kernel stack pages, with the aim of reducing physical memory
fragmentation.

Add a global kstack_object, and use the stack KVA address to index into
the object like we do with kernel_object.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24473
2020-04-26 20:08:57 +00:00
Mark Johnston
33655d9546 Factor out the kmem contig page alloc and reclamation code.
kmem_alloc_attr_domain() and kmem_alloc_contig_domain() duplicated each
other's page allocation and reclamation logic.  Place it in a single
function to make it easier to add additional consumers.  No functional
change intended.

Reviewed by:	jeff, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24475
2020-04-21 16:01:44 +00:00
Mark Johnston
303b77029b Minimize conditional compilation for handling of M_EXEC.
This simplifies some planned changes.  No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24474
2020-04-21 15:55:28 +00:00
Mark Johnston
70e68b19a4 Handle trashed queue pointers in vm_page_acquire_unlocked().
vm_page_acquire_unlocked() relies on type-stability of vm_page
structures and assumes that the listq linkage pointers always point to a
vm_page or are NULL.  QUEUE_MACRO_DEBUG_TRASH breaks that assumption, so
add an explicit check for a trashed queue pointer before dereferencing.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24472
2020-04-20 14:45:17 +00:00
Bryan Drewery
adc0388117 Remove dead code leftover from r331018.
Sponsored by:	Dell EMC
2020-03-31 01:12:53 +00:00
Konstantin Belousov
abfdf76791 VOP_GETPAGES_ASYNC(): consistently call iodone() callback in case of error.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:44:30 +00:00
Konstantin Belousov
a7c55b3e1b ddb show pginfo: print pages reference value in hex.
It is more useful this way after the VPRC_ flags were introduced.

Sponsored by:	The FreeBSD Foundation
2020-03-28 12:21:52 +00:00