Commit Graph

4783 Commits

Author SHA1 Message Date
Doug Moore
0ce7909cd0 vm_phys: add essential segment bounds check
A lower-bound segment check is necessary in vm_phys_alloc_seg_contig.
Add one.

Reported by:	jenkins
Reviewed by:	alc
Fixes:	da92ecbc0d vm_phys: fix seg->end test in alloc_seg_contig
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33945
2022-01-19 00:42:39 -06:00
Doug Moore
da92ecbc0d vm_phys: fix seg->end test in alloc_seg_contig
In vm_phys_alloc_seg_contig, in allocating multiple memory blocks for
a huge allocation, ensure that the end of the allocated range does not
exceed the upper segment limit.

Reorder a couple of checks to improve code layout.

Reviewed by:	alc
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33870
2022-01-18 12:49:09 -06:00
Mark Johnston
46d35d415a fork: Copy the vm_stacktop field into the new vmspace
Fixes:	1811c1e957 ("exec: Reimplement stack address randomization")
Reported by:	pho
Reported by:	syzbot+0446312a51bc13ead834@syzkaller.appspotmail.com
Sponsored by:	The FreeBSD Foundation
2022-01-18 10:51:49 -05:00
Mark Johnston
1811c1e957 exec: Reimplement stack address randomization
The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack.  This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings.  In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap so small
(< several MB) limits could not be used.

There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose.  Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.

The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.

PR:		260303
Reviewed by:	kib
Discussed with:	emaste, mw
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33704
2022-01-17 16:12:36 -05:00
Mark Johnston
a04ce833f9 uma: Avoid polling for an invalid SMR sequence number
Buckets in an SMR-enabled zone can legitimately be tagged with
SMR_SEQ_INVALID.  This effectively means that the zone destructor (if
any) was invoked on all items in the bucket, and the contained memory is
safe to reuse.  If the first bucket in the full bucket list was tagged
this way, UMA would unnecessarily poll per-CPU state before attempting
to fetch a full bucket from the list.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-01-14 15:38:02 -05:00
Mark Johnston
4a864f624a vm_pageout: Print a more accurate message to the console before an OOM kill
Previously we'd always print "out of swap space."  This can be
misleading, as there are other reasons an OOM kill can be triggered.  In
particular, it's entirely possible to trigger an OOM kill on a system
with plenty of free swap space.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33810
2022-01-14 15:04:21 -05:00
Brooks Davis
0910a41ef3 Revert "syscallarg_t: Add a type for system call arguments"
Missed issues in truss on at least armv7 and powerpcspe need to be
resolved before recommit.

This reverts commit 3889fb8af0.
This reverts commit 1544e0f5d1.
2022-01-12 23:29:20 +00:00
Brooks Davis
1544e0f5d1 syscallarg_t: Add a type for system call arguments
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).

Obtained from:	CheriBSD

Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D33780
2022-01-12 22:51:25 +00:00
Doug Moore
84e2ae64c5 vm_reserv: use enhanced bitstring for popmaps
vm_reserv.c uses its own bitstring implemenation for popmaps. Using
the bitstring_t type from a standard header eliminates the code
duplication, allows some bit-at-a-time operations to be replaced with
more efficient bitstring range operations, and, in
vm_reserv_test_contig, allows bit_ffc_area_at to more efficiently
search for a big-enough set of consecutive zero-bits.

Make bitstring changes improve the vm_reserv code.  Define a bit_ntest
method to test whether a range of bits is all set, or all clear.
Define bit_ff_at and bit_ff_area_at to implement the ffs and ffc
versions with a parameter to choose between set- and clear- bits.
Improve the area_at implementation.  Modify the bit_nset and
bit_nclear implementations to allow code optimization in the cases
when start or end are multiples of _BITSTR_BITS.

Add a few new cases to bitstring_test.

Discussed with:	alc
Reviewed by:	markj
Tested by:	pho (earlier version)
Differential Revision:	https://reviews.freebsd.org/D33312
2022-01-12 11:03:53 -06:00
Mark Johnston
c4a25e0713 vm_pageout: Group sysctl variables together with sysctl definitions
Fix some style bugs while here.  No functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33811
2022-01-11 09:27:45 -05:00
Mark Johnston
43b3b8e52d swap_pager: uma_zcreate() doesn't fail
Remove always-false checks for UMA zone creation failure.  No functional
change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33809
2022-01-11 09:27:45 -05:00
Doug Moore
ae13829ddc vm_addr_ok: add power2 invariant check
With INVARIANTS defined, have vm_addr_align_ok and vm_addr_bound_ok
panic when passed an alignment/boundary parameter that is not a power
of two.

Reviewed by:	alc
Suggested by:	kib, se
Differential Revision:	https://reviews.freebsd.org/D33725
2022-01-10 01:17:25 -06:00
Konstantin Belousov
c25a30e255 Dump page tracking no longer needed on mips
Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D33763
2022-01-06 06:00:39 +02:00
Konstantin Belousov
f54882a862 Remove special kstack allocation code for mips.
The arch required two-pages alignment due to single TLB entry caching
two consequtive mappings.

Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D33763
2022-01-06 04:43:56 +02:00
Doug Moore
f76916c095 vm_reserv: #include vm_extern.h explicitly, for arm.
Fixes:	c606ab59e7 vm_extern: use standard address checkers everywhere
2021-12-31 00:40:25 -06:00
Doug Moore
e6930b1c5f vm_phys: convert error back to warning
Move an assignment back to where it was before, to turn the
defined-but-not-used error back into a set-but-not-used warning.

Fixes:	01e115ab83 vm_phys: #include vm_extern
2021-12-31 00:23:46 -06:00
Doug Moore
01e115ab83 vm_phys: #include vm_extern
Arm64 and powerpc don't include vm_extern.h indirectly in vm_phys.c, which
means that for the sake of those architectures, it must be included explicitly.

Also, fix a set-unused warning that jenkins also found.

Reported by:	Jenkins
Fixes:	c606ab59e7 vm_extern: use standard address checkers everywhere
2021-12-30 23:31:18 -06:00
Doug Moore
c606ab59e7 vm_extern: use standard address checkers everywhere
Define simple functions for alignment and boundary checks and use them
everywhere instead of having slightly different implementations
scattered about. Define them in vm_extern.h and use them where
possible where vm_extern.h is included.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D33685
2021-12-30 22:09:08 -06:00
Gleb Smirnoff
841e0a8757 uma: with KTR trace allocs/frees from SMR zones 2021-12-29 23:08:33 -08:00
Gleb Smirnoff
28782f73df uma: with KTR report item being freed in uma_zfree_arg() 2021-12-29 23:08:15 -08:00
Doug Moore
8119cdd38b vm_phys: hide vm_phys_set_pool
It is only called in the file that defines it, so make it static and
remove the declaration from the header.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D33688
2021-12-29 11:17:33 -06:00
John Baldwin
d90e41a154 sys/vm: Use C99 fixed-width integer types.
No functional change.

Reviewed by:	imp, kib, emaste
Differential Revision:	https://reviews.freebsd.org/D33641
2021-12-28 09:43:21 -08:00
Doug Moore
49fd2d51f0 vm_reserv: fix zero-boundary error
Handle specially the boundary==0 case of vm_reserv_reclaim_config,
by turning off boundary adjustment in that case.

Reviewed by:	alc
Tested by:	pho, madpilot
2021-12-26 11:40:27 -06:00
Doug Moore
4bae154fe8 vm_page: Move a comment
fb38b29b56 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc
should have moved a comment block when it moved the function call that followed it.

Move the comment block now.
2021-12-24 16:10:30 -06:00
Doug Moore
0d5fac2872 vm: alloc pages from reserv before breaking it
Function vm_reserv_reclaim_contig breaks a reservation with enough
free space to satisfy an allocation request and returns the free space
to the buddy allocator. Change the function to allocate the request
memory from the reservation before breaking it, and return that memory
to the caller. That avoids a second call to the buddy allocator and
guarantees successful allocation after breaking the reservation, where
that success is not currently guaranteed.

Reviewed by:	alc, kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D33644
2021-12-24 12:59:16 -06:00
Doug Moore
184c63db3c Fix clerical error in page alloc
Fix a very recent change that introduced a page accounting error in
case of a reserveration being broken.
Reviewed by:	alc
Fixes:	fb38b29b56 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc
Differential Revision:	https://reviews.freebsd.org/D33645
2021-12-24 02:47:21 -06:00
Doug Moore
fb38b29b56 vm_page: Remove extra test, dup code from page alloc
Extract code common to functions vm_page_alloc_contig_domain and
vm_page_alloc_noobj_contig_domain into a new function.  Do so in a way
that eliminates a bound-to-fail reservation test after a reservation
is broken by a call from vm_page_alloc_contig_domain.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D33551
2021-12-23 22:45:47 -06:00
Stephen J. Kiernan
18048b6e3c Eliminate key press requirement "show vmopag" command output.
Summary:
One was required to press a key to continue after every 18 lines of
output. This requirement had been in the "show vmopag" command since it
was introduced, which was many years before paging was added to DDB.
With paging, this explict key check is no longer necessary.

Obtained from:	Juniper Networks, Inc.
MFC after:	1 week

Test Plan:
Run "show vmopag" from db> prompt and see that it does not need additional
keypresses other than the ones needed for the pager.

Subscribers: imp, #contributor_reviews_base

Differential Revision: https://reviews.freebsd.org/D33550
2021-12-19 19:40:52 -05:00
Rick Macklem
cd37afd8b6 vm_object: Make is_object_active() global
Commit 867c27c23a modified the NFS client so that
it does IO_APPEND writes directly to the NFS server,
bypassing the buffer cache.  However, this could result
in stale data in client pages when the file is mmap(2)'d.
As such, the NFS client needs to call is_object_active()
to check if the file is mmap(2)'d.

This patch renames is_object_active() to vm_object_is_active(),
moves it to sys/vm/vm_object.c and makes it global, so that
the NFS client can call it in a future commit.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33520
2021-12-19 16:11:44 -08:00
Doug Moore
f7aa44763d Correct type size format error in KASSERT.
Reported by:	jenkins
Fixes:	6f1c890827 vm: Don't break vm reserv that can't meet align reqs
2021-12-16 13:48:58 -06:00
Doug Moore
6f1c890827 vm: Don't break vm reserv that can't meet align reqs
Function vm_reserv_test_contig has incorrectly used its alignment
and boundary parameters to find a well-positioned range of empty pages
in a reservation.  Consequently, a reservation could be broken
mistakenly when it was unable to provide a satisfactory set of pages.

Rename the function, correct the errors, and add assertions to detect
the error in case it appears again.

Reviewed by:	alc, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33344
2021-12-16 12:20:56 -06:00
Mark Johnston
88642d978a vm_fault: Fix vm_fault_populate()'s handling of VM_FAULT_WIRE
vm_map_wire() works by calling vm_fault(VM_FAULT_WIRE) on each page in
the rage.  (For largepage mappings, it calls vm_fault() once per large
page.)

A pager's populate method may return more than one page to be mapped.
If VM_FAULT_WIRE is also specified, we'd wire each page in the run, not
just the fault page.  Consider an object with two pages mapped in a
vm_map_entry, and suppose vm_map_wire() is called on the entry.  Then,
the first vm_fault() would allocate and wire both pages, and the second
would encounter a valid page upon lookup and wire it again in the
regular fault handler.  So the second page is wired twice and will be
leaked when the object is destroyed.

Fix the problem by modify vm_fault_populate() to wire only the fault
page.  Also modify the error handler for pmap_enter(psind=1) to not test
fs->wired, since it must be false.

PR:		260347
Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33416
2021-12-14 15:10:46 -05:00
Konstantin Belousov
5346570276 swapoff: add one more variant of the syscall
Requested and reviewed by:	brooks
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33343
2021-12-09 02:48:46 +02:00
Doug Moore
9f32cb5b1c Set uninitialized popmap bits in vm_reserv_init
In vm_reserv_init, set all the marker popmap bits in vm_reserv_init,
and not just the bits of the first popmap entry.

Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33258
2021-12-05 17:17:25 -06:00
Gleb Smirnoff
2cb67bd798 uma: remove unused *item argument from cache_free()
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D33272
2021-12-05 10:44:47 -08:00
Mark Johnston
39a7396f5d vm_page: Tighten the object lock assertion in vm_page_invalid()
A page must not become invalid while vm_fault_soft_fast() is attempting
to map unbusied pages for reading.

Note that all callers hold the object write lock already, and
vm_page_set_invalid() asserts the object write lock.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33250
2021-12-05 10:51:11 -05:00
Konstantin Belousov
e8dc2ba29c swapoff(2): add a SWAPOFF_FORCE flag
The flag requests skipping the heuristic which tries to avoid leaving
system with more allocated memory than available from RAM and remanining
swap.

Reviewed by:	markj
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33165
2021-12-05 00:20:58 +02:00
Konstantin Belousov
a4e4132fa3 swapoff(2): replace special device name argument with a structure
For compatibility, add a placeholder pointer to the start of the
added struct swapoff_new_args, and use it to distinguish old vs. new
style of syscall invocation.

Reviewed by:	markj
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33165
2021-12-05 00:20:58 +02:00
Konstantin Belousov
6df359449f swap_pager.c: Remove MPSAFE and ARGSUSED annotations
Reviewed by:	markj
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33165
2021-12-05 00:20:58 +02:00
Konstantin Belousov
0190c38b9d swapoff_one(): only check free pages count manually turning swap off
When swap is turned off due to system shutdown or reboot, ignore the
check.  Problem is that the check is not accurate by any means, free
page count can legitimately be low while system still able to page in
everything from the swap.  Then, we turn swap off if swapping on
real file or some non-standard geom provider, and typically panic
when system appears to actually need to unavailable page.

For syscall, it is better to be safe than sorry.

Reported and tested by:	peterj
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33147
2021-11-29 18:38:02 +02:00
Mateusz Guzik
7e1d3eefd4 vfs: remove the unused thread argument from NDINIT*
See b4a58fbf64 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.
2021-11-25 22:50:42 +00:00
Konstantin Belousov
b19740f4ce swap_pager: lock vnode in swapdev_strategy()
VOP_STRATEGY() requires locked vnode.  Note that we lock the swap vnode
while pages are busy, but this would only cause real LoR if pages belong
to the swap vnode, which must not be the case for correct use.

Reported and tested by:	peterj
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33119
2021-11-25 21:34:50 +02:00
Konstantin Belousov
6ddf41faa6 swapon: extend the region where the swap vnode is locked
to cover VOP_GETATTR() call in sys_swapon().  Move locking from inside
swapongeom() and swaponvp() into sys_swapon().

Reported by and tested by:	peterj
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33119
2021-11-25 21:34:44 +02:00
Konstantin Belousov
a6d04f34a4 swap pager: lock vnode around VOP_CLOSE()
Reported and tested by:	peterj
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33119
2021-11-25 21:34:39 +02:00
Mark Johnston
d47d3a94bb vm_fault: Factor out per-object operations into vm_fault_object()
No functional change intended.

Obtained from:	jeff (object_concurrency patches)
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33018
2021-11-24 14:02:56 -05:00
Mark Johnston
f1b642c255 vm_fault: Introduce a fault_status enum for internal return types
Rather than overloading the meanings of the Mach statuses, introduce a
new set for use internally in the fault code.  This makes the control
flow easier to follow and provides some extra error checking when a
fault status variable is used in a switch statement.

vm_fault_lookup() and vm_fault_relookup() continue to use Mach statuses
for now, as there isn't much benefit to converting them and they
effectively pass through a status from vm_map_lookup().

Obtained from:	jeff (object_concurrency patches)
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33017
2021-11-24 14:02:55 -05:00
Mark Johnston
45c09a74d6 vm_fault: Move nera into faultstate
This makes it easier to factor out pieces of vm_fault().  No functional
change intended.

Obtained from:	jeff (object_concurrency patches)
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33016
2021-11-24 14:02:55 -05:00
Mitchell Horne
10fe6f80a6 minidump: Use the provided dump bitset
When constructing the set of dumpable pages, use the bitset provided by
the state argument, rather than assuming vm_page_dump invariably. For
normal kernel minidumps this will be a pointer to vm_page_dump, but when
dumping the live system it will not.

To do this, the functions in vm_dumpset.h are extended to accept the
desired bitset as an argument. Note that this provided bitset is assumed
to be derived from vm_page_dump, and therefore has the same size.

Reviewed by:	kib, markj, jhb
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D31992
2021-11-19 15:05:52 -04:00
Brooks Davis
01ce7fca44 ommap: fix signed len and pos arguments
4.3 BSD's mmap took an int len and long pos.  Reject negative lengths
and in freebsd32 sign-extend pos correctly rather than mis-handling
negative positions as large positive ones.

Reviewed by:	kib
2021-11-15 18:34:28 +00:00
Mark Johnston
d28af1abf0 vm: Add a mode to vm_object_page_remove() which skips invalid pages
This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks.  In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().

Add a helper, vn_pages_remove_valid(), for use by filesystems.  Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.

PR:		258208
Reviewed by:	avg, kib, sef
Tested by:	pho (part of a larger patch)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32931
2021-11-15 13:01:30 -05:00
Mark Johnston
a2665158d0 vm_page: Remove vm_page_sbusy() and vm_page_xbusy()
They are unused today and cannot be safely used in the face of unlocked
lookup, in which pages may be busied without the object lock held.

Obtained from:	jeff (object_concurrency patches)
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D32948
2021-11-15 13:01:30 -05:00
Mark Johnston
87b646630c vm_page: Consolidate page busy sleep mechanisms
- Modify vm_page_busy_sleep() and vm_page_busy_sleep_unlocked() to take
  a VM_ALLOC_* flag indicating whether to sleep on shared-busy, and fix
  up callers.
- Modify vm_page_busy_sleep() to return a status indicating whether the
  object lock was dropped, and fix up callers.
- Convert callers of vm_page_sleep_if_busy() to use vm_page_busy_sleep()
  instead.
- Remove vm_page_sleep_if_(x)busy().

No functional change intended.

Obtained from:	jeff (object_concurrency patches)
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D32947
2021-11-15 13:01:30 -05:00
Mark Johnston
b0acc3f11b vm_pager: Optimize an assertion
Obtained from:	jeff (object_concurrency patches)
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32946
2021-11-15 13:01:30 -05:00
Mark Johnston
e4bdb6857a vm_page: Handle VM_ALLOC_NORECLAIM in the contiguous page allocator
We added _NORECLAIM to request that kmem_alloc_contig_pages() not spend
time scanning physical memory for candidates to reclaim.  In some
situations the scanning can induce large amounts of undesirable latency,
and it's less important that the request be satisfied than it is that we
not spend many milliseconds scanning.

The problem extends to vm_reserv_reclaim_contig(), which unlike
vm_reserv_reclaim() may have to scan the entire list of partially
populated reservations.  Use VM_ALLOC_NORECLAIM to request that this
scan not be executed.[1]

As a side effect, this fixes a regression in 02fb0585e7 ("vm_page:
Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain()")
where VM_ALLOC_CONTIG was not included in VPAC_FLAGS or VPANC_FLAGS even
though it is not masked by kmem_alloc_contig_pages().[2]

Reported by:	gallatin [1], glebius [2]
Reviewed by:	alc, glebius, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32899
2021-11-11 14:26:41 -05:00
Gordon Bergling
c28e39c3d6 Fix a common typo in syctl descriptions
- s/maxiumum/maximum/

MFC after:	3 days
2021-11-03 20:49:24 +01:00
Mark Johnston
7585c5db25 uma: Fix handling of reserves in zone_import()
Kegs with no items reserved have uk_reserve = 0.  So the check
keg->uk_reserve >= dom->ud_free_items will be true once all slabs are
depleted.  Then, rather than go and allocate a fresh slab, we return to
the cache layer.

The intent was to do this only when the keg actually has a reserve, so
modify the check to verify this first.  Another approach would be to
make uk_reserve signed and set it to -1 until uma_zone_reserve() is
called, but this requires a few casts elsewhere.

Fixes:	1b2dcc8c54 ("uma: Avoid depleting keg reserves when filling a bucket")
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32516
2021-11-01 09:51:43 -04:00
Mark Johnston
fab343a716 uma: Improve M_USE_RESERVE handling in keg_fetch_slab()
M_USE_RESERVE is used in a couple of places in the VM to avoid unbounded
recursion when the direct map is not available, as is the case on 32-bit
platforms or when certain kernel sanitizers (KASAN and KMSAN) are
enabled.  For example, to allocate KVA, the kernel might allocate a
kernel map entry, which might require a new slab, which requires KVA.

For these zones, we use uma_prealloc() to populate a reserve of items,
and then in certain serialized contexts M_USE_RESERVE can be used to
guarantee a successful allocation.  uma_prealloc() allocates the
requested number of items, distributing them evenly among NUMA domains.
Thus, in a first-touch zone, to satisfy an M_USE_RESERVE allocation we
might have to check the slab lists of other domains than the current one
to provide the semantics expected by consumers.

So, try harder to find an item if M_USE_RESERVE is specified and the keg
doesn't have anything for current (first-touch) domain.  Specifically,
fall back to a round-robin slab allocation.  This change fixes boot-time
panics on NUMA systems with KASAN or KMSAN enabled.[1]

Alternately we could have uma_prealloc() allocate the requested number
of items for each domain, but for some existing consumers this would be
quite wasteful.  In general I think keg_fetch_slab() should try harder
to find free slabs in other domains before trying to allocate fresh
ones, but let's limit this to M_USE_RESERVE for now.

Also fix a separate problem that I noticed: in a non-round-robin slab
allocation with M_WAITOK, rather than sleeping after a failed slab
allocation we simply try again.  Call vm_wait_domain() before retrying.

Reported by:	mjg, tuexen [1]
Reviewed by:	alc
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32515
2021-11-01 09:51:18 -04:00
Konstantin Belousov
350fc36b4c sysctl vm.objects: yield if hog
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31163
2021-10-25 20:34:02 +03:00
Konstantin Belousov
7738118e9a vm.objects_swap: disable reporting some information
For making the call faster, do not count active/inactive object queues,
and do not report vnode info if any (for tmpfs).

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31163
2021-10-25 20:34:01 +03:00
Konstantin Belousov
42812ccc96 Add vm.swap_objects sysctl
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31163
2021-10-25 20:34:01 +03:00
Konstantin Belousov
1b610624fd vm_object_list: split sysctl handler in separate function
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31163
2021-10-25 20:34:01 +03:00
Mark Johnston
d7acbe481d vm_page: Break reservations to handle noobj allocations
vm_reserv_reclaim_*() will release pages to the default freepool, not
the direct freepool from which noobj allocations are drawn.  But if both
pools are empty, the noobj allocator variants must break reservations to
make progress.

Reported by:	cy
Reviewed by:	kib (previous version)
Fixes:	b498f71bc5 ("vm_page: Add a new page allocator interface for unnamed pages")
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32592
2021-10-22 09:25:59 -04:00
Mark Johnston
a9d6f1fe0a Remove some remaining references to VM_ALLOC_NOOBJ
Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32037
2021-10-19 21:22:56 -04:00
Mark Johnston
b801c79dda vm_fault: Stop specifying VM_ALLOC_ZERO
Now vm_page_alloc() and friends will unconditionally preserve PG_ZERO,
so there is no point in setting this flag.

Eliminate a local variable and add a comment explaining why we
prioritize the allocation when the process is doomed.

No functional change intended.

Reviewed by:	kib, alc
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32036
2021-10-19 21:22:56 -04:00
Mark Johnston
02fb0585e7 vm_page: Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain()
As in vm_page_alloc_domain_after(), unconditionally preserve PG_ZERO.

Implement vm_page_alloc_noobj_contig_domain().

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32034
2021-10-19 21:22:56 -04:00
Mark Johnston
c40cf9bc62 vm_page: Stop handling VM_ALLOC_NOOBJ in vm_page_alloc_domain_after()
This makes the allocator simpler since it can assume object != NULL.
Also modify the function to unconditionally preserve PG_ZERO, so
VM_ALLOC_ZERO is effectively ignored (and still must be implemented by
the caller for now).

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32033
2021-10-19 21:22:56 -04:00
Mark Johnston
84c3922243 Convert consumers to vm_page_alloc_noobj_contig()
Remove now-unneeded page zeroing.  No functional change intended.

Reviewed by:	alc, hselasky, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32006
2021-10-19 21:22:56 -04:00
Mark Johnston
92db9f3bb7 Introduce vm_page_alloc_noobj_contig()
This is the same as vm_page_alloc_noobj(), but allocates physically
contiguous runs of memory.  For now it is implemented in terms of
vm_page_alloc_contig(), with the difference that
vm_page_alloc_noobj_contig() implements VM_ALLOC_ZERO by zeroing the
page.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32005
2021-10-19 21:22:56 -04:00
Mark Johnston
a4667e09e6 Convert vm_page_alloc() callers to use vm_page_alloc_noobj().
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ.  In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by:	alc, hselasky, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31986
2021-10-19 21:22:56 -04:00
Mark Johnston
b498f71bc5 vm_page: Add a new page allocator interface for unnamed pages
The diff adds vm_page_alloc_noobj() and vm_page_alloc_noobj_domain().
These mostly correspond to vm_page_alloc() and vm_page_alloc_domain()
when no VM object is specified, with the exception that they handle
VM_ALLOC_ZERO by zeroing the page, rather than by preserving PG_ZERO.

This simplifies callers and will permit simplification of the
vm_page_alloc_domain() definition.

Since the new allocator variant is similar to vm_page_alloc_freelist(),
implement both of them using a common backend allocator function.  No
functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31985
2021-10-19 21:22:55 -04:00
Mark Johnston
a23e6a1078 vm_page: Move vm_page_alloc_check() to after page allocator definitions
This way all of the vm_page_alloc_*() allocator functions are grouped
together.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-10-19 21:22:50 -04:00
Edward Tomasz Napierala
0f559a9f09 Make vmdaemon timeout configurable
Make vmdaemon timeout configurable, so that one can adjust
how often it runs.

Here's a trick: set this to 1, then run 'limits -m 0 sh',
then run whatever you want with 'ktrace -it XXX', and observe
how the working set changes over time.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D22038
2021-10-17 13:49:29 +01:00
Dawid Gorecki
889b56c8cd setrlimit: Take stack gap into account.
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.

Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.

PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516
2021-10-15 10:21:47 +02:00
Warner Losh
cdccd11b36 forward declare struct thread
sys/sysctl.h moved struct thread forward declaration under #ifdef
_KERNEL and so this header fails when included from userland. Add a
forward declaration here.

Fixes:	     		99eefc727e
Sponsored by:		Netflix
2021-10-11 12:59:39 -06:00
Konstantin Belousov
174aad047e vm_fault: do not trigger OOM too early
Wakeup in vm_waitpfault() does not mean that the thread would get the
page on the next vm_page_alloc() call, other thread might steal the free
page we were waiting for. On the other hand, this wakeup might come much
earlier than just vm_pfault_oom_wait seconds, if the rate of the page
reclamation is high enough.

If wakeups come fast and we loose the allocation race enough times, OOM
could be undeservably triggered much earlier than vm_pfault_oom_attempts
x vm_pfault_oom_wait seconds.  Fix it by not counting the number of sleeps,
but measuring the time to th first allocation failure, and triggering OOM
when it was older than oom_attempts x oom_wait seconds.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D32287
2021-10-08 12:24:46 +03:00
Mitchell Horne
31991a5a45 minidump: De-duplicate is_dumpable()
The function is identical in each minidump implementation, so move it to
vm_phys.c. The only slight exception is powerpc where the function was
public, for use in moea64_scan_pmap().

Reviewed by:	kib, markj, imp (earlier version)
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D31884
2021-09-29 16:41:52 -03:00
Gleb Smirnoff
183f8e1e57 Externalize nsw_cluster_max and initialize it early.
GEOM_ELI needs to know the value, cause it will soon have special
memory handling for IO operations associated with swap.

Move initialization to swap_pager_init(), which is executed at
SI_SUB_VM, unlike swap_pager_swap_init(), which would be executed
only when a swap is configured. GEOM_ELI might need the value at
SI_SUB_DRIVERS, when disks are tasted by GEOM.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D24400
2021-09-28 11:23:52 -07:00
Gleb Smirnoff
c6213beff4 Add flag BIO_SWAP to mark IOs that are associated with swap.
Submitted by:		jtl
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D24400
2021-09-28 11:23:51 -07:00
Konstantin Belousov
bd3a668087 vm_page_startup: correct calculation of the starting page
Also avoid unneded calculations when phys segment end is the phys_avail[]
start.

Submitted by:	alc
Reviewed by:	markj
MFC after:	1 week
Fixes:	181bfb42fd
Differential revision:	https://reviews.freebsd.org/D32009
2021-09-19 21:27:55 +03:00
Mark Johnston
d6e77cda9b uma: Show the count of free slabs in each per-domain keg's sysctl tree
This is useful for measuring the number of pages that could be freed
from a NOFREE zone under memory pressure.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-09-17 14:19:05 -04:00
Konstantin Belousov
181bfb42fd vm_phys: do not ignore phys_avail[] segments that do not fit completely into vm_phys segments
If phys_avail[] segment only intersect with some vm_phys segment, add
pages from it to the free list that belong to the given vm_phys_seg,
instead of dropping them.

The vm_phys segments are generally result of subdivision of phys_avail
segments, for instance DMA32 or LOWMEM boundaries split them. On
amd64, after UEFI in-place kernel activation (copy_staging disable)
was enabled, we typically have a large phys_avail[] segment below 4G
which crosses LOWMEM (1M) boundary. With the current way of requiring
phys_avail[] fully fit into vm_phys_seg, this memory was ignored.

Reported by:	madpilot
Reviewed by:	markj
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31958
2021-09-16 20:01:19 +03:00
Mark Johnston
686aa9287c swap_pager: Handle large swap_pager_reserve() requests
This interface is used solely by md(4) when the MD_RESERVE flag is
specified, as in `mdconfig -a -t swap -s 1G -o reserve`.  It
pre-allocates swap blocks for the entire object.

The number of blocks to be reserved is specified as a vm_size_t, but
swp_pager_getswapspace() can allocate at most INT_MAX blocks.  vm_size_t
also seems like the incorrect type to use here it refers only to the
size of the VM object, not the size of a mapping.  So:
- change the type of "size" in swap_pager_reserve() to vm_pindex_t, and
- clamp the requested number of blocks for a single
  swp_pager_getswapspace() call to INT_MAX.

Reported by:	syzkaller
Reviewed by:	dougm, alc, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31875
2021-09-07 14:04:50 -04:00
Bjoern A. Zeeb
eccb516db8 vm: use __func__ for the correct function name
In fee2a2fa39 the KASSERTs in
vm_page_unwire_noq() changed from "vm_page_unwire" to "vm_page_unref".
While the former no longer was part of that function the latter does
not exist as a function and is highly confusing when hit when using
tools to lookup the functions and not doing a full-text search.
Use %s __func__ for printing the function name, as that will do the
right thing as code moves around and functions get renamed.

Hit:	while debugging a wired page leak with linuxkpi/iwlwifi
Sponsored by:	The FreeBSD Foundation
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D31635
2021-08-22 17:43:12 +00:00
Gordon Bergling
fa7a635f7e Fix a few typos in source code comments
- s/becase/because/

MFC after:	5 days
2021-08-14 09:06:09 +02:00
Mark Johnston
100949103a uma: Add KMSAN hooks
For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO).  Some zones are exempted from
this when it would otherwise raise false positives.

Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations.  This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report.  For example:
  panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe

Sponsored by:	The FreeBSD Foundation
2021-08-10 21:27:54 -04:00
Mark Johnston
8978608832 amd64: Populate the KMSAN shadow maps and integrate with the VM
- During boot, allocate PDP pages for the shadow maps.  The region above
  KERNBASE is currently not shadowed.
- Create a dummy shadow for the vm page array.  For now, this array is
  not protected by the shadow map to help reduce kernel memory usage.
- Grow shadows when growing the kernel map.
- Increase the default kernel stack size when KMSAN is enabled.  As with
  KASAN, sanitizer instrumentation appears to create stack frames large
  enough that the default value is not sufficient.
- Disable UMA's use of the direct map when KMSAN is configured.  KMSAN
  cannot validate the direct map.
- Disable unmapped I/O when KMSAN configured.
- Lower the limit on paging buffers when KMSAN is configured.  Each
  buffer has a static MAXPHYS-sized allocation of KVA, which in turn
  eats 2*MAXPHYS of space in the shadow map.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31295
2021-08-10 21:27:53 -04:00
Ka Ho Ng
de2e152959 Add vnode_pager_purge_range(9) KPI
This KPI is created in addition to the existing vnode_pager_setsize(9)
KPI. The KPI is intended for file systems that are able to turn a range
of file into sparse range, also known as hole-punching.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27194
2021-08-05 22:52:26 +08:00
Konstantin Belousov
0ef5eee9d9 Add vn_lktype_write()
and remove repetetive code that calculates vnode locking type for write.

Reviewed by:	khng, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31405
2021-08-04 19:40:13 +03:00
Konstantin Belousov
041b7317f7 Add pmap_vm_page_alloc_check()
which is the place to put MD asserts about allocated pages.

On amd64, verify that allocated page does not belong to the kernel
(text, data) or early allocated pages.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31121
2021-07-31 16:53:42 +03:00
Mark Johnston
4e8e26a004 redzone: Raise a compile error if KASAN is configured
redzone(9) does some munging of the allocation to insert redzones before
and after a valid memory buffer, but KASAN does not know about this and
will raise false positives if both are configured.  Until this is fixed,
do not allow both to be configured.  Note that KASAN provides similar
checking on its own but currently does not force the creation of
redzones for all UMA allocations; this should be addressed as well.

Sponsored by:	The FreeBSD Foundation
2021-07-23 10:47:13 -04:00
Mark Johnston
b0dfc48684 uma: Fix a few problems with KASAN integration
- Ensure that all items returned by UMA are aligned to
  KASAN_SHADOW_SCALE (8).  This was true in practice since smaller
  alignments are not used by any consumers, but we should enforce it
  anyway.
- Use a non-zero code for marking redzones that appear naturally in
  items that are not a multiple of the scale factor in size.  Currently
  we do not modify keg layouts to force the creation of redzones.
- Use a non-zero code for marking freed per-CPU items, otherwise
  accesses of freed per-CPU items are not detected by the runtime.

Sponsored by:	The FreeBSD Foundation
2021-07-09 20:38:50 -04:00
Konstantin Belousov
5b10e79edb Un-staticise vm_page_init_page()
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30785
2021-06-17 16:58:44 +03:00
Mateusz Guzik
128e25842e vm: add another pager private flag
Move OBJ_SHADOWLIST around to let pager flags be next to each other.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D30258
2021-05-15 20:47:29 +00:00
Konstantin Belousov
28bc23ab92 tmpfs: dynamically register tmpfs pager
Remove OBJT_SWAP_TMPFS. Move tmpfs-specific swap pager bits into
tmpfs_subr.c.

There is no longer any code to directly support tmpfs in sys/vm, most
tmpfs knowledge is shared by non-anon swap object type implementation.
The tmpfs-specific methods are provided by registered tmpfs pager, which
inherits from the swap pager.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30168
2021-05-13 20:13:34 +03:00
Konstantin Belousov
b730fd30b7 vm: Add KPI to dynamically register pagers
Pager is allowed to inherit part of its implementation from the existing
pager, which is done by copying non-NULL virtual method slots.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30168
2021-05-13 20:12:29 +03:00
Konstantin Belousov
7079449b0b sys/vm: remove several other uses of OBJT_SWAP_TMPFS
Mostly in cases where OBJ_SWAP flag works as well, or by reversing the
condition so that object types can be listed.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30168
2021-05-13 20:10:35 +03:00
Konstantin Belousov
3e7a11ca21 vm_object_set_memattr(): handle all object types without listing them explicitly
This avoids the need to know all existing object types in advance, by the
cost of loosing the assert that unknown object type is handled in a sane
manner.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30168
2021-05-13 20:10:35 +03:00
Konstantin Belousov
00a3fe968b vm_object_kvme_type(): reimplement by embedding kvme_type into pagerops
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30168
2021-05-13 20:10:35 +03:00
Mark Johnston
9246b3090c fork: Suspend other threads if both RFPROC and RFMEM are not set
Otherwise, a multithreaded parent process may trigger races in
vm_forkproc() if one thread calls rfork() with RFMEM set and another
calls rfork() without RFMEM.

Also simplify vm_forkproc() a bit, vmspace_unshare() already checks to
see if the address space is shared.

Reported by:	syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com
Reported by:	syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30220
2021-05-13 08:33:23 -04:00
Mark Johnston
06d1fd9f42 swap_pager: Zero swap info before exporting to userspace
Otherwise padding bytes are leaked.

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-05-12 12:52:05 -04:00
Konstantin Belousov
d474440ab3 Constify vm_pager-related virtual tables.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
4b8365d752 Add OBJT_SWAP_TMPFS pager
This is OBJT_SWAP pager, specialized for tmpfs.  Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways.  Replace (almost) all such places with
proper methods.

Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.

This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
0d2dfc6fed pagertab: use designated initializers
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
838adc533f Style enum obj_type
Put each type into dedicated line, which makes addition of new
types cleaner.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
a7c198a24b Implement vm_object_vnode() using vm_pager_getvp()
Allow vp_heldp argument to be NULL, in which case the returned vnode
is not held for tmpfs swap objects.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
1390a5cbeb Add pgo_freespace method
Makes the code in vm_object collapse/page_remove cleaner

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
192112b74f Add pgo_getvp method
This eliminates the staircase of conditions in vm_map_entry_set_vnode_text().

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
c23c555bc1 Add pgo_mightbedirty method
Used to implement vm_object_mightbedirty()

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
180bcaa46c vm_pager: add pgo_set_writeable_dirty method
specialized for swap and vnode pagers, and used to implement
vm_object_set_writeable_dirty().

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Konstantin Belousov
ee4211bca6 vm_pager: style some wrappers
Fill lines with the function definitions.
Use local var to shorten repeated extra-long expressions.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:02 +03:00
Konstantin Belousov
a0850dd057 swappagerops: slightly more style-compliant formatting
Remove excess spaces from comments.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:02 +03:00
Mark Johnston
9a7c2de364 realloc: Fix KASAN(9) shadow map updates
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA.  This value is "alloc".  Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map.  Do so using the correct size.

Reported by:	kp
Tested by:	kp
Sponsored by:	The FreeBSD Foundation
2021-05-05 17:12:51 -04:00
Alexander Motin
2760658b21 Improve UMA cache reclamation.
When estimating working set size, measure only allocation batches, not free
batches.  Allocation and free patterns can be very different.  For example,
ZFS on vm_lowmem event can free to UMA few gigabytes of memory in one call,
but it does not mean it will request the same amount back that fast too, in
fact it won't.

Update working set size on every reclamation call, shrinking caches faster
under pressure.  Lack of this caused repeating vm_lowmem events squeezing
more and more memory out of real consumers only to make it stuck in UMA
caches.  I saw ZFS drop ARC size in half before previous algorithm after
periodic WSS update decided to reclaim UMA caches.

Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom track longterm minimal cache size watermark, freeing some unused
items every UMA_TIMEOUT after first 15 minutes without cache misses. Freed
memory can get better use by other consumers.  For example, ZFS won't grow
its ARC unless it see free memory, since it does not know it is not really
used.  And even if memory is not really needed, periodic free during
inactivity periods should reduce its fragmentation.

Reviewed by:	markj, jeff (previous version)
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29790
2021-05-02 19:45:23 -04:00
Konstantin Belousov
ecfbddf0cd sysctl vm.objects: report backing object and swap use
For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object.  This allows userspace
to deconstruct the shadow chain.  Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field.  I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29771
2021-04-19 21:32:01 +03:00
Mark Johnston
aabe13f145 uma: Introduce per-domain reclamation functions
Make it possible to reclaim items from a specific NUMA domain.

- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations.  Use a counter instead of a flag to
  synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
  can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
  domain.

Currently the new KPIs are unused, so there should be no functional
change.

Reviewed by:	mav
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29685
2021-04-14 13:03:34 -04:00
Mark Johnston
54f421f9e8 uma: Split bucket_cache_drain() to permit per-domain reclamation
Note that the per-domain variant does not shrink the target bucket size.

No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-04-14 13:03:34 -04:00
Mark Johnston
2b914b85dd kmem: Add KASAN state transitions
Memory allocated with kmem_* is unmapped upon free, so KASAN doesn't
provide a lot of benefit, but since allocations are always a multiple of
the page size we can create a redzone when the allocation request size
is not a multiple of the page size.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29458
2021-04-13 17:42:21 -04:00
Mark Johnston
244f3ec642 kstack: Add KASAN state transitions
We allocate kernel stacks using a UMA cache zone.  Cache zones have
KASAN disabled by default, but in this case it makes sense to enable it.

Reviewed by:	andrew
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29457
2021-04-13 17:42:21 -04:00
Mark Johnston
09c8cb717d uma: Add KASAN state transitions
- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
  zone should not be sanitized.  This is applied implicitly for NOFREE
  and cache zones.
- Add KASAN call backs which get invoked:
  1) when a slab is imported into a keg
  2) when an item is allocated from a zone
  3) when an item is freed to a zone
  4) when a slab is freed back to the VM

  In state transitions 1 and 3, memory is poisoned so that accesses will
  trigger a panic.  In state transitions 2 and 4, memory is marked
  valid.
- Disable trashing if KASAN is enabled.  It just adds extra CPU overhead
  to catch problems that are detected by KASAN.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29456
2021-04-13 17:42:21 -04:00
Mark Johnston
982693bb72 vm_fault: Shoot down multiply mapped COW source page mappings
Reviewed by:	kib, rlibby
Discussed with:	alc
Approved by:	so
Security:	CVE-2021-29626
Security:	FreeBSD-SA-21:08.vm
2021-04-06 14:49:28 -04:00
Konstantin Belousov
89619b747b Add sysctl debug.uma_reclaim
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-04 20:39:06 +03:00
Konstantin Belousov
51a7be5f60 Style
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-04-04 20:39:06 +03:00
Jason A. Harmening
8dc8feb53d Clean up a couple of MD warts in vm_fault_populate():
--Eliminate a big ifdef that encompassed all currently-supported
architectures except mips and powerpc32.  This applied to the case
in which we've allocated a superpage but the pager-populated range
is insufficient for a superpage mapping.  For platforms that don't
support superpages the check should be inexpensive as we shouldn't
get a superpage in the first place.  Make the normal-page fallback
logic identical for all platforms and provide a simple implementation
of pmap_ps_enabled() for MIPS and Book-E/AIM32 powerpc.

--Apply the logic for handling pmap_enter() failure if a superpage
mapping can't be supported due to additional protection policy.
Use KERN_PROTECTION_FAILURE instead of KERN_FAILURE for this case,
and note Intel PKU on amd64 as the first example of such protection
policy.

Reviewed by:	kib, markj, bdragon
Differential Revision:	https://reviews.freebsd.org/D29439
2021-03-30 18:15:55 -07:00
Konstantin Belousov
c7b913aa47 vm_fault: handle KERN_PROTECTION_FAILURE
pmap_enter(PMAP_ENTER_LARGEPAGE) may return KERN_PROTECTION_FAILURE due to
PKRU inconsistency.  Handle it in the call place from vm_fault_populate(),
and in places which decode errors from vm_fault_populate()/
vm_fault_allocate().

Reviewed by:	jah, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29442
2021-03-27 20:16:27 +02:00
Bryan Drewery
a771bf748f Remove unused obj variable missed in r354870.
Sponsored by:	Dell EMC
2021-03-17 15:29:15 -07:00
Kristof Provost
b8f7267d49 uma: allow uma_zfree_pcu(..., NULL)
We already allow free(NULL) and uma_zfree(..., NULL). Make
uma_zfree_pcpu(..., NULL) work as well.
This also means that counter_u64_free(NULL) will work.

These make cleanup code simpler.

MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29189
2021-03-12 12:12:35 +01:00
Mark Johnston
968079f253 vm_reserv: Fix list locking in vm_reserv_reclaim_contig()
The per-domain partpop queue is locked by the combination of the
per-domain lock and individual reservation mutexes.
vm_reserv_reclaim_contig() scans the queue looking for partially
populated reservations that can be reclaimed in order to satisfy the
caller's allocation.

During the scan, we drop the per-domain lock.  At this point, the rvn
pointer may be invalidated.  Take care to load rvn after re-acquiring
the per-domain lock.

While here, simplify the condition used to check whether a reservation
was dequeued while the per-domain lock was dropped.

Reviewed by:	alc, kib
Reported by:	gallatin
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29203
2021-03-11 10:35:35 -05:00
Mark Johnston
0401989282 vm: Round up npages and alignment for contig reclamation
When searching for runs to reclaim, we need to ensure that the entire
run will be added to the buddy allocator as a single unit.  Otherwise,
it will not be visible to vm_phys_alloc_contig() as it is currently
implemented.  This is a problem for allocation requests that are not a
power of 2 in size, as with 9KB jumbo mbuf clusters.

Reported by:	alc
Reviewed by:	alc
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28924
2021-03-02 10:21:02 -05:00
Max Laier
14b5a3c7d5 vm pqbatch: move unmanaged page assert under pagequeue lock
This KASSERT is overzealous because of the following race condition:
 1) A managed page which is currently in PQ_LAUNDRY is freed.
    vm_page_free_prep calls vm_page_dequeue_deferred()

    The page state is:
       PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED

 2) The laundry worker comes around and pick up the page and calls
    vm_pageout_defer(m, PQ_LAUNDRY, true) to check if page is still in the
    queue.  We do a vm_page_astate_load and get
       PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
    as per above.

 3) The laundry worker is pre-empted and another thread allocates our page
    from the free pool.  For example vm_page_alloc_domain_after calls
    vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for
    an OBJT_UNMANAGED object.

    The page state is:
       PQ_NONE, 0 - VPO_UNMANAGED

 4) The laundry worker resumes, and processes vm_pageout_defer based on the
    stale astate which leads to a call to vm_page_pqbatch_submit, which will
    trip on the KASSERT.

Submitted by:	mlaier
Reviewed by:	markj, rlibby
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D28563
2021-02-24 15:56:16 -08:00
Mark Johnston
537f92cd35 uma: Update the comment above startup_alloc() to reflect reality
The scheme used for early slab allocations changed in commit a81c400e75.

Reported by:	alc
Reviewed by:	alc
MFC after:	1 week
2021-02-22 18:22:51 -05:00
Mark Johnston
23e875fd97 vm_kern: Avoid sign extension in the KVA_QUANTUM definition
Otherwise, on a powerpc64 NUMA system with hashed page tables, the
first-level superpage reservation size is large enough that the value of
the kernel KVA arena import quantum, KVA_NUMA_IMPORT_QUANTUM, is
negative and gets sign-extended when passed to vmem_set_import().  This
results in a boot-time hang on such platforms.

Reported by:	bdragon
MFC after:	3 days
2021-02-22 15:50:09 -05:00
Alex Richardson
fa2528ac64 Use atomic loads/stores when updating td->td_state
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.

Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.

Reported by:	GENERIC-KCSAN
Test Plan:	KCSAN no longer complains, kernel still runs fine.
Reviewed By:	markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
2021-02-18 14:02:48 +00:00
John Baldwin
67932460c7 Add a VA_IS_CLEANMAP() macro.
This macro returns true if a provided virtual address is contained
in the kernel's clean submap.

In CHERI kernels, the buffer cache and transient I/O map are allocated
as separate regions.  Abstracting this check reduces the diff relative
to FreeBSD.  It is perhaps slightly more readable as well.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D28710
2021-02-17 16:32:11 -08:00
Mark Johnston
5c18744ea9 vm: Honour the "noreuse" flag to vm_page_unwire_managed()
This flag indicates that the page should be enqueued near the head of
the inactive queue, skipping the LRU queue.  It is used when unwiring
pages from the buffer cache following direct I/O or after I/O when
POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when
sendfile(SF_NOCACHE) completes.  For the direct I/O and sendfile cases
we only enqueue the page if we decide not to free it, typically because
it's mapped.

Pass "noreuse" through to vm_page_release_toq() so that we actually
honour the desired LRU policy for these scenarios.

Reported by:	bdrewery
Reviewed by:	alc, kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28555
2021-02-10 11:10:27 -05:00
Ryan Stone
660344ca44 Add a VM flag to prevent reclaim on a failed contig allocation
If a M_WAITOK contig alloc fails, the VM subsystem will try to
reclaim contiguous memory twice before actually failing the
request.  On a system with 64GB of RAM I've observed this take
400-500ms before it finally gives up, and I believe that this
will only be worse on systems with even more memory.

In certain contexts this delay is extremely harmful, so add a flag
that will skip reclaim for allocation requests to allow those
paths to opt-out of doing an expensive reclaim.

Sponsored by: Dell Inc
Differential Revision:	https://reviews.freebsd.org/D28422
Reviewed by: markj, kib
2021-02-03 16:16:51 -05:00
Brooks Davis
7a1591c1b6 Rename kern_mmap_req to kern_mmap
Replace all uses of kern_mmap with kern_mmap_req move the old kern_mmap.
Reand rename kern_mmap_req to kern_mmap                                .

The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.

Obtained from:	CheriBSD
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28292
2021-01-25 21:50:37 +00:00
Konstantin Belousov
420d4be3e4 vm_map_protect(): remove not needed recalculations of new_prot, new_maxprot
Requested by:	alc
Sponsored by:	The FreeBSD Foundation
2021-01-14 10:02:43 +02:00
Konstantin Belousov
0659df6fad vm_map_protect: allow to set prot and max_prot in one go.
This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome.  E.g. you can get an error that is not
possible if operation is performed atomic.

Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.

Reviewed by:	brooks, markj (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28117
2021-01-13 01:35:22 +02:00
Konstantin Belousov
9402bb44f1 vmspace_fork: preserve wx settings in the child vm map after fork
Noted by:	markj
Sponsored by:	The FreeBSD Foundation
2021-01-12 08:09:59 +02:00
Konstantin Belousov
2e1c94aa1f Implement enforcing write XOR execute mapping policy.
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28050
2021-01-12 01:15:43 +02:00
Mark Johnston
663de81f85 uma: Avoid unmapping direct-mapped slabs
startup_alloc() uses pmap_map() to map slabs used for bootstrapping the
VM.  pmap_map() may ignore the hint address and simply return a range
from the direct map.  In this case we must not unmap the range in
startup_free().

UMA uses bootstart and bootmem to track the range of KVA into which
slabs are mapped if the direct map is not used.  Unmap a startup slab
only if it was mapped into that range.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27885
2021-01-03 11:50:31 -05:00
Ryan Libby
942951ba46 uma dbg: catch more corruption with atomics
Use atomic testandset and testandclear to catch concurrent double free,
and to reduce the number of atomic operations.

Submitted by:	jeff
Reviewed by:	cem, kib, markj (all previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22703
2020-12-31 13:02:45 -08:00
Mark Johnston
81846def34 vm: Fix some bugs in the page busying code
In vm_page_busy_acquire(), load the object pointer using
atomic_load_ptr() as we do elsewhere.  Per the comment, the object
identity must be consistent across sleeps.

In vm_page_grab_sleep(), pass the correct pindex to
_vm_page_busy_sleep().  The pindex is used to re-check the page's
identity before going to sleep.  In particular, vm_page_grab_sleep() is
used in unlocked grab, so the object lock is not necessarily held when
verifying the page's identity, and the pindex may change if the page is
moved, or freed and re-allocated.  I believe this can result in spurious
VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination
of vm_page_grab_pages_unlocked().

In vm_page_grab_pages(), pass the correct pindex to
vm_page_grab_sleep().  Otherwise I believe vm_page_grab_pages() will
effectively spin when attempting to busy a busy page after the first
index in the range.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27607
2020-12-27 17:01:44 -05:00
Mark Johnston
d2f1c44bc9 uma: Remove the MINBUCKET flag from the flag name list
This should have been done in r368399 / commit
f8b6c51538.

Reported by:	rlibby
Sponsored by:	The FreeBSD Foundation
2020-12-27 17:01:33 -05:00
Bryan Drewery
5fee468e83 Revert r368523 which fixed contig allocs waiting forever.
This needs to account for empty NUMA domains or domains which do not
satisfy the requested range.

Discussed with:	markj
2020-12-15 19:38:16 +00:00
Bryan Drewery
bbfec1633b contig allocs: Don't retry forever on M_WAITOK.
This restores behavior from before domain iterators were added in
r327895 and r327896.

The vm_domainset_iter_policy() will do a vm_wait_doms() and then
restart its iterator when M_WAITOK is set.  It will also force
the containing loop to have M_NOWAIT.  So we get an unbounded
retry loop rather than the intended bounded retries that
kmem_alloc_contig_pages() already handles.

This also restores M_WAITOK to the vmem_alloc() call in
kmem_alloc_attr_domain() and kmem_alloc_contig_domain().

Reviewed by:	markj, kib
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D27507
2020-12-10 20:44:29 +00:00
Mark Johnston
e574d407ae uma: Make uma_zone_set_maxcache() work better with small limits
The old implementation chose the largest bucket zone such that if the
per-CPU caches are fully populated, the total number of items cached is
no larger than the specified limit.  If no such zone existed, UMA would
not do any caching.

We can now use uz_bucket_size_max to set a precise limit on the number
of items in a zone's bucket, so the total size of per-CPU caches can be
bounded more easily.  Implement a new policy in uma_zone_set_maxcache():
choose a bucket size such that up to half of the limit can be cached in
per-CPU caches, with the rest going to the full bucket cache.  This
fixes a problem with the kstack_cache zone: the limit of 4 * mp_ncpus
items meant that the zone would not do any caching, defeating the whole
purpose of the zone.  That's because the smallest bucket size holds up
to 2 items and we may cache up to 3 full buckets per CPU, and
2 * 3 * mp_ncpus > 4 * mp_ncpus.

Reported by:	mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27168
2020-12-06 22:45:50 +00:00
Mark Johnston
f8b6c51538 uma: Enforce the use of uz_bucket_size_max in the free path
uz_bucket_size_max is the maximum permitted bucket size.  When filling a
new bucket to satisfy uma_zalloc(), the bucket is populated with at most
uz_bucket_size_max items.  The maximum number of entries in the bucket
may be larger.  When freeing items, however, we will fill per-CPPU
buckets up to their maximum number of entries, potentially exceeding
uz_bucket_size_max.  This makes it difficult to precisely limit the
number of items that may be cached in a zone.  For example, if one wants
to limit buckets to 1 entry for a particular zone, that's not possible
since the smallest bucket holds up to 2 entries.

Try to solve the problem by using uz_bucket_size_max to limit the number
of entries in a bucket.  Note that the ub_entries field is initialized
upon every bucket allocation.  Most zones are not affected since they do
not impose any specific limit on the maximum bucket size.

While here, remove the UMA_ZONE_MINBUCKET flag.  It was unused and we
now have uma_zone_set_maxcache() to control the zone's cache size more
precisely.

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27167
2020-12-06 22:45:39 +00:00
Mark Johnston
8a6776ca0f uma: Use atomic load for uz_sleepers
This field is updated locklessly.

Sponsored by:	The FreeBSD Foundation
2020-12-06 22:45:22 +00:00
Mark Johnston
991f23ef20 uma: Avoid allocating buckets with the cross-domain lock held
Allocation of a bucket can trigger a cross-domain free in the bucket
zone, e.g., if the per-CPU alloc bucket is empty, we free it and get
migrated to a remote domain.  This can lead to deadlocks since a bucket
zone may allocate buckets from itself or a pair of bucket zones could be
allocating from each other.

Fix the problem by dropping the cross-domain lock before allocating a
new bucket and handling refill races.  Use a list of empty buckets to
ensure that we can make forward progress.

Reported by:	imp, mjg (witness(9) warnings)
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27341
2020-11-30 16:18:33 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Mark Johnston
1fea4b25c9 Wrap a long line in vm_pqbatch_process_page() 2020-11-19 15:41:42 +00:00
Mark Johnston
9e3e737608 Micro-optimize vm_page_pqbatch_submit()
Avoid calling vm_page_domain() twice.

Discussed with:	alc (in D27207)
2020-11-19 15:40:58 +00:00
Mark Johnston
431fb8abd7 vm_phys: Try to clean up NUMA KPIs
It can useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures.  We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain().  Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers.  Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.

Reviewed by:	alc
Reviewed by:	dougm, kib (earlier versions)
Differential Revision:	https://reviews.freebsd.org/D27207
2020-11-19 03:59:21 +00:00
Mark Johnston
20f02659d6 vm_map: Handle kernel map entry allocator recursion
On platforms without a direct map[*], vm_map_insert() may in rare
situations need to allocate a kernel map entry in order to allocate
kernel map entries.  This poses a problem similar to the one solved for
vmem boundary tags by vmem_bt_alloc().  In fact the kernel map case is a
bit more complicated since we must allocate entries with the kernel map
locked, whereas vmem can recurse into itself because boundary tags are
allocated up-front.

The solution is to add a custom slab allocator for kmapentzone which
allocates KVA directly from kernel_map, bypassing the kmem_* layer.
This avoids mutual recursion with the vmem btag allocator.  Then, when
vm_map_insert() allocates a new kernel map entry, it avoids triggering
allocation of a new slab with M_NOVM until after the insertion is
complete.  Instead, vm_map_insert() allocates from the reserve and sets
a flag in kernel_map to trigger re-population of the reserve just before
the map is unlocked.  This places an implicit upper bound on the number
of kernel map entries that may be allocated before the kernel map lock
is released, but in general a bound of 1 suffices.

[*] This also comes up on amd64 with UMA_MD_SMALL_ALLOC undefined, a
configuration required by some kernel sanitizers.

Discussed with:	kib, rlibby
Reported by:	andrew
Tested by:	pho (i386 and amd64 with !UMA_MD_SMALL_ALLOC)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26851
2020-11-11 17:16:39 +00:00
Jonathan T. Looney
7b516613aa When destroying a UMA zone which has a reserve (set with
uma_zone_reserve()), messages like the following appear on the console:
"Freed UMA keg (Test zone) was not empty (0 items). Lost 528 pages of
memory."

When keg_drain_domain() is draining the zone, it tries to keep the number
of items specified in the reservation. However, when we are destroying the
UMA zone, we do not need to keep those items. Therefore, when destroying a
non-secondary and non-cache zone, we should reset the keg reservation to 0
prior to draining the zone.

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27129
2020-11-10 18:12:09 +00:00
Mateusz Guzik
3a440a421d Add more per-cpu zones.
This covers powers of 2 up to 64.

Example pending user is ZFS.
2020-11-09 00:34:23 +00:00
Leandro Lupori
e2d6c417e3 Implement superpages for PowerPC64 (HPT)
This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.

The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.

While these changes are not better tested, superpages support is disabled by
default. To enable it, use vm.pmap.superpages_enabled=1.

In this initial implementation, when superpages are disabled, system
performance stays at the same level as without these changes. When
superpages are enabled, buildworld time increases a bit (~2%). However,
for workloads that put a heavy pressure on the TLB the performance boost
is much bigger (see HPC Challenge and pgbench on D25237).

Reviewed by:	jhibbits
Sponsored by:	Eldorado Research Institute (eldorado.org.br)
Differential Revision:	https://reviews.freebsd.org/D25237
2020-11-06 14:12:45 +00:00
Mateusz Guzik
2dee296a3d Rationalize per-cpu zones.
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).

Follow malloc by naming them "pcpu-" + size in bytes.

This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
2020-11-05 15:08:56 +00:00
Mark Johnston
f7db0c9532 vmspace: Convert to refcount(9)
This is mostly mechanical except for vmspace_exit().  There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace.  In that case, upon switching to
vmspace0 we can unconditionally release the reference.

Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.

Reviewed by:	alc, kib, mmel
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27057
2020-11-04 16:30:56 +00:00
Alan Cox
ccfd886a1b Conditionally compile struct vm_phys_seg's md_first field. This field is
only used by arm64's pmap.

Reviewed by:	kib, markj, scottph
Differential Revision:	https://reviews.freebsd.org/D26907
2020-10-23 06:24:38 +00:00
Ed Maste
575a4437a9 uma: fix KTR message after r366840
Reported by:	bz
Sponsored by:	The FreeBSD Foundation
2020-10-19 18:54:44 +00:00
Mark Johnston
f09cbea31a uma: Respect uk_reserve in keg_drain()
When a reserve of free items is configured for a zone, the reserve must
not be reclaimed under memory pressure.  Modify keg_drain() to simply
respect the reserved pool.

While here remove an always-false uk_freef == NULL check (kegs that
shouldn't be drained should set _NOFREE instead), and make sure that the
keg_drain() KTR statement does not reference an uninitialized variable.

Reviewed by:	alc, rlibby
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26772
2020-10-19 16:57:40 +00:00
Mark Johnston
1b2dcc8c54 uma: Avoid depleting keg reserves when filling a bucket
zone_import() fetches a free or partially free slab from the keg and
then uses its items to populate an array, typically filling a bucket.
If a single allocation causes the keg to drop below its minimum reserve,
the inner loop ends.  However, if the bucket is still not full and
M_USE_RESERVE is specified, the outer loop will continue to fetch items
from the keg.

If M_USE_RESERVE is specified and the number of free items is below the
reserved limit, we should return only a single item.  Otherwise, if the
bucket size is larger than the reserve, all of the reserved items may
end up in a single per-CPU bucket, invisible to other CPUs.

Reviewed by:	rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26771
2020-10-19 16:55:03 +00:00
Konstantin Belousov
6f3b523c9a Avoid dump_avail[] redefinition.
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h.  This fixes default gcc build for mips.

Reviewed by:	alc, scottph
Tested by:	kevans (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26741
2020-10-14 22:51:40 +00:00
Bryan Drewery
c2c6fb90e0 Use unlocked page lookup for inmem() to avoid object lock contention
Reviewed By:	kib, markj
Submitted by:	mlaier
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D26653
2020-10-09 23:49:42 +00:00
Konstantin Belousov
42f96162c3 vm_page_dump_index_to_pa(): Add braces to the expression involving + and &.
The precedence of the '&' operator is less than of '+'.  Added braces
do change the order of evaluation into the natural one, in my opinion.
On the other hand, the value of the expression should not change since
all elements should have page-aligned values.

This fixes a gcc warning reported.

Reported by:	adrian
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-08 22:46:15 +00:00
Mark Johnston
2913cc4637 vm_pageout: Avoid rounding down the inactive scan target
With helper page daemon threads, enabled by default in r364786, we
divide the inactive target by the number of threads, rounding down, and
sum the total number of pages freed by the threads.  This sum is
compared with the original target, but by rounding down we might lose
pages, causing the page daemon control loop to conclude that inactive
queue scanning isn't keeping up with demand for free pages.  Typically
this results in excessive swapping.

Fix the problem by accounting for the error in the main pagedaemon
thread's target.  Note that by default the problem will manifest only in
systems with >16 CPUs in a NUMA domain.

Reviewed by:	cem
Discussed with:	dougm
Reported and tested by:	dhw, glebius
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26610
2020-10-02 19:16:06 +00:00
Mark Johnston
06d8bdcbf7 uma: Use the bucket cache for cross-domain allocations
uma_zalloc_domain() allocates from the requested domain instead of
following a first-touch policy (the default for most zones).  Currently
it is only used by malloc_domainset(), and consumers free returned items
with free(9) since r363834.

Previously uma_zalloc_domain() worked by always going to the keg for an
item.  As a result, the use of UMA zone caches was unbalanced: we free
items to the caches, but always allocate from the keg, skipping the
caches.

Make some effort to allocate from the UMA caches when performing a
cross-domain allocation.  This avoids blowing up the caches when
something is performing many transient allocations with
malloc_domainset().

Reported and tested by:	dhw, glebius
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26427
2020-10-02 19:04:29 +00:00
Mark Johnston
5afdf5c1ca uma: Use LIFO for non-SMR bucket caches
When SMR was introduced, zone_put_bucket() was changed to always place
full buckets at the end of the queue.  However, it is generally
preferable to use recently used buckets since their items are more
likely to be resident in cache.  So, for buckets that have no constraint
on item reuse, use a last-in-first-out ordering as we did before.

Reviewed by:	rlibby
Tested by:	dhw, glebius
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26426
2020-10-02 19:04:09 +00:00
Mark Johnston
952c8964ba uma: Remove newlines from panic messages
Sponsored by:	The FreeBSD Foundation
2020-10-02 19:03:42 +00:00
Mark Johnston
f31695cc64 Implement sparse core dumps
Currently we allocate and map zero-filled anonymous pages when dumping
core.  This can result in lots of needless disk I/O and page
allocations.  This change tries to make the core dumper more clever and
represent unbacked ranges of virtual memory by holes in the core dump
file.

Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to
clean up and return an error when it would otherwise map a zero-filled
page.  Then, in the core dumper code, prefault all user pages and handle
errors by simply extending the size of the core file.  This also fixes a
bug related to the fact that vn_io_fault1() does not attempt partial I/O
in the face of errors from vm_fault_quick_hold_pages(): if a truncated
file is mapped into a user process, an attempt to dump beyond the end of
the file results in an error, but this means that valid pages
immediately preceding the end of the file might not have been dumped
either.

The change reduces the core dump size of trivial programs by a factor of
ten simply by excluding unaccessed libc.so pages.

PR:		249067
Reviewed by:	kib
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26590
2020-10-02 17:50:22 +00:00
Mark Johnston
114484b7ec Flag vm_reserv and vm_phys sysctls as MPSAFE.
Nothing in these subsystems relies on Giant.

MFC after:	1 week
2020-09-23 19:36:07 +00:00
Mark Johnston
78257765f2 Add a vmparam.h constant indicating pmap support for large pages.
Enable SHM_LARGEPAGE support on arm64.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26467
2020-09-23 19:34:21 +00:00
D Scott Phillips
de03184698 arm64/pmap: Sparsify pv_table
Reviewed by:	markj, kib
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26132
2020-09-21 22:23:57 +00:00
D Scott Phillips
7988971a99 vm_reserv: Sparsify the vm_reserv_array when VM_PHYSSEG_SPARSE
On an Ampere Altra system, the physical memory is populated
sparsely within the physical address space, with only about 0.4%
of physical addresses backed by RAM in the range [0, last_pa].

This is causing the vm_reserv_array to be over-sized by a few
orders of magnitude, wasting roughly 5 GiB on a system with
256 GiB of RAM.

The sparse allocation of vm_reserv_array is controlled by defining
VM_PHYSSEG_SPARSE, with the dense allocation still remaining for
platforms with VM_PHYSSEG_DENSE.

Reviewed by:	markj, alc, kib
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26130
2020-09-21 22:22:53 +00:00
D Scott Phillips
00e6614750 Sparsify the vm_page_dump bitmap
On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 Mib to > 2 Gib
(and overflowing `int` for the size).

Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.

Reviewed by:	jhb
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26131
2020-09-21 22:21:59 +00:00
D Scott Phillips
ab041f713a Move vm_page_dump bitset array definition to MI code
These definitions were repeated by all architectures, with small
variations. Consolidate the common definitons in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.

Reviewed by:	kib, markj
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26129
2020-09-21 22:20:37 +00:00
Eric van Gyzen
f9cc8410e1 vm_ooffset_t is now unsigned
vm_ooffset_t is now unsigned. Remove some tests for negative values,
or make other adjustments accordingly.

Reported by:	Coverity
Reviewed by:	kib markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26214
2020-09-18 16:48:08 +00:00
Mark Johnston
97458520cc Increase the default vm.max_user_wired value.
Since r347532 (merged to stable/12) we only count user-wired pages
towards the system limit.  However, we now also treat pages wired by
hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with
large amounts of RAM tends to fail due to the low limit.

The purpose of the limit is to provide a seatbelt, not to impose some
policy on the use of wired memory.  Thus, increase the default limit to
allow reasonable VM configurations to work without tuning.

Reviewed by:	kib
Discussed with:	dougm
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26424
2020-09-17 16:49:28 +00:00
Konstantin Belousov
d301b3580f Support for userspace non-transparent superpages (largepages).
Created with shm_open2(SHM_LARGEPAGE) and then configured with
FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee
that all userspace mappings of it are served by superpage non-managed
mappings.

Only amd64 for now, both 2M and 1G superpages can be requested, the
later requires CPU feature.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 22:12:51 +00:00
Konstantin Belousov
e2e80fb3de vm_map: Add a map entry kind that can only be clipped at specific boundary.
The entries and their clip boundaries must be aligned on supported
superpages sizes from pagesizes[].  vm_map operations return Mach
error KERN_INVALID_ARGUMENT, which is usually translated to EINVAL, if
it would require clip not at the boundary.

In other words, entries force preserving virtual addresses superpage
properties.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 22:02:30 +00:00
Konstantin Belousov
6cadbcd203 Add pmap_enter(9) PMAP_ENTER_LARGEPAGE flag and implement it on amd64.
The flag requests entry of non-managed superpage mapping of size
pagesizes[psind] into the page table.

Pmap supports fake wiring of the largepage mappings.  Only attributes
of the largepage mapping can be changed by calling pmap_enter(9) over
existing mapping, physical address of the page must be unchanged.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:50:24 +00:00
Konstantin Belousov
7a9f2da33c Add vm_map_find_aligned(9).
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:44:59 +00:00
Konstantin Belousov
60cd9c95c5 Move MAP_32BIT_MAX_ADDR definition to sys/mman.h.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:39:06 +00:00
Konstantin Belousov
e8f77c204b Prepare to handle non-trivial errors from vm_map_delete().
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:34:31 +00:00
Konstantin Belousov
a720b31c2a Allow consumer to customize physical pager.
Add support for user-supplied callbacks into phys pager operations,
providing custom getpages(), haspage(), and populate() methods
implementations.  Pager stores user data ptr/val in the object to
provide context.

Add phys_pager_allocate() helper that takes user ops table as one of
the arguments.

Current code for these methods is moved to the 'default' ops table,
assigned automatically when vm_pager_alloc() is used.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 00:00:43 +00:00
Konstantin Belousov
67a659d282 Add kern_mmap_racct_check(), a helper to verify limits in vm_mmap*().
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-08 23:48:19 +00:00
Konstantin Belousov
89d2fb14d5 Add interruptible variant of vm_wait(9), vm_wait_intr(9).
Also add msleep flags argument to vm_wait_doms(9).

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-08 23:28:09 +00:00
Mark Johnston
aec9e7d8b0 vm_object_split(): Handle orig_object type changes.
orig_object->type can change from OBJT_DEFAULT to OBJT_SWAP while
vm_object_split() is sleeping.  In this case some pages in new_object
may be left unbusied, but vm_object_split() attempts to unbusy all of
them.

Track the beginning of the busied range.  Add an assertion to verify
that pages are not re-added to the source object while sleeping.

Reported by:	Olympios Petrakis <olympios.petrakis@netapp.com>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26223
2020-09-07 23:28:33 +00:00
Mark Johnston
a2d704d19f Avoid unnecessary object locking in vm_page_grab_pages_unlocked().
We were needlessly acquiring the object lock to call
vm_page_grab_pages() even when all of the requested pages were looked up
locklessly.  Fix that, stop testing for count == 0 in
vm_page_grab_pages(), and add assertions to help catch this kind of
mistake.

Reported by:	cem
Reviewed by:	alc, cem, dougm, jeff
Differential Revision:	https://reviews.freebsd.org/D26304
2020-09-02 19:59:25 +00:00
Mark Johnston
847ab36bf2 Include the psind in data returned by mincore(2).
Currently we use a single bit to indicate whether the virtual page is
part of a superpage.  To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.

The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl.  MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.

For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26238
2020-09-02 18:16:43 +00:00
Mateusz Guzik
c3aa3bf97c vm: clean up empty lines in .c and .h files 2020-09-01 21:20:45 +00:00
Vladimir Kondratyev
5d4bf0578f LinuxKPI: Implement ksize() function.
In Linux, ksize() gets the actual amount of memory allocated for a given
object. This commit adds malloc_usable_size() to FreeBSD KPI which does
the same. It also maps LinuxKPI ksize() to newly created function.

ksize() function is used by drm-kmod.

Reviewed by:	hselasky, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D26215
2020-08-29 19:26:31 +00:00
Eric van Gyzen
609de97e04 vm_pageout_scan_active: ensure ps_delta is initialized
Reported by:	Coverity
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26212
2020-08-28 19:59:02 +00:00
Eric van Gyzen
a2e194654f memstat_kvm_uma: fix reading of uma_zone_domain structures
Coverity flagged the scaling by sizeof(uzd).  That is the type
of the pointer, so the scaling was already done by pointer arithmetic.
However, this was also passing a stack frame pointer to kvm_read,
so it was doubly wrong.

Move ZDOM_GET into the !_KERNEL section and use it in libmemstat.

Reported by:	Coverity
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26213
2020-08-28 19:50:40 +00:00
Mark Johnston
aea9103e06 Use a large kmem arena import size on NUMA systems.
This helps minimize internal fragmentation that occurs when 2MB imports
are interleaved across NUMA domains.  Virtually all KVA allocations on
direct map platforms consume more than one page, so the fragmentation
manifests as runs of 511 4KB page mappings in the kernel.

Reviewed by:	alc, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26050
2020-08-26 14:31:48 +00:00
Conrad Meyer
74f5530d7a vm_pageout: Scale worker threads with CPUs
Autoscale vm_pageout worker threads from r364129 with CPU count.  The
default is arbitrarily chosen to be 16 CPUs per worker thread, but can
be adjusted with the vm.pageout_cpus_per_thread tunable.

There will never be less than 1 thread per populated NUMA domain, and
the previous arbitrary upper limit (at most ncpus/2 threads per NUMA
domain) is preserved.

Care is taken to gracefully handle asymmetric NUMA nodes, such as empty
node systems (e.g., AMD 2990WX) and systems with nodes of varying size
(e.g., some larger >20 core Intel Haswell/Broadwell Xeon).

Reviewed by:	kib, markj
Sponsored by:	Isilon
Differential Revision:	https://reviews.freebsd.org/D26152
2020-08-25 21:36:56 +00:00
Mark Johnston
411096d034 Permit vm_page_wire() to be called on pages not belonging to an object.
For such pages ref_count is effectively a consumer-managed field, but
there is no harm in calling vm_page_wire() on them.
vm_page_unwire_noq() handles them as well.  Relax the vm_page_wire()
assertions to permit this case which is triggered by some out-of-tree
code. [1]

Also guard a conditional assertion with INVARIANTS.  Otherwise the
conditions are evaluated even though the result is unused. [2]

Reported by:	bz, cem [1], kib [2]
Reviewed by:	dougm, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26173
2020-08-25 13:45:06 +00:00
Matt Macy
9e5787d228 Merge OpenZFS support in to HEAD.
The primary benefit is maintaining a completely shared
code base with the community allowing FreeBSD to receive
new features sooner and with less effort.

I would advise against doing 'zpool upgrade'
or creating indispensable pools using new
features until this change has had a month+
to soak.

Work on merging FreeBSD support in to what was
at the time "ZFS on Linux" began in August 2018.
I first publicly proposed transitioning FreeBSD
to (new) OpenZFS on December 18th, 2018. FreeBSD
support in OpenZFS was finally completed in December
2019. A CFT for downstreaming OpenZFS support in
to FreeBSD was first issued on July 8th. All issues
that were reported have been addressed or, for
a couple of less critical matters there are
pull requests in progress with OpenZFS. iXsystems
has tested and dogfooded extensively internally.
The TrueNAS 12 release is based on OpenZFS with
some additional features that have not yet made
it upstream.

Improvements include:
  project quotas, encrypted datasets,
  allocation classes, vectorized raidz,
  vectorized checksums, various command line
  improvements, zstd compression.

Thanks to those who have helped along the way:
Ryan Moeller, Allan Jude, Zack Welch, and many
others.

Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D25872
2020-08-25 02:21:27 +00:00