In vm_phys_alloc_seg_contig, in allocating multiple memory blocks for
a huge allocation, ensure that the end of the allocated range does not
exceed the upper segment limit.
Reorder a couple of checks to improve code layout.
Reviewed by: alc
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33870
The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack. This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings. In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap, so small
(< several MB) limits could not be used.
There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose. Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.
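For reference, consumers have long located ps_strings via sysctl(3)
rather than a hard-coded address. A minimal userland sketch using the
long-standing KERN_PS_STRINGS MIB (illustrative only):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <err.h>
    #include <stdio.h>

    int
    main(void)
    {
            unsigned long psstrings;
            size_t len = sizeof(psstrings);
            int mib[2] = { CTL_KERN, KERN_PS_STRINGS };

            /* The kernel reports the (now randomized) location at runtime. */
            if (sysctl(mib, 2, &psstrings, &len, NULL, 0) == -1)
                    err(1, "sysctl(KERN_PS_STRINGS)");
            printf("ps_strings at %#lx\n", psstrings);
            return (0);
    }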
The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.
PR: 260303
Reviewed by: kib
Discussed with: emaste, mw
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
Buckets in an SMR-enabled zone can legitimately be tagged with
SMR_SEQ_INVALID. This effectively means that the zone destructor (if
any) was invoked on all items in the bucket, and the contained memory is
safe to reuse. If the first bucket in the full bucket list was tagged
this way, UMA would unnecessarily poll per-CPU state before attempting
to fetch a full bucket from the list.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Previously we'd always print "out of swap space." This can be
misleading, as there are other reasons an OOM kill can be triggered. In
particular, it's entirely possible to trigger an OOM kill on a system
with plenty of free swap space.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33810
Missed issues in truss on at least armv7 and powerpcspe need to be
resolved before recommit.
This reverts commit 3889fb8af0.
This reverts commit 1544e0f5d1.
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS), it is necessary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).
Obtained from: CheriBSD
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780
vm_reserv.c uses its own bitstring implementation for popmaps. Using
the bitstring_t type from a standard header eliminates the code
duplication, allows some bit-at-a-time operations to be replaced with
more efficient bitstring range operations, and, in
vm_reserv_test_contig, allows bit_ffc_area_at to more efficiently
search for a big-enough set of consecutive zero-bits.
Make bitstring changes improve the vm_reserv code. Define a bit_ntest
method to test whether a range of bits is all set, or all clear.
Define bit_ff_at and bit_ff_area_at to implement the ffs and ffc
versions with a parameter to choose between set- and clear- bits.
Improve the area_at implementation. Modify the bit_nset and
bit_nclear implementations to allow code optimization in the cases
when start or end are multiples of _BITSTR_BITS.
Add a few new cases to bitstring_test.
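A sketch of the new primitives in use (signatures assumed from the
description above and sys/bitstring.h):

    #include <sys/param.h>
    #include <sys/bitstring.h>

    bitstr_t bit_decl(popmap, 512);
    int first;
    bool all_clear;

    bit_nset(popmap, 0, 511);        /* mark every bit set */
    bit_nclear(popmap, 100, 163);    /* clear a 64-bit run */

    /* Find the first run of >= 64 clear bits; expect first == 100. */
    bit_ffc_area_at(popmap, 0, 512, 64, &first);

    /* Test whether bits 100..163 are all clear; expect true. */
    all_clear = bit_ntest(popmap, 100, 163, 0);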
Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D33312
Remove always-false checks for UMA zone creation failure. No functional
change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33809
With INVARIANTS defined, have vm_addr_align_ok and vm_addr_bound_ok
panic when passed an alignment/boundary parameter that is not a power
of two.
Reviewed by: alc
Suggested by: kib, se
Differential Revision: https://reviews.freebsd.org/D33725
The arch required two-page alignment due to a single TLB entry caching
two consecutive mappings.
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D33763
Move an assignment back to where it was before, to turn the
defined-but-not-used error back into a set-but-not-used warning.
Fixes: 01e115ab83 vm_phys: #include vm_extern
Arm64 and powerpc don't include vm_extern.h indirectly in vm_phys.c, which
means that for the sake of those architectures, it must be included explicitly.
Also, fix a set-but-not-used warning that Jenkins also found.
Reported by: Jenkins
Fixes: c606ab59e7 vm_extern: use standard address checkers everywhere
Define simple functions for alignment and boundary checks and use them
everywhere instead of having slightly different implementations
scattered about. Define them in vm_extern.h and use them where
possible where vm_extern.h is included.
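A sketch of how such checkers can be written (the real definitions live
in vm_extern.h; the boundary test below is one conventional way to
express "does not cross a boundary"):

    static inline bool
    vm_addr_align_ok(vm_paddr_t addr, u_long alignment)
    {
            /* alignment is expected to be a power of two. */
            return ((addr & (alignment - 1)) == 0);
    }

    static inline bool
    vm_addr_bound_ok(vm_paddr_t addr, vm_paddr_t size, vm_paddr_t boundary)
    {
            /* True when [addr, addr + size) does not cross a boundary. */
            return (((addr ^ (addr + size - 1)) & -boundary) == 0);
    }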
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33685
It is only called in the file that defines it, so make it static and
remove the declaration from the header.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33688
Handle specially the boundary==0 case of vm_reserv_reclaim_contig, by
turning off boundary adjustment in that case.
Reviewed by: alc
Tested by: pho, madpilot
Commit fb38b29b56 (page_alloc_br) "vm_page: Remove extra test, dup
code from page alloc" should have moved a comment block when it moved
the function call that followed it. Move the comment block now.
Function vm_reserv_reclaim_contig breaks a reservation with enough
free space to satisfy an allocation request and returns the free space
to the buddy allocator. Change the function to allocate the request
memory from the reservation before breaking it, and return that memory
to the caller. That avoids a second call to the buddy allocator and
guarantees successful allocation after breaking the reservation, where
that success is not currently guaranteed.
Reviewed by: alc, kib (previous version)
Differential Revision: https://reviews.freebsd.org/D33644
Fix a very recent change that introduced a page accounting error in
the case of a reservation being broken.
Reviewed by: alc
Fixes: fb38b29b56 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc
Differential Revision: https://reviews.freebsd.org/D33645
Extract code common to functions vm_page_alloc_contig_domain and
vm_page_alloc_noobj_contig_domain into a new function. Do so in a way
that eliminates a bound-to-fail reservation test after a reservation
is broken by a call from vm_page_alloc_contig_domain.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33551
One was required to press a key to continue after every 18 lines of
output. This requirement had been in the "show vmopag" command since it
was introduced, which was many years before paging was added to DDB.
With paging, this explicit key check is no longer necessary.
Obtained from: Juniper Networks, Inc.
MFC after: 1 week
Test Plan:
Run "show vmopag" from db> prompt and see that it does not need additional
keypresses other than the ones needed for the pager.
Differential Revision: https://reviews.freebsd.org/D33550
Commit 867c27c23a modified the NFS client so that
it does IO_APPEND writes directly to the NFS server,
bypassing the buffer cache. However, this could result
in stale data in client pages when the file is mmap(2)'d.
As such, the NFS client needs to call is_object_active()
to check if the file is mmap(2)'d.
This patch renames is_object_active() to vm_object_is_active(),
moves it to sys/vm/vm_object.c and makes it global, so that
the NFS client can call it in a future commit.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33520
Function vm_reserv_test_contig has incorrectly used its alignment
and boundary parameters to find a well-positioned range of empty pages
in a reservation. Consequently, a reservation could be broken
mistakenly when it was unable to provide a satisfactory set of pages.
Rename the function, correct the errors, and add assertions to detect
the error in case it appears again.
Reviewed by: alc, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33344
vm_map_wire() works by calling vm_fault(VM_FAULT_WIRE) on each page in
the range. (For largepage mappings, it calls vm_fault() once per large
page.)
A pager's populate method may return more than one page to be mapped.
If VM_FAULT_WIRE is also specified, we'd wire each page in the run, not
just the fault page. Consider an object with two pages mapped in a
vm_map_entry, and suppose vm_map_wire() is called on the entry. Then,
the first vm_fault() would allocate and wire both pages, and the second
would encounter a valid page upon lookup and wire it again in the
regular fault handler. So the second page is wired twice and will be
leaked when the object is destroyed.
Fix the problem by modifying vm_fault_populate() to wire only the fault
page. Also modify the error handler for pmap_enter(psind=1) to not test
fs->wired, since it must be false.
PR: 260347
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33416
In vm_reserv_init, set all the marker popmap bits, and not just the
bits of the first popmap entry.
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33258
A page must not become invalid while vm_fault_soft_fast() is attempting
to map unbusied pages for reading.
Note that all callers hold the object write lock already, and
vm_page_set_invalid() asserts the object write lock.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33250
The flag requests skipping the heuristic which tries to avoid leaving
the system with more allocated memory than is available from RAM and
remaining swap.
Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33165
For compatibility, add a placeholder pointer to the start of the
added struct swapoff_new_args, and use it to distinguish old vs. new
style of syscall invocation.
Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33165
When swap is turned off due to system shutdown or reboot, ignore the
check. The problem is that the check is not accurate by any means: the
free page count can legitimately be low while the system is still able
to page everything in from swap. Also, at shutdown we turn swap off
when swapping on a real file or some non-standard GEOM provider, and
would typically panic when the system appears to actually need a
now-unavailable page.
For the syscall, it is better to be safe than sorry.
Reported and tested by: peterj
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33147
VOP_STRATEGY() requires a locked vnode. Note that we lock the swap
vnode while pages are busy, but this would only cause a real LoR if the
pages belonged to the swap vnode itself, which must not be the case in
correct use.
Reported and tested by: peterj
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33119
to cover VOP_GETATTR() call in sys_swapon(). Move locking from inside
swapongeom() and swaponvp() into sys_swapon().
Reported and tested by: peterj
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33119
Rather than overloading the meanings of the Mach statuses, introduce a
new set for use internally in the fault code. This makes the control
flow easier to follow and provides some extra error checking when a
fault status variable is used in a switch statement.
vm_fault_lookup() and vm_fault_relookup() continue to use Mach statuses
for now, as there isn't much benefit to converting them and they
effectively pass through a status from vm_map_lookup().
Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33017
This makes it easier to factor out pieces of vm_fault(). No functional
change intended.
Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33016
When constructing the set of dumpable pages, use the bitset provided by
the state argument, rather than invariably using vm_page_dump. For
normal kernel minidumps this will be a pointer to vm_page_dump, but when
dumping the live system it will not.
To do this, the functions in vm_dumpset.h are extended to accept the
desired bitset as an argument. Note that this provided bitset is assumed
to be derived from vm_page_dump, and therefore has the same size.
Reviewed by: kib, markj, jhb
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31992
4.3 BSD's mmap took an int len and long pos. Reject negative lengths
and, in freebsd32, sign-extend pos correctly rather than mishandling
negative positions as large positive ones.
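A sketch of the corrected freebsd32 handling (locals hypothetical):

    /* len was an int in 4.3BSD; reject negative values. */
    if ((int32_t)uap->len < 0)
            return (EINVAL);
    /* Sign-extend the 32-bit offset instead of zero-extending it. */
    pos = (off_t)(int32_t)uap->pos;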
Reviewed by: kib
This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().
Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.
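The helper's assumed shape, mirroring the existing vn_pages_remove():

    /* Remove only valid pages; do not block on busy invalid pages. */
    void vn_pages_remove_valid(struct vnode *vp, vm_pindex_t start,
        vm_pindex_t end);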
PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32931
They are unused today and cannot be safely used in the face of unlocked
lookup, in which pages may be busied without the object lock held.
Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32948
- Modify vm_page_busy_sleep() and vm_page_busy_sleep_unlocked() to take
a VM_ALLOC_* flag indicating whether to sleep on shared-busy, and fix
up callers.
- Modify vm_page_busy_sleep() to return a status indicating whether the
object lock was dropped, and fix up callers.
- Convert callers of vm_page_sleep_if_busy() to use vm_page_busy_sleep()
instead.
- Remove vm_page_sleep_if_(x)busy().
No functional change intended.
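A sketch of the resulting caller idiom (flag and return semantics as
described in the list above):

    if (vm_page_busy_sleep(m, "pgbusy", 0)) {
            /* The object lock was dropped while sleeping; retake it. */
            VM_OBJECT_WLOCK(object);
    }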
Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32947
We added _NORECLAIM to request that kmem_alloc_contig_pages() not spend
time scanning physical memory for candidates to reclaim. In some
situations the scanning can induce large amounts of undesirable latency,
and it's less important that the request be satisfied than it is that we
not spend many milliseconds scanning.
The problem extends to vm_reserv_reclaim_contig(), which unlike
vm_reserv_reclaim() may have to scan the entire list of partially
populated reservations. Use VM_ALLOC_NORECLAIM to request that this
scan not be executed.[1]
As a side effect, this fixes a regression in 02fb0585e7 ("vm_page:
Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain()")
where VM_ALLOC_CONTIG was not included in VPAC_FLAGS or VPANC_FLAGS even
though it is not masked by kmem_alloc_contig_pages().[2]
Reported by: gallatin [1], glebius [2]
Reviewed by: alc, glebius, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32899
Kegs with no items reserved have uk_reserve = 0. So the check
keg->uk_reserve >= dom->ud_free_items will be true once all slabs are
depleted. Then, rather than go and allocate a fresh slab, we return to
the cache layer.
The intent was to do this only when the keg actually has a reserve, so
modify the check to verify this first. Another approach would be to
make uk_reserve signed and set it to -1 until uma_zone_reserve() is
called, but this requires a few casts elsewhere.
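A sketch of the intended condition (field names from the description):

    /*
     * Hold back the reserve only when one was actually configured;
     * otherwise go allocate a fresh slab as before.
     */
    if (keg->uk_reserve > 0 && dom->ud_free_items <= keg->uk_reserve)
            return (NULL);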
Fixes: 1b2dcc8c54 ("uma: Avoid depleting keg reserves when filling a bucket")
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32516
M_USE_RESERVE is used in a couple of places in the VM to avoid unbounded
recursion when the direct map is not available, as is the case on 32-bit
platforms or when certain kernel sanitizers (KASAN and KMSAN) are
enabled. For example, to allocate KVA, the kernel might allocate a
kernel map entry, which might require a new slab, which requires KVA.
For these zones, we use uma_prealloc() to populate a reserve of items,
and then in certain serialized contexts M_USE_RESERVE can be used to
guarantee a successful allocation. uma_prealloc() allocates the
requested number of items, distributing them evenly among NUMA domains.
Thus, in a first-touch zone, to satisfy an M_USE_RESERVE allocation we
might have to check the slab lists of other domains than the current one
to provide the semantics expected by consumers.
So, try harder to find an item if M_USE_RESERVE is specified and the keg
doesn't have anything for the current (first-touch) domain. Specifically,
fall back to a round-robin slab allocation. This change fixes boot-time
panics on NUMA systems with KASAN or KMSAN enabled.[1]
Alternately we could have uma_prealloc() allocate the requested number
of items for each domain, but for some existing consumers this would be
quite wasteful. In general I think keg_fetch_slab() should try harder
to find free slabs in other domains before trying to allocate fresh
ones, but let's limit this to M_USE_RESERVE for now.
Also fix a separate problem that I noticed: in a non-round-robin slab
allocation with M_WAITOK, rather than sleeping after a failed slab
allocation we simply try again. Call vm_wait_domain() before retrying.
Reported by: mjg, tuexen [1]
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32515
To make the call faster, do not count active/inactive object queues,
and do not report vnode info if any (for tmpfs).
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31163
vm_reserv_reclaim_*() will release pages to the default freepool, not
the direct freepool from which noobj allocations are drawn. But if both
pools are empty, the noobj allocator variants must break reservations to
make progress.
Reported by: cy
Reviewed by: kib (previous version)
Fixes: b498f71bc5 ("vm_page: Add a new page allocator interface for unnamed pages")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32592
Now vm_page_alloc() and friends will unconditionally preserve PG_ZERO,
so there is no point in setting this flag.
Eliminate a local variable and add a comment explaining why we
prioritize the allocation when the process is doomed.
No functional change intended.
Reviewed by: kib, alc
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32036
As in vm_page_alloc_domain_after(), unconditionally preserve PG_ZERO.
Implement vm_page_alloc_noobj_contig_domain().
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32034
This makes the allocator simpler since it can assume object != NULL.
Also modify the function to unconditionally preserve PG_ZERO, so
VM_ALLOC_ZERO is effectively ignored (and still must be implemented by
the caller for now).
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32033
This is the same as vm_page_alloc_noobj(), but allocates physically
contiguous runs of memory. For now it is implemented in terms of
vm_page_alloc_contig(), with the difference that
vm_page_alloc_noobj_contig() implements VM_ALLOC_ZERO by zeroing the
page.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32005
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.
Similarly, convert vm_page_alloc_domain() callers.
Note that callers are now responsible for assigning the pindex.
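An illustrative conversion (a sketch; actual call sites vary):

    /* Before: zero manually when PG_ZERO was not set. */
    m = vm_page_alloc(NULL, 0,
        VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
    if (m != NULL && (m->flags & PG_ZERO) == 0)
            pmap_zero_page(m);

    /* After: the noobj allocator zeroes the page itself. */
    m = vm_page_alloc_noobj(VM_ALLOC_WIRED | VM_ALLOC_ZERO);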
Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986
The diff adds vm_page_alloc_noobj() and vm_page_alloc_noobj_domain().
These mostly correspond to vm_page_alloc() and vm_page_alloc_domain()
when no VM object is specified, with the exception that they handle
VM_ALLOC_ZERO by zeroing the page, rather than by preserving PG_ZERO.
This simplifies callers and will permit simplification of the
vm_page_alloc_domain() definition.
Since the new allocator variant is similar to vm_page_alloc_freelist(),
implement both of them using a common backend allocator function. No
functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31985
Make vmdaemon timeout configurable, so that one can adjust
how often it runs.
Here's a trick: set this to 1, then run 'limits -m 0 sh',
then run whatever you want with 'ktrace -it XXX', and observe
how the working set changes over time.
Reviewed by: kib
Sponsored by: EPSRC
Differential Revision: https://reviews.freebsd.org/D22038
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.
Save information about stack size in struct vmspace and adjust the
rlim_cur value. If rlim_cur plus the stack gap is bigger than
rlim_max, then the value is truncated to rlim_max.
PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516
sys/sysctl.h moved the struct thread forward declaration under #ifdef
_KERNEL, so this header fails when included from userland. Add a
forward declaration here.
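The fix amounts to a one-line forward declaration in this header:

    struct thread;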
Fixes: 99eefc727e
Sponsored by: Netflix
Wakeup in vm_waitpfault() does not mean that the thread would get the
page on the next vm_page_alloc() call; another thread might steal the
free page we were waiting for. On the other hand, this wakeup might
come much earlier than just vm_pfault_oom_wait seconds, if the rate of
page reclamation is high enough.
If wakeups come fast and we lose the allocation race enough times, OOM
could be undeservedly triggered much earlier than vm_pfault_oom_attempts
x vm_pfault_oom_wait seconds. Fix it by not counting the number of
sleeps, but measuring the time since the first allocation failure, and
triggering OOM when it is older than oom_attempts x oom_wait seconds.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D32287
The function is identical in each minidump implementation, so move it to
vm_phys.c. The only slight exception is powerpc where the function was
public, for use in moea64_scan_pmap().
Reviewed by: kib, markj, imp (earlier version)
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31884
GEOM_ELI needs to know the value, because it will soon have special
memory handling for IO operations associated with swap.
Move initialization to swap_pager_init(), which is executed at
SI_SUB_VM, unlike swap_pager_swap_init(), which would be executed
only when swap is configured. GEOM_ELI might need the value at
SI_SUB_DRIVERS, when disks are tasted by GEOM.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D24400
This is useful for measuring the number of pages that could be freed
from a NOFREE zone under memory pressure.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
If a phys_avail[] segment only intersects with some vm_phys segment,
add the pages from it that belong to the given vm_phys_seg to the free
list, instead of dropping them.
The vm_phys segments are generally the result of subdivision of
phys_avail segments; for instance, DMA32 or LOWMEM boundaries split
them. On amd64, after UEFI in-place kernel activation (copy_staging
disable) was enabled, we typically have a large phys_avail[] segment
below 4G which crosses the LOWMEM (1M) boundary. With the current way
of requiring phys_avail[] to fully fit into a vm_phys_seg, this memory
was ignored.
Reported by: madpilot
Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31958
This interface is used solely by md(4) when the MD_RESERVE flag is
specified, as in `mdconfig -a -t swap -s 1G -o reserve`. It
pre-allocates swap blocks for the entire object.
The number of blocks to be reserved is specified as a vm_size_t, but
swp_pager_getswapspace() can allocate at most INT_MAX blocks. vm_size_t
also seems like the incorrect type to use here it refers only to the
size of the VM object, not the size of a mapping. So:
- change the type of "size" in swap_pager_reserve() to vm_pindex_t, and
- clamp the requested number of blocks for a single
swp_pager_getswapspace() call to INT_MAX.
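A sketch of the clamping (locals hypothetical; the helper updates its
block-count argument in place):

    /* Request at most INT_MAX blocks per call. */
    n = (int)MIN(size - reserved, (vm_pindex_t)INT_MAX);
    blk = swp_pager_getswapspace(&n);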
Reported by: syzkaller
Reviewed by: dougm, alc, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31875
In fee2a2fa39 the KASSERTs in
vm_page_unwire_noq() changed from "vm_page_unwire" to "vm_page_unref".
While the former is no longer part of that function, the latter does
not exist as a function and is highly confusing when hit, when using
tools to look up the functions rather than doing a full-text search.
Use %s __func__ for printing the function name, as that will do the
right thing as code moves around and functions get renamed.
Hit: while debugging a wired page leak with linuxkpi/iwlwifi
Sponsored by: The FreeBSD Foundation
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D31635
For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO). Some zones are exempted from
this when it would otherwise raise false positives.
Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations. This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report. For example:
panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe
Sponsored by: The FreeBSD Foundation
- During boot, allocate PDP pages for the shadow maps. The region above
KERNBASE is currently not shadowed.
- Create a dummy shadow for the vm page array. For now, this array is
not protected by the shadow map to help reduce kernel memory usage.
- Grow shadows when growing the kernel map.
- Increase the default kernel stack size when KMSAN is enabled. As with
KASAN, sanitizer instrumentation appears to create stack frames large
enough that the default value is not sufficient.
- Disable UMA's use of the direct map when KMSAN is configured. KMSAN
cannot validate the direct map.
- Disable unmapped I/O when KMSAN is configured.
- Lower the limit on paging buffers when KMSAN is configured. Each
buffer has a static MAXPHYS-sized allocation of KVA, which in turn
eats 2*MAXPHYS of space in the shadow map.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31295
This KPI is created in addition to the existing vnode_pager_setsize(9)
KPI. The KPI is intended for file systems that are able to turn a range
of a file into a sparse range, also known as hole-punching.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27194
and remove repetitive code that calculates the vnode locking type for write.
Reviewed by: khng, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31405
which is the place to put MD asserts about allocated pages.
On amd64, verify that the allocated page does not belong to the kernel
(text, data) or to early allocated pages.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31121
redzone(9) does some munging of the allocation to insert redzones before
and after a valid memory buffer, but KASAN does not know about this and
will raise false positives if both are configured. Until this is fixed,
do not allow both to be configured. Note that KASAN provides similar
checking on its own but currently does not force the creation of
redzones for all UMA allocations; this should be addressed as well.
Sponsored by: The FreeBSD Foundation
- Ensure that all items returned by UMA are aligned to
KASAN_SHADOW_SCALE (8). This was true in practice since smaller
alignments are not used by any consumers, but we should enforce it
anyway.
- Use a non-zero code for marking redzones that appear naturally in
items that are not a multiple of the scale factor in size. Currently
we do not modify keg layouts to force the creation of redzones.
- Use a non-zero code for marking freed per-CPU items, otherwise
accesses of freed per-CPU items are not detected by the runtime.
Sponsored by: The FreeBSD Foundation
Remove OBJT_SWAP_TMPFS. Move tmpfs-specific swap pager bits into
tmpfs_subr.c.
There is no longer any code to directly support tmpfs in sys/vm, most
tmpfs knowledge is shared by non-anon swap object type implementation.
The tmpfs-specific methods are provided by registered tmpfs pager, which
inherits from the swap pager.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
A pager is allowed to inherit part of its implementation from an
existing pager, which is done by copying non-NULL virtual method slots.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
Mostly in cases where OBJ_SWAP flag works as well, or by reversing the
condition so that object types can be listed.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
This avoids the need to know all existing object types in advance, at
the cost of losing the assert that an unknown object type is handled in
a sane manner.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168
This is the OBJT_SWAP pager, specialized for tmpfs. Right now, both
the swap pager and generic vm code have to explicitly handle swap
objects which are tmpfs vnode v_object, in special ways. Replace
(almost) all such places with proper methods.
Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.
This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Put each type on a dedicated line, which makes the addition of new
types cleaner.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Allow the vp_heldp argument to be NULL, in which case the returned vnode
is not held for tmpfs swap objects.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Makes the code in vm_object collapse/page_remove cleaner.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
This eliminates the staircase of conditions in vm_map_entry_set_vnode_text().
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
specialized for swap and vnode pagers, and used to implement
vm_object_set_writeable_dirty().
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
Fill lines with the function definitions.
Use a local variable to shorten repeated extra-long expressions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA. This value is "alloc". Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map. Do so using the correct size.
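A sketch of the fix; kasan_mark() takes the valid size, the total size,
and a redzone code:

    /* The copy may touch the old redzone, so validate all of it. */
    kasan_mark(addr, alloc, alloc, 0);
    bcopy(addr, newaddr, alloc);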
Reported by: kp
Tested by: kp
Sponsored by: The FreeBSD Foundation
When estimating working set size, measure only allocation batches, not
free batches. Allocation and free patterns can be very different. For
example, on a vm_lowmem event ZFS can free a few gigabytes of memory to
UMA in one call, but that does not mean it will request the same amount
back that fast too; in fact it won't.
Update the working set size on every reclamation call, shrinking caches
faster under pressure. The lack of this caused repeating vm_lowmem
events squeezing more and more memory out of real consumers only to
leave it stuck in UMA caches. I saw ZFS drop its ARC size in half
before the previous algorithm, after a periodic WSS update, decided to
reclaim the UMA caches.
Introduce voluntary reclamation of UMA caches not used for a long time.
For each zdom, track a long-term minimal cache size watermark, and free
some unused items every UMA_TIMEOUT after the first 15 minutes without
cache misses. Freed memory can be put to better use by other consumers.
For example, ZFS won't grow its ARC unless it sees free memory, since
it does not know that the cached memory is not really used. And even if
the memory is not really needed, periodic freeing during inactivity
periods should reduce its fragmentation.
Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790
For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object. This allows userspace
to deconstruct the shadow chain. Right now the handle is the address
of the object in KVA, but this is not guaranteed.
For the same anonymous objects, report the swap space used for
actually swapped-out pages, in the kvo_swapped field. I do not believe
that it is useful to report a full 64-bit counter there, so only a
uint32_t value is returned, clamped to the max.
For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29771
Make it possible to reclaim items from a specific NUMA domain.
- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations. Use a counter instead of a flag to
synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
domain.
Currently the new KPIs are unused, so there should be no functional
change.
Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29685
Note that the per-domain variant does not shrink the target bucket size.
No functional change intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Memory allocated with kmem_* is unmapped upon free, so KASAN doesn't
provide a lot of benefit, but since allocations are always a multiple of
the page size we can create a redzone when the allocation request size
is not a multiple of the page size.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29458
We allocate kernel stacks using a UMA cache zone. Cache zones have
KASAN disabled by default, but in this case it makes sense to enable it.
Reviewed by: andrew
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29457
- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
zone should not be sanitized. This is applied implicitly for NOFREE
and cache zones.
- Add KASAN call backs which get invoked:
1) when a slab is imported into a keg
2) when an item is allocated from a zone
3) when an item is freed to a zone
4) when a slab is freed back to the VM
In state transitions 1 and 3, memory is poisoned so that accesses will
trigger a panic. In state transitions 2 and 4, memory is marked
valid.
- Disable trashing if KASAN is enabled. It just adds extra CPU overhead
to catch problems that are detected by KASAN.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29456
--Eliminate a big ifdef that encompassed all currently-supported
architectures except mips and powerpc32. This applied to the case
in which we've allocated a superpage but the pager-populated range
is insufficient for a superpage mapping. For platforms that don't
support superpages the check should be inexpensive as we shouldn't
get a superpage in the first place. Make the normal-page fallback
logic identical for all platforms and provide a simple implementation
of pmap_ps_enabled() for MIPS and Book-E/AIM32 powerpc.
--Apply the logic for handling pmap_enter() failure if a superpage
mapping can't be supported due to additional protection policy.
Use KERN_PROTECTION_FAILURE instead of KERN_FAILURE for this case,
and note Intel PKU on amd64 as the first example of such protection
policy.
Reviewed by: kib, markj, bdragon
Differential Revision: https://reviews.freebsd.org/D29439
pmap_enter(PMAP_ENTER_LARGEPAGE) may return KERN_PROTECTION_FAILURE due to
PKRU inconsistency. Handle it in the caller of vm_fault_populate(),
and in places which decode errors from vm_fault_populate()/
vm_fault_allocate().
Reviewed by: jah, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29442
We already allow free(NULL) and uma_zfree(..., NULL). Make
uma_zfree_pcpu(..., NULL) work as well.
This also means that counter_u64_free(NULL) will work.
These make cleanup code simpler.
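For example, cleanup paths no longer need a NULL check (sketch, where
"error" is hypothetical):

    counter_u64_t c = counter_u64_alloc(M_NOWAIT);

    if (error != 0) {
            counter_u64_free(c);    /* safe even when c == NULL */
            return (error);
    }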
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29189
The per-domain partpop queue is locked by the combination of the
per-domain lock and individual reservation mutexes.
vm_reserv_reclaim_contig() scans the queue looking for partially
populated reservations that can be reclaimed in order to satisfy the
caller's allocation.
During the scan, we drop the per-domain lock. At this point, the rvn
pointer may be invalidated. Take care to load rvn after re-acquiring
the per-domain lock.
While here, simplify the condition used to check whether a reservation
was dequeued while the per-domain lock was dropped.
Reviewed by: alc, kib
Reported by: gallatin
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29203
When searching for runs to reclaim, we need to ensure that the entire
run will be added to the buddy allocator as a single unit. Otherwise,
it will not be visible to vm_phys_alloc_contig() as it is currently
implemented. This is a problem for allocation requests that are not a
power of 2 in size, as with 9KB jumbo mbuf clusters.
Reported by: alc
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28924
This KASSERT is overzealous because of the following race condition:
1) A managed page which is currently in PQ_LAUNDRY is freed.
vm_page_free_prep calls vm_page_dequeue_deferred()
The page state is:
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
2) The laundry worker comes around, picks up the page, and calls
vm_pageout_defer(m, PQ_LAUNDRY, true) to check if the page is still in
the queue. We do a vm_page_astate_load and get
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
as per above.
3) The laundry worker is pre-empted and another thread allocates our page
from the free pool. For example vm_page_alloc_domain_after calls
vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for
an OBJT_UNMANAGED object.
The page state is:
PQ_NONE, 0 - VPO_UNMANAGED
4) The laundry worker resumes and processes vm_pageout_defer based on
the stale astate, which leads to a call to vm_page_pqbatch_submit,
which will trip on the KASSERT.
Submitted by: mlaier
Reviewed by: markj, rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28563
Otherwise, on a powerpc64 NUMA system with hashed page tables, the
first-level superpage reservation size is large enough that the value of
the kernel KVA arena import quantum, KVA_NUMA_IMPORT_QUANTUM, is
negative and gets sign-extended when passed to vmem_set_import(). This
results in a boot-time hang on such platforms.
Reported by: bdragon
MFC after: 3 days
KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.
Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.
Reported by: GENERIC-KCSAN
Test Plan: KCSAN no longer complains, kernel still runs fine.
Reviewed by: markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569
This macro returns true if a provided virtual address is contained
in the kernel's clean submap.
In CHERI kernels, the buffer cache and transient I/O map are allocated
as separate regions. Abstracting this check reduces the diff relative
to FreeBSD. It is perhaps slightly more readable as well.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D28710
This flag indicates that the page should be enqueued near the head of
the inactive queue, skipping the LRU queue. It is used when unwiring
pages from the buffer cache following direct I/O or after I/O when
POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when
sendfile(SF_NOCACHE) completes. For the direct I/O and sendfile cases
we only enqueue the page if we decide not to free it, typically because
it's mapped.
Pass "noreuse" through to vm_page_release_toq() so that we actually
honour the desired LRU policy for these scenarios.
Reported by: bdrewery
Reviewed by: alc, kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28555
If an M_WAITOK contig alloc fails, the VM subsystem will try to
reclaim contiguous memory twice before actually failing the
request. On a system with 64GB of RAM I've observed this take
400-500ms before it finally gives up, and I believe that this
will only be worse on systems with even more memory.
In certain contexts this delay is extremely harmful, so add a flag
that will skip reclaim for allocation requests to allow those
paths to opt-out of doing an expensive reclaim.
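A hypothetical caller opting out of the reclaim pass:

    buf = contigmalloc(size, M_DEVBUF, M_WAITOK | M_NORECLAIM,
        0, ~(vm_paddr_t)0, PAGE_SIZE, 0);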
Sponsored by: Dell Inc
Differential Revision: https://reviews.freebsd.org/D28422
Reviewed by: markj, kib
Replace all uses of kern_mmap with kern_mmap_req, remove the old
kern_mmap, and rename kern_mmap_req to kern_mmap.
The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.
Obtained from: CheriBSD
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28292
This prevents a situation where another thread modifies map entry
permissions between setting max_prot, then relocking, then setting
prot, confusing the outcome of the operation. E.g., you could get an
error that is not possible if the operation were performed atomically.
Also enable setting rwx for max_prot even if the map does not allow
setting effective rwx protection.
Reviewed by: brooks, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28117
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE
and PROT_EXEC are never specified together if the vm_map has the MAP_WX
flag set. A FreeBSD feature-control note flag allows a specific binary
to request a W^X exemption, and there are per-ABI boolean sysctls
kern.elf{32,64}.allow_wx to enable/disable the enforcement globally.
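For example, with enforcement enabled and no exemption for the binary,
a userland request such as the following is expected to fail:

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_ANON | MAP_PRIVATE, -1, 0);     /* returns MAP_FAILED */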
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
startup_alloc() uses pmap_map() to map slabs used for bootstrapping the
VM. pmap_map() may ignore the hint address and simply return a range
from the direct map. In this case we must not unmap the range in
startup_free().
UMA uses bootstart and bootmem to track the range of KVA into which
slabs are mapped if the direct map is not used. Unmap a startup slab
only if it was mapped into that range.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27885
Use atomic testandset and testandclear to catch concurrent double free,
and to reduce the number of atomic operations.
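A sketch of the idea, assuming bitset(9)-style test-and-set helpers and
field names resembling the UMA slab debugging bitset:

    /* One atomic op marks the item freed and detects a prior free. */
    if (BIT_TEST_SET_ATOMIC(SLAB_MAX_SETSIZE, freei, &slab->us_debugfree))
            panic("Duplicate free of %p from zone %p", item, zone);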
Submitted by: jeff
Reviewed by: cem, kib, markj (all previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22703
In vm_page_busy_acquire(), load the object pointer using
atomic_load_ptr() as we do elsewhere. Per the comment, the object
identity must be consistent across sleeps.
In vm_page_grab_sleep(), pass the correct pindex to
_vm_page_busy_sleep(). The pindex is used to re-check the page's
identity before going to sleep. In particular, vm_page_grab_sleep() is
used in unlocked grab, so the object lock is not necessarily held when
verifying the page's identity, and the pindex may change if the page is
moved, or freed and re-allocated. I believe this can result in spurious
VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination
of vm_page_grab_pages_unlocked().
In vm_page_grab_pages(), pass the correct pindex to
vm_page_grab_sleep(). Otherwise I believe vm_page_grab_pages() will
effectively spin when attempting to busy a busy page after the first
index in the range.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27607
This restores behavior from before domain iterators were added in
r327895 and r327896.
The vm_domainset_iter_policy() will do a vm_wait_doms() and then
restart its iterator when M_WAITOK is set. It will also force
the containing loop to have M_NOWAIT. So we get an unbounded
retry loop rather than the intended bounded retries that
kmem_alloc_contig_pages() already handles.
This also restores M_WAITOK to the vmem_alloc() call in
kmem_alloc_attr_domain() and kmem_alloc_contig_domain().
Reviewed by: markj, kib
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D27507
The old implementation chose the largest bucket zone such that if the
per-CPU caches are fully populated, the total number of items cached is
no larger than the specified limit. If no such zone existed, UMA would
not do any caching.
We can now use uz_bucket_size_max to set a precise limit on the number
of items in a zone's bucket, so the total size of per-CPU caches can be
bounded more easily. Implement a new policy in uma_zone_set_maxcache():
choose a bucket size such that up to half of the limit can be cached in
per-CPU caches, with the rest going to the full bucket cache. This
fixes a problem with the kstack_cache zone: the limit of 4 * mp_ncpus
items meant that the zone would not do any caching, defeating the whole
purpose of the zone. That's because the smallest bucket size holds up
to 2 items and we may cache up to 3 full buckets per CPU, and
2 * 3 * mp_ncpus > 4 * mp_ncpus.
Reported by: mjg
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27168
uz_bucket_size_max is the maximum permitted bucket size. When filling a
new bucket to satisfy uma_zalloc(), the bucket is populated with at most
uz_bucket_size_max items. The maximum number of entries in the bucket
may be larger. When freeing items, however, we will fill per-CPU
buckets up to their maximum number of entries, potentially exceeding
uz_bucket_size_max. This makes it difficult to precisely limit the
number of items that may be cached in a zone. For example, if one wants
to limit buckets to 1 entry for a particular zone, that's not possible
since the smallest bucket holds up to 2 entries.
Try to solve the problem by using uz_bucket_size_max to limit the number
of entries in a bucket. Note that the ub_entries field is initialized
upon every bucket allocation. Most zones are not affected since they do
not impose any specific limit on the maximum bucket size.
While here, remove the UMA_ZONE_MINBUCKET flag. It was unused and we
now have uma_zone_set_maxcache() to control the zone's cache size more
precisely.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27167
Allocation of a bucket can trigger a cross-domain free in the bucket
zone, e.g., if the per-CPU alloc bucket is empty, we free it and get
migrated to a remote domain. This can lead to deadlocks since a bucket
zone may allocate buckets from itself or a pair of bucket zones could be
allocating from each other.
Fix the problem by dropping the cross-domain lock before allocating a
new bucket and handling refill races. Use a list of empty buckets to
ensure that we can make forward progress.
Reported by: imp, mjg (witness(9) warnings)
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27341
Replace MAXPHYS by the runtime variable maxphys. It is initialized
from MAXPHYS by default, but can also be adjusted with the tunable
kern.maxphys.
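For example, the limit can be raised at boot from loader.conf (sketch):

    kern.maxphys="1048576"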
Make the b_pages[] array in struct buf flexible. Size b_pages[] for
buffer cache buffers exactly to atop(maxbcachebuf) (currently it is
sized to atop(MAXPHYS)), and size b_pages[] for pbufs to
atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers,
among them vmapbuf(), to use unaligned buffers still sized to maxphys,
esp. when such buffers come from userspace (*). Overall, we save a
significant amount of otherwise wasted memory in b_pages[] for buffer
cache buffers, while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in the linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, and
dev/siis were either submitted by, or based on changes by, mav.
Suggested by: mav (*)
Reviewed by: imp, mav, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225