freebsd-skq

Author	SHA1	Message	Date
Jeff Roberson	be9d4fd6b4	(fault 8/9) Restructure some code to reduce duplication and simplify flow control. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23321	2020-01-23 05:22:02 +00:00
Jeff Roberson	df794f5caf	(fault 7/9) Move fault population and allocation into a dedicated function Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23320	2020-01-23 05:19:39 +00:00
Jeff Roberson	5909dafea9	(fault 6/9) Move getpages and associated logic into a dedicated function. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23311	2020-01-23 05:18:00 +00:00
Jeff Roberson	91eb2e908f	(fault 5/9) Move the backing_object traversal into a dedicated function. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23310	2020-01-23 05:14:41 +00:00
Jeff Roberson	5936b6a8f1	(fault 4/9) Move copy-on-write into a dedicated function. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23304	2020-01-23 05:11:01 +00:00
Jeff Roberson	fcb0475833	(fault 3/9) Move map relookup into a dedicated function. Add a new VM return code KERN_RESTART which means, deallocate and restart in fault. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23303	2020-01-23 05:07:01 +00:00
Jeff Roberson	c308a3a6c9	(fault 2/9) Move map lookup into a dedicated function. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23302	2020-01-23 05:05:39 +00:00
Jeff Roberson	2c2f4413cc	(fault 1/9) Move a handful of stack variables into the faultstate. This additionally fixes a potential bug/pessimization where we could fail to reload the original fault_type on restart. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23301	2020-01-23 05:03:34 +00:00
Ryan Libby	8d1c459ae5	uma: fix zone domain overlaying pcpu cache with disabled cpus UMA zone structures have two arrays at the end which are sized according to the machine: an array of CPU count length, and an array of NUMA domain count length. The CPU counting was wrong in the case where some CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the second array to be overlaid with the first. Reported by: olivier Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23318	2020-01-23 04:56:38 +00:00
Ryan Libby	7e2406774e	uma: report leaks more accurately Previously UMA had some false negatives in the leak report at keg destruction time, where it only reported leaks if there were free items in the slab layer (rather than allocated items), which notably would not be true for single-item slabs (large items). Now, report a leak if there are any allocated pages, and calculate and report the number of allocated items rather than free items. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23275	2020-01-23 04:56:34 +00:00
Jeff Roberson	91e31c3c08	Consistently use busy and vm_page_valid() rather than touching page bits directly. This improves API compliance, asserts, etc. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23283	2020-01-23 04:54:49 +00:00
Jeff Roberson	530cc6a25d	Some architectures with DMAP still consume boot kva. Simplify the test for claiming kva in uma_startup2() to handle this. Reported by: bdragon	2020-01-23 03:37:35 +00:00
Jeff Roberson	5949b1ca8c	Move readahead and dropbehind fault functionality into a helper routine for clarity. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23282	2020-01-21 00:12:57 +00:00
Jeff Roberson	1e40fe41c5	Reduce object locking in vm_fault. Once we have an exclusively busied page we no longer need an object lock. This reduces the longest hold times and eliminates some trylock code blocks. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23034	2020-01-20 22:49:52 +00:00
Jeff Roberson	d6e13f3b4d	Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033	2020-01-19 23:47:32 +00:00
Jeff Roberson	9c83ff2d86	It has not been possible to recursively terminate a vnode object for some time now. Eliminate the dead code that supports it. Approved by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:36:03 +00:00
Jeff Roberson	98087a066f	Make collapse synchronization more explicit and allow it to complete during paging. Shadow objects are marked with a COLLAPSING flag while they are collapsing with their backing object. This gives us an explicit test rather than overloading paging-in-progress. While split is on-going we mark an object with SPLIT. These two operations will modify the swap tree so they must be serialized and swap_pager_getpages() can now directly detect these conditions and page more conservatively. Callers to vm_object_collapse() now will reliably wait for a collapse to finish so that the backing chain is as short as possible before other decisions are made that may inflate the object chain. For example, split, coalesce, etc. It is now safe to run fault concurrently with collapse. It is safe to increase or decrease paging in progress with no lock so long as there is another valid ref on increase. This change makes collapse more reliable as a secondary benefit. The primary benefit is making it safe to drop the object lock much earlier in fault or never acquire it at all. This was tested with a new shadow chain test script that uncovered long standing bugs and will be integrated with stress2. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:30:23 +00:00
Andrew Gallatin	2052680238	pcpu_page_alloc: guard against empty NUMA domains Some systems, such as higher end Threadripper, may have NUMA domains with no physical memory. Don't allocate from these domains. This fixes a "panic: vm_wait in early boot" on my 2990WX desktop Reviewed by: jeff Sponsored by: Netflix	2020-01-18 18:25:37 +00:00
Jeff Roberson	5844774900	Fix a long standing bug that was made worse in r355765. When we are cowing a page that was previously mapped read-only it exists in pmap until pmap_enter() returns. However, we held no reference to the original page after the copy was complete. This allowed vm_object_scan_all_shadowed() to collapse an object that still had pages mapped. To resolve this, add another page pointer to the faultstate so we can keep the page xbusy until we're done with pmap_enter(). Handle busy pages in scan_all_shadowed. This is already done in vm_object_collapse_scan(). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23155	2020-01-17 03:44:04 +00:00
Jeff Roberson	a81c400e75	Simplify VM and UMA startup by eliminating boot pages. Instead use careful ordering to allocate early pages in the same way boot pages were but only as needed. After the KVA allocator has started up we allocate the KVA that we consumed during boot. This also makes the boot pages freeable since they have vm_page structures allocated with the rest of memory. Parts of this patch were written and tested by markj. Reviewed by: glebius, markj Differential Revision: https://reviews.freebsd.org/D23102	2020-01-16 05:01:21 +00:00
Alexander Motin	ace409ce9c	Restore loop break in vm_pageout_lowmem(). r355004 removed return statement from this loop with intention to also call uma_reclaim_wakeup(). But in case of vm.lowmem_period=0 it causes infinite loop. Reviewed by: markj Sponsored by: iXsystems, Inc.	2020-01-14 03:27:57 +00:00
Ryan Libby	9b8db4d0a0	uma: split slabzone into two sizes By allowing more items per slab, we can improve memory efficiency for small allocs. If we were just to increase the bitmap size of the slabzone, we would then waste slabzone memory. So, split slabzone into two zones, one especially for 8-byte allocs (512 per slab). The practical effect should be reduced memory usage for counter(9). Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23149	2020-01-14 02:14:15 +00:00
Ryan Libby	e63a1c2f52	uma: fixup some ktr messages Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23148	2020-01-14 02:13:46 +00:00
Mateusz Guzik	a314aba874	vm: add missing CTLFLAG_MPSAFE annotations This covers all vm/* files.	2020-01-12 05:08:57 +00:00
Gleb Smirnoff	9328cbc047	Always multiply vm.pgcache_zone_max by the number of CPUs, and rename it accordingly. The tunable controls the size of the per-CPU vm page cache. Previously the value was split among all CPUs in the system, so configuring the same value on machines with different CPU counts yielded a different cache size available to each CPU. Reviewed by: markj Obtained from: Netflix	2020-01-10 19:32:08 +00:00
Mark Johnston	860bb7a04c	UMA: Don't destroy zones after the system shutdown process starts. Some kernel subsystems, notably ZFS, will destroy UMA zones from a shutdown eventhandler. This causes the zone to be drained. For slabs that are mapped into KVA this can be very expensive and so it needlessly delays the shutdown process. Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy() into a no-op, provided that the zone does not have a custom finalization routine. PR: 242427 Reviewed by: jeff, kib, rlibby MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23066	2020-01-09 19:17:42 +00:00
Ryan Libby	4a8b575c6b	uma: unify layout paths and improve efficiency Unify the keg layout selection paths (keg_small_init, keg_large_init, keg_cachespread_init), and slightly improve memory efficiency by: - using the padding of the final item to store the slab header, - not going OFFPAGE if we have a choice unless it improves efficiency. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23048	2020-01-09 02:03:17 +00:00
Ryan Libby	54c5ae804f	uma: reorganize flags - Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC. - Move flag VTOSLAB from public to private. - Introduce public NOTPAGE flag and make HASH private. - Introduce public NOTOUCH flag and make OFFPAGE private. - Update man page. The net effect of this should be to make the contract with clients more clear. Clients should choose constraints, UMA will figure out how to implement them. This also breaks the confusing double meaning of OFFPAGE. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23016	2020-01-09 02:03:03 +00:00
Jeff Roberson	79c9f9429a	Fix uma boot pages calculations on NUMA machines that also don't have MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment of zones while here. This was already correct because uz_cpu strongly aligned the zone structure but the specified alignment did not match reality and involved redundant defines. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23046	2020-01-06 02:51:19 +00:00
Jeff Roberson	bfb6b7a121	The fix in r356353 was insufficient. Not every architecture returns 0 for EARLY_COUNTER. Only amd64 seems to. Suggested by: markj Reported by: lwhsu Reviewed by: markj PR: 243117	2020-01-05 22:54:25 +00:00
Kyle Evans	2180f6c6f1	kern_mmap: restore character deleted in transit Pointy hat to: kevans X-MFC-With: r356359	2020-01-04 23:51:44 +00:00
Kyle Evans	18348a2369	kern_mmap: add a variant that allows caller to inspect fp Linux mmap rejects mmap() on a write-only file with EACCES. linux_mmap_common currently does a fun dance to grab the fp associated with the passed in fd, validates it, then drops the reference and calls into kern_mmap(). Doing so is perhaps both fragile and premature; there's still plenty of chance for the request to get rejected with a more appropriate error, and it's prone to a race where the file we ultimately mmap has changed after it drops its reference. This change alleviates the need to do this by providing a kern_mmap variant that allows the caller to inspect the fp just before calling into the fileop layer. The callback takes flags, prot, and maxprot as one could imagine scenarios where any of these, in conjunction with the file itself, may influence a caller's decision. The file type check in the linux compat layer has been removed; EINVAL is seemingly not an appropriate response to the file not being a vnode or device. The fileop layer will reject the operation with ENODEV if it's not supported, which more closely matches the common linux description of mmap(2) return values. If we discover that we're allowing an mmap() on a file type that Linux normally wouldn't, we should restrict those explicitly. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22977	2020-01-04 23:39:58 +00:00
Jeff Roberson	31c251a046	Fix an assertion introduced in r356348. On architectures without UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that violated the new assert. Resolve this by rewriting the COLD asserts to look at the per-cpu allocation counts for evidence of api activity. Discussed with: rlibby Reviewed by: markj Reported by: lwhsu	2020-01-04 19:29:25 +00:00
Jeff Roberson	dfe13344f5	UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and UMA_ZONE_ROUNDROBIN. The system will now select a default depending on kernel configuration. API users need only specify one if they want to override the default. Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA is. Reviewed by: markj Discussed with: rlibby Differential Revision: https://reviews.freebsd.org/D22831	2020-01-04 18:48:13 +00:00
Jeff Roberson	91d947bfbe	Sort cross-domain frees into per-domain buckets before inserting these onto their respective bucket lists. This is a several orders of magnitude improvement in contention on the keg lock under heavy free traffic while requiring only an additional bucket per-domain worth of memory. Discussed with: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22830	2020-01-04 07:56:28 +00:00
Jeff Roberson	8b987a7769	Use per-domain keg locks. This provides both a lock and separate space accounting for each NUMA domain. Independent keg domain locks are important with cross-domain frees. Hashed zones are non-numa and use a single keg lock to protect the hash table. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22829	2020-01-04 03:30:08 +00:00
Jeff Roberson	727c691857	Use a separate lock for the zone and keg. This provides concurrency between populating buckets from the slab layer and fetching full buckets from the zone layer. Eliminate some nonsense locking patterns where we lock to fetch a single variable. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22828	2020-01-04 03:15:34 +00:00
Jeff Roberson	4bd61e19a2	Use atomics for the zone limit and sleeper count. This relies on the sleepq to serialize sleepers. This patch retains the existing sleep/wakeup paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup in one case but otherwise should be bug for bug compatible. In particular, there are still various races surrounding adjusting the limit via sysctl that are now documented. Discussed with: markj Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22827	2020-01-04 03:04:46 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Mark Johnston	f7607c300b	Clear queue operation flags when migrating a page to another queue. The page daemon loops may move pages back to the active queue if references are detected. In this case we must take care to clear existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be set, and that flag is only valid if the page belongs to the inactive queue. Also fix a bug in the active queue scan where we were updating "old" instead of "new". This would only have been hit in rare cases where the page moved out of the active queue after the beginning of the scan. Reported by: Bob Prohaska, Idwer Vollering Tested by: Idwer Vollering Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D23001	2020-01-02 19:26:04 +00:00
Doug Moore	668a8aa83b	The map-entry clipping functions modify start and end entries of an entry in the vm_map, making invariants related to the max_free entry field invalid. Move the clipping work into vm_map_entry_link, so that linking is okay when the new entry clips a current entry, and the vm_map doesn't have to be briefly corrupted. Change assertions and conditions in SPLAY_{LEFT,RIGHT}_STEP since the max_free invariants can now be trusted in all cases. Tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D22897	2019-12-31 22:20:54 +00:00
Mark Johnston	758b2c02bb	Restore a vm_page_wired() check in vm_page_mvqueue() after r356156. We now set PGA_DEQUEUE on a managed page when it is wired after allocation, and vm_page_mvqueue() ignores pages with this flag set, ensuring that they do not end up in the page queues. However, this is not sufficient for managed fictitious pages or pages managed by the TTM. In particular, the TTM makes use of the plinks.q queue linkage fields for its own purposes. PR: 242961 Reported and tested by: Greg V <greg@unrelenting.technology>	2019-12-29 20:01:03 +00:00
Mark Johnston	9b888dd9bd	Clear queue op flags in vm_page_mvqueue(). This fixes a regression in r356155, introduced at the last minute. In particular, we must clear PGA_REQUEUE_HEAD before inserting into any queue besides PQ_INACTIVE since that operation is implemented only for PQ_INACTIVE. Reported by: pho, Jenkins via lwhsu	2019-12-29 15:39:43 +00:00
Mark Johnston	727150ff03	Remove some unused functions. The previous series of patches orphaned some vm_page functions, so remove them. Reviewed by: dougm, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22886	2019-12-28 19:04:29 +00:00
Mark Johnston	dc71caa037	Update the vm_page.h block comment to reflect recent changes. Explain the new locking rules for per-page queue state updates. Reviewed by: jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22884	2019-12-28 19:04:15 +00:00
Mark Johnston	9f5632e6c8	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885	2019-12-28 19:04:00 +00:00
Mark Johnston	b7f30bff2f	Generalize lazy dequeue logic for wired pages. Some recent work aims to remove the use of the page lock for synchronizing updates to page queue state. This change adds a mechanism to preserve the existing behaviour of lazily dequeuing wired pages, which was previously synchronized using the page lock. Handle this by setting PGA_DEQUEUE when a managed page's wire count transitions from 0 to 1. When the page daemon encounters a page with a flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that page, but in so doing it does not modify the page itself and thus racing with a concurrent free of the page is harmless. The flag is advisory; the page daemon still checks for wirings after acquiring the object and page xbusy locks. vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition. It must do this before dropping the reference to avoid a use-after-free but also handles races with concurrent wirings to ensure that PGA_DEQUEUE is not left unset on a wired page. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22882	2019-12-28 19:03:46 +00:00
Mark Johnston	f3f38e2580	Start implementing queue state updates using fcmpset loops. This is in preparation for eliminating the use of the vm_page lock for protecting queue state operations. Introduce the vm_page_pqstate_commit_*() functions. These functions act as helpers around vm_page_astate_fcmpset() and are specialized for specific types of operations. vm_page_pqstate_commit() wraps these functions. Convert a number of routines to use these new helpers. Use vm_page_release_toq() in vm_page_unwire() and vm_page_release() to atomically release a wiring reference and release the page into a queue. This has the side effect that vm_page_unwire() will leave the page in the active queue if it is already present there. Convert the page queue scans to use the new helpers. Simplify vm_pageout_reinsert_inactive(), which requeues pages that were found to be busy during an inactive queue scan, to avoid duplicating the work of vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or PGA_REQUEUE_HEAD is set, let that be handled during batch processing. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22770 Differential Revision: https://reviews.freebsd.org/D22771 Differential Revision: https://reviews.freebsd.org/D22772 Differential Revision: https://reviews.freebsd.org/D22773 Differential Revision: https://reviews.freebsd.org/D22776	2019-12-28 19:03:32 +00:00
Mark Johnston	3c01c56b0e	Don't update per-page activation counts in the swapout code. This avoids duplicating the work of the page daemon's active queue scan. Moreover, this duplication was inconsistent: - PGA_REFERENCED is not counted in act_count unless pmap_ts_referenced() returned 0, but the page daemon always counts PGA_REFERENCED towards the activation count. - The swapout daemon always activates a referenced page, but the page daemon only does so when the containing object is mapped at least once. The main purpose of swapout_deactivate_pages() is to shrink the number of pages mapped into a given pmap. To do this without unmapping active pages, use the non-destructive pmap_is_referenced() instead of the destructive pmap_ts_referenced() and deactivate pages accordingly. This simplifies some future changes to the locking protocol for page queue state. Reviewed by: kib Discussed with: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22674	2019-12-28 19:03:17 +00:00
Konstantin Belousov	df8db6ddb9	vm_object_shadow(): fix object reference leak. In r355270 by me, vm_object_shadow() was changed to handle the reference counting for the shared case, but the extra reference that was done in vmspace_fork() for the shared/need_copy case was not removed. Submitted by: jeff	2019-12-28 16:40:44 +00:00
Mark Johnston	5541eb27d6	Remove some stale comments from the page allocator. Since r352110 the page lock is not required to wire pages in any context.	2019-12-27 23:19:21 +00:00
Jeff Roberson	ff5ce8a7a5	Fix a pair of bugs introduced in r356002. When we reclaim physical pages we allocate them with VM_ALLOC_NOOBJ which means they are not busy. For now move the busy assert for the new page in vm_page_replace into the public api and out of the private api used by contig reclaim. Fix another issue where we would leak busy if the page could not be removed from pmap. Reported by: pho Discussed with: markj	2019-12-27 01:50:16 +00:00
Jeff Roberson	cc7ce83ae0	Further reduce the cacheline footprint of fast allocations by duplicating the zone size and flags fields in the per-cpu caches. This allows fast allocations to proceed only touching the single per-cpu cacheline and simplifies the common case when no ctor/dtor is specified. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22826	2019-12-25 20:57:24 +00:00
Jeff Roberson	376b1ba394	Optimize fast path allocations by storing bucket headers in the per-cpu cache area. This allows us to check on bucket space for all per-cpu buckets with a single cacheline access and fewer branches. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22825	2019-12-25 20:50:53 +00:00
Jeff Roberson	3639ac42e5	Fix a bug with _NUMA domains introduced in r339686. When M_NOWAIT is specified there was no loop termination condition in keg_fetch_slab(). Reported by: pho Reviewed by: markj	2019-12-25 19:26:35 +00:00
Jeff Roberson	7e1b379e1e	Don't unnecessarily relock the vm object after sleeps. This results in a surprising amount of object contention on loop restarts in fault. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22821	2019-12-24 18:38:06 +00:00
Doug Moore	b649c2ac34	Fix typo using RB_INITIALIZER. The macro RB_INITIALIZER ignores its argument, but is documented to require "&head" as argument to initialize "head". So using "_vm_phys_fictitious_tree" as the argument to initialize "vm_phys_fictitious_tree" is an inconsequential error, corrected here. Discussed with: alc	2019-12-22 21:53:05 +00:00
Jeff Roberson	419f0b1f95	Fix a bug introduced in r356002. Prior versions of this patchset had vm_page_remove() rather than !vm_page_wired() as the condition for free. When this changed back to wired the busy lock was leaked. Reported by: pho Reviewed by: markj	2019-12-22 20:35:50 +00:00
Jeff Roberson	3cf3b4e641	Make page busy state deterministic on free. Pages must be xbusy when removed from objects including calls to free. Pages must not be xbusy when freed and not on an object. Strengthen assertions to match these expectations. In practice very little code had to change busy handling to meet these rules but we can now make stronger guarantees to busy holders and avoid conditionally dropping busy in free. Refine vm_page_remove() and vm_page_replace() semantics now that we have stronger guarantees about busy state. This removes redundant and potentially problematic code that has proliferated. Discussed with: markj Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22822	2019-12-22 06:56:44 +00:00
Jeff Roberson	bef91632da	Move vm_fault busy logic into its own function for clarity and re-use by later changes. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22820	2019-12-22 04:21:16 +00:00
Mark Johnston	d07c571806	Fix VPO_UNMANAGED handling in vm_page_reclaim_run() after r353540. When allocating a replacement page we must clear VPO_UNMANAGED since we only ever reclaim pages from managed objects. vm_page_replace() does not handle this for us. Sprinkle some assertions to help catch this sort of issue. Reported by: pho Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22868	2019-12-21 19:04:05 +00:00
Mark Johnston	c2f22e9790	Fix the aflag shift on big-endian platforms after r355672. The structure offset is zero regardless of endianness. Reported by: brooks Pointy hat: markj	2019-12-18 01:56:38 +00:00
Jeff Roberson	61a74c5ccd	schedlock 1/4 Eliminate recursion from most thread_lock consumers. Return from sched_add() without the thread_lock held. This eliminates unnecessary atomics and lock word loads as well as reducing the hold time for scheduler locks. This will eventually allow for lockless remote adds. Discussed with: kib Reviewed by: jhb Tested by: pho Differential Revision: https://reviews.freebsd.org/D22626	2019-12-15 21:11:15 +00:00
Jeff Roberson	4bf95d00ce	Previously we did not support invalid pages in default objects. This means that if fault fails to progress and needs to restart the loop it must free the page it is working on and allocate again on restart. Resolve the few places that need to be modified to support this condition and simply deactivate the page. Presently, we only permit this when fault restarts for busy contention. This has an added benefit of removing some object trylocking in this case. While here consolidate some page cleanup logic into fault_page_free() and fault_page_release() to reduce redundant code and automate some teardown. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22653	2019-12-15 04:08:24 +00:00
Jeff Roberson	a808177864	Add a deferred free mechanism for freeing swap space that does not require an exclusive object lock. Previously swap space was freed on a best effort basis when a page that had valid swap was dirtied, thus invalidating the swap copy. This may be done inconsistently and requires the object lock which is not always convenient. Instead, track when swap space is present. The first dirty is responsible for deleting space or setting PGA_SWAP_FREE which will trigger background scans to free the swap space. Simplify the locking in vm_fault_dirty() now that we can reliably identify the first dirty. Discussed with: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22654	2019-12-15 03:15:06 +00:00
Jeff Roberson	d966c7615f	Slightly optimize locking in vm_map_copy_swap_entry(). Anonymous objects require the object lock to synchronize collapse. Other swap objects such as tmpfs do not. Reported by: mjg Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22747	2019-12-15 02:02:27 +00:00
Jeff Roberson	af00971419	Handle pagein clustering in vm_page_grab_valid() so that it can be used by exec_map_first_page(). This will also enable pagein clustering for other interested consumers (tmpfs, md, etc). Discussed with: alc Approved by: kib Differential Revision: https://reviews.freebsd.org/D22731	2019-12-15 02:00:32 +00:00
Ryan Libby	815db20425	uma dbg: flexible size for slab debug bitset too Recently (r355315) the size of the struct uma_slab bitset field us_free became dynamic instead of conservative. Now, make the debug bitset size dynamic too. The debug bitset is INVARIANTS-only, so in fact we don't care too much about the space savings that results from this, but enabling minimally-sized slabs on INVARIANTS builds is still important in order to be able to test new slab layouts effectively. Reviewed by: jeff (previous version), markj (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22759	2019-12-14 05:21:56 +00:00
Mark Johnston	325c4ced0d	Restore the reservation of boot pages for bucket zones after r355707. uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(), but before that assignment, startup_alloc() will use pages from the reserved pool, so the bucket zones themselves are still allocated using startup pages. Reviewed by: rlibby Reported by: Jenkins via lwhsu Differential Revision: https://reviews.freebsd.org/D22797	2019-12-13 18:28:01 +00:00
Ryan Libby	d82c8ffb16	Revert r355706 & r355710 The quick fix didn't work. I'll sort it out tomorrow. Revert r355710: "libmemstat: unbreak build" Revert r355706: "uma dbg: flexible size for slab debug bitset too"	2019-12-13 11:21:28 +00:00
Ryan Libby	f7af501519	uma: report slab efficiency Reviewed by: jeff Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22766	2019-12-13 09:32:09 +00:00
Ryan Libby	3182660a85	uma: delay bucket_init() until we might actually enable buckets This helps with a bootstrapping problem in upcoming work. We don't first enable buckets until uma_startup2(), so we can delay bucket creation until then. The other two paths to bucket_enable() are both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM) and one in uma_timeout() (first activated in uma_startup3()). Note that although some bucket functions are accessible before uma_startup2() (e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone. Discussed with: jeff Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22765	2019-12-13 09:32:03 +00:00
Ryan Libby	7508f15ff1	uma dbg: flexible size for slab debug bitset too Recently (r355315) the size of the struct uma_slab bitset field us_free became dynamic instead of conservative. Now, make the debug bitset size dynamic too. The debug bitset is INVARIANTS-only, so in fact we don't care too much about the space savings that results from this, but enabling minimally-sized slabs on INVARIANTS builds is still important in order to be able to test new slab layouts effectively. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22759	2019-12-13 09:31:59 +00:00
Mark Johnston	cbc080b4c4	Avoid relying on silent type casting in the native atomic_load_32. Reported by: np	2019-12-12 23:55:34 +00:00
Mark Johnston	6fbaf6859c	Implement atomic state updates using the new vm_page_astate_t structure. Introduce primitives vm_page_astate_load() and vm_page_astate_fcmpset() to operate on the 32-bit per-page atomic state. Modify vm_page_pqstate_fcmpset() to use them. No functional change intended. Introduce PGA_QUEUE_OP_MASK, a subset of PGA_QUEUE_STATE_MASK that only includes queue operation flags. This will be used in subsequent patches. Reviewed by: alc, jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22753	2019-12-12 21:13:20 +00:00
Doug Moore	037c0994bf	Extract code common to _vm_map_clip_start and _vm_map_clip_end into a function, vm_map_entry_clone, that can be invoked by each. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22760	2019-12-11 16:09:57 +00:00
Ryan Libby	6d204a6a0e	uma: pretty print zone flags sysctl Requested by: jeff Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22748	2019-12-11 06:50:55 +00:00
Mark Johnston	9be9ea420e	Add a helper function to the swapout daemon's deactivation code. vm_swapout_object_deactivate_pages() is renamed to vm_swapout_object_deactivate(), and the loop body is moved into the new vm_swapout_object_deactivate_page(). This makes the code a bit easier to follow and is in preparation for some functional changes. Reviewed by: jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22651	2019-12-10 18:15:20 +00:00
Mark Johnston	5cff1f4dc3	Introduce vm_page_astate. This is a 32-bit structure embedded in each vm_page, consisting mostly of page queue state. The use of a structure makes it easy to store a snapshot of a page's queue state in a stack variable and use cmpset loops to update that state without requiring the page lock. This change merely adds the structure and updates references to atomic state fields. No functional change intended. Reviewed by: alc, jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22650	2019-12-10 18:14:50 +00:00
Doug Moore	7887f00d2b	Revert r355505. The code that it allowed to compile has been removed.	2019-12-09 05:09:46 +00:00
Doug Moore	8b75b1ad0d	Define a vm_map method for user-space for advancing from a map entry to its successor in cases where examining a map entry requires a helper like kvm_read_all. Use that method, with kvm_read_all, to fix procstat_getfiles_kvm, which tries to find the successor now without using such a helper. This addresses a problem introduced by r355491. Reviewed by: markj (previous version) Discussed with: kib Differential Revision: https://reviews.freebsd.org/D22728	2019-12-08 22:33:51 +00:00
Mateusz Guzik	abd80ddb94	vfs: introduce v_irflag and make v_type smaller The current vnode layout is not smp-friendly by having frequently read data avoidably sharing cachelines with very frequently modified fields. In particular v_iflag inspected for VI_DOOMED can be found in the same line with v_usecount. Instead make it available in the same cacheline as the v_op, v_data and v_type which all get read all the time. v_type is avoidably 4 bytes while the necessary data will easily fit in 1. Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new flag field with a new value: VIRF_DOOMED. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D22715	2019-12-08 21:30:04 +00:00
Jeff Roberson	3b490537f4	Fix two problems with r355149. The sysctl name collision code assumed that zones would never be freed. In the case of tmpfs this was not true. While here test for the right bit to disable the keg related sysctls for zones that don't have kegs. Reported by: pho Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22655	2019-12-08 01:55:23 +00:00
Jeff Roberson	cff8481de4	It is safe to wire a page while the object is busy. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22636	2019-12-08 01:49:53 +00:00
Jeff Roberson	2306558c54	It is now safe to rename a page that is still on a queue. Allowing this is necessary for a forthcoming patch. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22636	2019-12-08 01:49:03 +00:00
Jeff Roberson	d8ad7b7d6a	Do not assert that the object lock is held in vm_object_set_writeable_dirty. A valid reference is all that is required. If we race with a deallocation we will harmlessly misidentify the type of an already dead object. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22636	2019-12-08 01:47:29 +00:00
Jeff Roberson	fb1d575ceb	Reduce duplication in grab functions by providing allocflags based inlines. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22635	2019-12-08 01:16:22 +00:00
Jeff Roberson	1e0701e1e5	Use a variant slab structure for offpage zones. This saves space in embedded slabs but also is an opportunity to tidy up code and add accessor inlines. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22609	2019-12-08 01:15:06 +00:00
Mark Johnston	c0829bb1d6	Add casts required by the 32-bit build after r355491.	2019-12-08 00:02:36 +00:00
Mark Johnston	3098cd73a3	Provide vm_map_entry traversal routines to userspace. This is required for now to allow libprocstat to compile. Discussed with: dougm	2019-12-07 19:36:40 +00:00
Mateusz Guzik	91caa9b8c1	vm: fix sysctl vm.kstack_cache_size change report Cache gets resized correctly, but sysctl reports the wrong number: # sysctl vm.kstack_cache_size=512 vm.kstack_cache_size: 128 -> 128 patched: vm.kstack_cache_size: 128 -> 512 Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22717 Fixes: r355002 "Revise the page cache size policy."	2019-12-07 17:28:41 +00:00
Doug Moore	c1ad5342a6	Remove the next and prev fields from vm_map_entry, to save a bit of space. Where the vm_map tree now has null pointers, store pointers to next and previous entries in right and left fields, making the binary tree threaded. Have the predecessor and successor functions compute what the prev and next fields previously stored. Reviewed by: markj, kib (previous version) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D21964	2019-12-07 17:14:33 +00:00
Justin Hibbits	caef3e1280	powerpc/pmap: NUMA-ize vm_page_array on powerpc Summary: This matches r351198 from amd64. This only applies to AIM64 and Book-E. On AIM64 it short-circuits with one domain, to behave similar to existing. Otherwise it will allocate 16MB huge pages to hold the page array, across all NUMA domains. On the first domain it will shift the page array base up, to "upper-align" the page array in that domain, so as to reduce the number of pages from the next domain appearing in this domain. After the first domain, subsequent domains will be allocated in full 16MB pages, until the final domain, which can be short. This means some inner domains may have pages accounted in earlier domains. On Book-E the page array is setup at MMU bootstrap time so that it's always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0 overhead for touching the vm_page_array, which reduces up to one TLB miss per array access. Since page_range (vm_page_startup()) is no longer used on Book-E but is on 32-bit AIM, mark the variable as potentially unused, rather than using a nasty #if defined() list. Reviewed by: luporl Differential Revision: https://reviews.freebsd.org/D21449	2019-12-07 03:34:03 +00:00
Mark Johnston	a6f21d15dd	Fix fault_type handling in vm_map_lookup(). Suppose that the map entry is wired, so that we later assign fault_type = entry->protection. Suppose further that we jump back to RetryLookup. Then fault_type will no longer contain the original fault protection mask, but instead that of the wired entry. Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> Reviewed by: kib MFC after: 3 days Github PR: https://github.com/freebsd/freebsd/pull/419 Differential Revision: https://reviews.freebsd.org/D22683	2019-12-06 23:39:08 +00:00
Mark Johnston	ed2f945a39	Fix an off-by-one error in vm_map_pmap_enter(). If the starting pindex is equal to object->size, there is nothing to do. This was harmless since the rest of vm_map_pmap_enter() has no effect when psize == 0. Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> Reviewed by: alc, dougm, kib MFC after: 1 week Github PR: https://github.com/freebsd/freebsd/pull/417 Differential Revision: https://reviews.freebsd.org/D22678	2019-12-04 19:46:48 +00:00
Andrew Turner	b75c4efcd2	Fix the signature for zone_import and zone_release These are cast to uma_import and uma_release functions. Use the signature for these in the zone functions. This was found with an experimental Kernel CFI. It will complain if the signature is different than what a function pointer expects. The simplest way to fix these is to correct the signature. Reviewed by: rlibby Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22671	2019-12-04 18:40:05 +00:00
Jeff Roberson	9b78b1f433	Use a precise bit count for the slab free items in UMA. This significantly shrinks embedded slab structures. Reviewed by: markj, rlibby (prior version) Differential Revision: https://reviews.freebsd.org/D22584	2019-12-02 22:44:34 +00:00
Jeff Roberson	0f9e06e18b	Fix a few places that free a page from an object without busy held. This is tightening constraints on busy as a precursor to lockless page lookup and should largely be a NOP for these cases. Reviewed by: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22611	2019-12-02 22:42:05 +00:00
Konstantin Belousov	67388836f3	Store the bottom of the shadow chain in OBJ_ANON object->handle member. The handle value is stable for all shadow objects in the inheritance chain. This avoids descending the shadow chain to get to the bottom of it in vm_map_entry_set_vnode_text(), and eliminates the corresponding object relocking, which appeared to be contended. Change vm_object_allocate_anon() and vm_object_shadow() to handle more of the cred/charge initialization for the new shadow object, in addition to setting up the handle. Reported by: jeff Reviewed by: alc (previous version), jeff (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22541	2019-12-01 20:43:04 +00:00
Jeff Roberson	886b90219a	Restore swap space accounting for non-anonymous swap objects. This was broken in r355082. Reduce some locking in nearby related object type checks. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22565	2019-11-29 19:57:49 +00:00
Jeff Roberson	f2410510db	Avoid acquiring the object lock if color is already set. It can not be unset until the object is recycled so this check is stable. Now that we can acquire the ref without a lock it is not necessary to group these operations and we can avoid it entirely in many cases. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22565	2019-11-29 19:49:20 +00:00
Jeff Roberson	26c4e9831b	Fix a perf regression from r355122. We can use a shared lock to drop the last ref on vnodes. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22565	2019-11-29 19:47:40 +00:00
Jeff Roberson	6d6a03d7a8	Handle large mallocs by going directly to kmem. Taking a detour through UMA does not provide any additional value. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22563	2019-11-29 03:14:10 +00:00
Doug Moore	85b7bedb15	Functions that call vm_map_splay_merge sometimes set data fields (e.g. root->left = NULL) to affect the behavior of that function. This change stops that data manipulation, and instead calls a pair of functions, one for the left direction and the other for the right, with the function called depending whether or not we currently null the root child in that direction to control the behavior of vm_map_splay_merge. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22589	2019-11-29 02:06:45 +00:00
Jeff Roberson	584061b480	Garbage collect the mostly unused us_keg field. Use appropriately named union members in vm_page.h to store the zone and slab. Remove some nearby dead code. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22564	2019-11-28 07:49:25 +00:00
Ryan Libby	35ec24f362	uma: move sysctl vm.uma defn out from under INVARIANTS Fix non-INVARIANTS builds after r355149. Reported by: Michael Butler <imb@protected-networks.net> Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22588	2019-11-28 04:15:16 +00:00
Jeff Roberson	20a4e15451	Implement a sysctl tree for uma zones to assist in debugging and provide more statistics than are exported via the ABI stable vmstat interface. Rename uz_count to uz_bucket_size because even I was confused by the name after returning to the source years later. Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22554	2019-11-28 00:19:09 +00:00
Jeff Roberson	0a81b4395e	Refactor uma_zfree_arg into several functions to make control flow more clear and icache usage cleaner. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22491	2019-11-27 23:19:06 +00:00
Doug Moore	1867d2f2e9	Inline some splay helper functions to improve performance on a micro-benchmark. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D22544	2019-11-27 21:00:44 +00:00
Ryan Libby	ca293436d1	uma: trash memory when ctor/dtor supplied too On INVARIANTS kernels, UMA has a use-after-free detection mechanism. This mechanism previously required that all of the ctor/dtor/uminit/fini arguments to uma_zcreate() be NULL in order to function. Now, it only requires that uminit and fini be NULL; now, the trash ctor and dtor will be called in addition to any supplied ctor or dtor. Also do a little refactoring for readability of the resulting logic. This enables use-after-free detection for more zones, and will allow for simplification of some callers that worked around the previous restriction (see kern_mbuf.c). Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20722	2019-11-27 19:49:55 +00:00
Jeff Roberson	a67d540832	Use atomics in more cases for object references. We now can completely omit the object lock if we are above a certain threshold. Hold only a single vnode reference when the vnode object has any ref > 0. This allows us to only lock the object and vnode on 0-1 and 1-0 transitions. Differential Revision: https://reviews.freebsd.org/D22452	2019-11-27 00:39:23 +00:00
Jeff Roberson	beb8beef81	Refactor uma_zalloc_arg(). It is a mess of gotos and code which doesn't make sense after many partial refactors. Attempt to make a smaller cache footprint for the fast path. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22470	2019-11-26 22:17:02 +00:00
Ryan Libby	6a14746c01	vm_object_collapse_scan_wait: drop locks before reacquiring Regression from r352174. In the vm_page_rename() failure case we forgot to unlock the vm object locks before sleeping and reacquiring them. Reviewed by: jeff Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22542	2019-11-25 07:38:31 +00:00
Jeff Roberson	4d987866e6	Move anonymous object copying for fork into its own routine and so that we can avoid locking non-anonymous objects. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22472	2019-11-25 07:13:05 +00:00
Doug Moore	2767c9f36a	Where 'current' is used to index over vm_map entries, use 'entry'. Where 'entry' is used to identify the starting point for iteration, use 'first_entry'. These are the naming conventions used in most of the vm_map.c code. Where VM_MAP_ENTRY_FOREACH can be used, do so. Squeeze a few lines to fit in 80 columns. Where lines are being modified for these reasons, look to remove style(9) violations. Reviewed by: alc, markj Differential Revision: https://reviews.freebsd.org/D22458	2019-11-25 02:19:47 +00:00
Konstantin Belousov	3236244936	Ignore object->handle for OBJ_ANON objects. Note that the change in vm_object_collapse() is arguably a correctness fix. We must not collapse into content-identity carrying objects. Reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22467	2019-11-24 19:18:12 +00:00
Konstantin Belousov	b631c36f0d	Record part of the owner struct thread pointer into busy_lock. Record as many bits from curthread into busy_lock as fit. Low bits for struct thread * representation are zero due to struct and zone alignment, and they leave space for busy flags (perhaps except statically allocated thread0). Upper bits are not very interesting for assert, and in most practical situations the recorded value should allow the owner to be identified manually with certainty. Assert that unbusy is performed by the owner, except in a few places where unbusy is done in the I/O completion handler. For this case, add _unchecked variants of asserts and unbusy primitives. Reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22298	2019-11-24 19:12:23 +00:00
Mark Johnston	9c770a27ce	Simplify vm_pageout_init_domain() and add a "big picture" comment. Stop subtracting 1024/200 from vmd_page_count/200. I cannot see how such precise accounting can make a difference on modern systems. Add some explanation of what the page daemon does and how it handles memory shortages. Reviewed by: dougm Discussed with: jeff, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22396	2019-11-22 16:31:43 +00:00
Mark Johnston	8fc2550837	Reclaim memory from UMA if the page daemon is struggling. Use the UMA reclaim thread to asynchronously drain all caches if there is a severe shortage in a domain. Otherwise we only trigger UMA reclamation every 10s even when the system has completely run out of memory. Stop entirely draining the caches when one domain falls below its min threshold. In some workloads it is normal for one NUMA domain to end up being nearly depleted by kernel memory allocations, for example for the ZFS ARC. The domainset iterators skip domains below the vmd_min_free threshold on the first iteration, so we should allow that mechanism to limit further depletion of the domain's free pages before taking the extreme step of calling uma_reclaim(UMA_RECLAIM_DRAIN_CPU). Discussed with: jeff MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22395	2019-11-22 16:31:30 +00:00
Mark Johnston	bf0d60af92	Update the checks in vm_page_zone_import(). - Remove the cnt == 1 check. UMA passes cnt == 1 when it has disabled per-CPU caching. In this case we might as well just allocate a single page and return it to the caller, since the caller is going to do exactly that anyway if the UMA cache allocation attempt fails. - Don't replenish caches if the domain is severely short on free pages. With large buckets we may otherwise quickly exacerbate a situation where the page daemon is failing to keep up. - Don't replenish caches if the calling thread belongs to the page daemon, which should avoid creating extra memory pressure when it is trying to free memory. Virtually all such allocations will occur in the context of laundering, where the laundry thread must allocate slabs for various swap and I/O-related UMA zones. Reviewed by: kib Discussed with: alc, jeff MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22394	2019-11-22 16:31:10 +00:00
Mark Johnston	003cf08ba9	Revise the page cache size policy. In r353734 the use of the page caches was limited to systems with a relatively large amount of RAM per CPU. This was to mitigate some issues reported with the system not able to keep up with memory pressure in cases where it had been able to do so prior to the addition of the direct free pool cache. This change re-enables those caches. The change modifies uma_zone_set_maxcache(), which was introduced specifically for the page cache zones. Rather than using it to limit only the full bucket cache, have it also set uz_count_max to provide an upper bound on the per-CPU cache size that is consistent with the number of items requested. Remove its return value since it has no use. Enable the page cache zones unconditionally, and limit them to 0.1% of the domain's pages. The limit can be overridden by the vm.pgcache_zone_max tunable as before. Change the item size parameter passed to uma_zcache_create() to the correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page cache buckets to be adaptively sized, like the rest of UMA's caches. This also causes the initial bucket size to be small, so only systems which benefit from large caches will get them. Reviewed by: gallatin, jeff MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22393	2019-11-22 16:30:47 +00:00
Mark Johnston	b378d29687	Fix locking in vm_reserv_reclaim_contig(). We were not properly handling the case where the trylock of the reservation fails, in which case we could leak the reservation lock. Introduce a marker reservation to implement precise scanning in vm_reserv_reclaim_contig(). Before, a race could result in early termination of the scan in rare situations. Use the marker's lock to serialize scans of the partpop queue so that a global marker structure can be used. Modify vm_reserv_reclaim_inactive() to handle the presence of a marker while minimizing the hold time of domain-global locks. Reviewed by: alc, jeff, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22392	2019-11-22 16:28:52 +00:00
Andrew Turner	09a65f9ff5	As with r354905 use uint16_t to store aflags on the stack and as function arguments as the aflags size in vm_page_t has increased. Sponsored by: DARPA, AFRL	2019-11-20 18:00:43 +00:00
Andrew Turner	ad216bc10d	Use atomic_load_16 to load aflags as it's a uint16_t after r354820. Sponsored by: DARPA, AFRL	2019-11-20 17:49:58 +00:00
Doug Moore	83704cc236	Instead of looking up a predecessor or successor to the current map entry, when that entry has been seen already, keep the already-looked-up value in a variable and use that instead of looking it up again. Approved by: alc, markj (earlier version), kib (earlier version) Differential Revision: https://reviews.freebsd.org/D22348	2019-11-20 16:06:48 +00:00
Jeff Roberson	71353f7a2f	When we set OFFPAGE to limit fragmentation we should also set VTOSLAB so that we avoid the hashtables. The hashtable is now only required if a zone is created with OFFPAGE specified initially, not internally. This flag signals to UMA that it can't touch the allocated memory and so can't store a slab pointer in the containing page. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22453	2019-11-20 01:57:33 +00:00
Jeff Roberson	51b867e56b	Only keep anonymous objects on shadow lists. This eliminates locking of globally visible objects when they are part of a backing chain. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22423	2019-11-20 00:31:14 +00:00
Jeff Roberson	7f935055d3	Remove unnecessary object locking from the vnode pager. Recent changes to busy/valid/dirty locking make these acquires redundant. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22186	2019-11-19 23:30:09 +00:00
Jeff Roberson	639676877b	Simplify anonymous memory handling with an OBJ_ANON flag. This eliminates redundant complicated checks and additional locking required only for anonymous memory. Introduce vm_object_allocate_anon() to create these objects. DEFAULT and SWAP objects now have the correct settings for non-anonymous consumers and so individual consumers need not modify the default flags to create super-pages and avoid ONEMAPPING/NOSPLIT. Reviewed by: alc, dougm, kib, markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D22119	2019-11-19 23:19:43 +00:00
Doug Moore	8ecbf14b74	Drop the extra argument from swp_pager_meta_ctl and have it do lookup only. Rename it swp_pager_meta_lookup. Stop checking for obj->type == swap there and assert it instead. Make the caller responsible for the obj->type check. Move the meta_ctl 'pop' functionality to swap_pager_unswapped, the only place that uses it, and assume obj->type == swap there too. Assisted by: ota_j.email.ne.jp Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22437	2019-11-19 08:06:31 +00:00
Mark Johnston	fe6d5344c2	Group per-domain reservation data in the same structure. We currently have the per-domain partially populated reservation queues and the per-domain queue locks. Define a new per-domain padded structure to contain both of them. This puts the queue fields and lock in the same cache line and avoids the false sharing within the old queue array. Also fix field packing in the reservation structure. In many places we assume that a domain index fits in 8 bits, so we can do the same there as well. This reduces the size of the structure by 8 bytes. Update some comments while here. No functional change intended. Reviewed by: dougm, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22391	2019-11-18 18:25:51 +00:00
Mark Johnston	3a2ba9974d	Widen the vm_page aflags field to 16 bits. We are now out of aflags bits, whereas the "flags" field only makes use of five of its sixteen bits, so narrow "flags" to eight bits. I have no intention of adding a new aflag in the near future, but would like to combine the aflags, queue and act_count fields into a single atomically updated word. This will allow vm_page_pqstate_cmpset() to become much simpler and is a step towards eliminating the use of the page lock array in updating per-page queue state. The change modifies the layout of struct vm_page, so bump __FreeBSD_version. Reviewed by: alc, dougm, jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22397	2019-11-18 18:22:41 +00:00
Doug Moore	abdab7b633	Add a helper function for testing a swap block and freeing it if empty. Submitted by: ota_j.email.ne.jp Approved by: alc, kib, dougm Differential Revision: https://reviews.freebsd.org/D22402	2019-11-17 18:38:37 +00:00
Konstantin Belousov	156e865494	Add elf image flag to disable stack gap. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22379	2019-11-17 14:54:07 +00:00
Doug Moore	bdb90e7613	The loop in vm_map_protect that verifies that all transition map entries are stabilized, repeatedly verifies the same entry. Check each entry in turn. Reviewed by: kib (code only), alc Tested by: pho MFC after: 7 days Differential Revision: https://reviews.freebsd.org/D22405	2019-11-17 06:50:36 +00:00
Doug Moore	7cdcf86360	Define wrapper functions vm_map_entry_{succ,pred} to act as wrappers around entry->{next,prev} when those are used for ordered list traversal, and use those wrapper functions everywhere. Where the next field is used for maintaining a stack of deferred operations, #define defer_next to make that different usage clearer, and then use the 'right' pointer instead of 'next' for that purpose. Approved by: markj Tested by: pho (as part of a larger patch) Differential Revision: https://reviews.freebsd.org/D22347	2019-11-13 15:56:07 +00:00
Doug Moore	467057fcd9	swap_pager_meta_free() frees allocated blocks in a way that exploits the sparsity of allocated blocks in a range, without issuing an "are you there?" query for every block in the range. swap_pager_copy() is not so smart. Modify the implementation of swap_pager_meta_free() slightly so that swap_pager_copy() can use that smarter implementation too. Based on an observation of: Yoshihiro Ota (ota_j.email.ne.jp) Reviewed by: kib,alc Tested by: pho Differential Revision: https://reviews.freebsd.org/D22280	2019-11-11 16:59:49 +00:00
Konstantin Belousov	08034d1006	Include cache zones into zone_foreach() where appropriate. The r354367 is reverted since it is subsumed by this, more complete, approach. Suggested by: markj Reviewed by: alc, glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22242	2019-11-10 09:25:19 +00:00
Doug Moore	461587dc9b	For vm_map, #defining DIAGNOSTIC to turn on full assertion-based consistency checking slows performance dramatically. This change reduces the number of assertions checked by completely walking the vm_map tree only when the write-lock is released, and only then if the number of modifications to the tree since the last walk exceeds the number of tree nodes. Reviewed by: alc, kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22163	2019-11-09 17:08:27 +00:00
Mark Johnston	c95f8ed885	Drop Giant before sleeping on a busy page. Before the page busy code was converted to make direct use of sleepqueues, this was handled by _sleep(). Reported by: glebius Reviewed by: kib Sponsored by: The FreeBSD Foundation	2019-11-07 18:26:29 +00:00
Mark Johnston	be801aaaef	Fix a race in release_page(). Since r354156 we may call release_page() without the page's object lock held, specifically following the page copy during a CoW fault. release_page() must therefore unbusy the page only after scheduling the requeue, to avoid racing with a free of the page. Previously, the object lock prevented this race from occurring. Add some assertions that were helpful in tracking this down. Reported by: pho, syzkaller Tested by: pho Reviewed by: alc, jeff, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22234	2019-11-06 16:59:16 +00:00
Konstantin Belousov	432fc36da1	Switch cache zones from early counters to real implementation. The early counter mock can only be used on the BSP for amd64; when APs try to update it, that causes random memory corruption. N.B. This is a temporary patch to plug the corruption for now, while a proper solution for handling cache zones in zone_foreach() is being developed. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation, Mellanox Technologies	2019-11-05 21:38:48 +00:00
Konstantin Belousov	afa7e88a18	vm_page_wire_mapped: explain why failure does not affect correctness. Reviewed by: markj (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22196	2019-10-30 17:33:17 +00:00
Jeff Roberson	67d0e29304	Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY flag and use the same system. This enables further fault locking improvements by allowing more faults to proceed with a shared lock. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22116	2019-10-29 21:06:34 +00:00
Jeff Roberson	51df53213c	Use atomics and a shared object lock to protect the object reference count. Certain consumers still need to guarantee a stable reference so we can not switch entirely to atomics yet. Exclusive lock holders can still modify and examine the refcount without using the ref api. Reviewed by: kib Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21598	2019-10-29 20:58:46 +00:00
Jeff Roberson	4b3e066539	Drop the object lock earlier in fault and don't relock it after pmap_enter(). Recent changes in object and page locking have enabled more lock pushdown. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22036	2019-10-29 20:46:25 +00:00
Gleb Smirnoff	42a621624d	Add couple more assertions to vm_pager_assert_in(). The bogus page is not allowed at ends of the request, and all non-bogus pages must be consecutive. Reviewed by: kib	2019-10-25 16:59:54 +00:00
Andrew Gallatin	0dc59d7632	Add a tunable to set the pgcache zone's maxcache. When it is set to 0 (the default), a heavy Netflix-style web workload suffers from heavy lock contention on the vm page free queue called from vm_page_zone_{import,release}() as the buckets are frequently drained. When setting the maxcache, this contention goes away. We should eventually try to autotune this, as well as make this zone eligible for uma_reclaim(). Reviewed by: alc, markj Not Objected to by: jeff Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22112	2019-10-24 18:39:05 +00:00
Mark Johnston	be2c561003	Modify release_page() to handle a missing fault page. r353890 introduced a case where we may call release_page() with fs.m == NULL, since the fault handler may now lock the vnode prior to allocating a page for a page-in. Reported by: jhb Reviewed by: kib MFC with: r353890 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22120	2019-10-23 20:39:21 +00:00
Mark Johnston	2f81c92e55	Check for bogus_page in vnode_pager_generic_getpages_done(). We now assert that a page is busy when updating its validity-tracking state, but bogus_page is not busied during a getpages operation. Reported by: syzkaller Reviewed by: alc, kib Discussed with: jeff MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22124	2019-10-23 18:00:22 +00:00
Mark Johnston	e6f1a58082	Verify identity after checking for WAITFAIL in vm_page_busy_acquire(). A caller that does not guarantee that a page's identity won't change while sleeping for a busy lock must specify either NOWAIT or WAITFAIL. Reported by: syzkaller Reviewed by: alc, kib Discussed with: jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22124	2019-10-23 17:58:19 +00:00
Konstantin Belousov	16b0c09225	Assert that vm_fault_lock_vnode() returns locked saved vnode. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D22113	2019-10-23 07:36:26 +00:00
Konstantin Belousov	5b87ecc643	Assert that vnode_pager_setsize() is called with the vnode exclusively locked except for filesystems that set the MNTK_VMSETSIZE_BUG flag. Set the flag for ZFS. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D21883	2019-10-22 16:21:24 +00:00
Konstantin Belousov	208b81bb05	Add VV_VMSIZEVNLOCK flag. The flag specifies that the vm_fault() handler should check the vnode's vm_object size under the vnode lock. It is converted into the object's OBJ_SIZEVNLOCK flag in vnode_pager_alloc(). Tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D21883	2019-10-22 16:09:25 +00:00
Konstantin Belousov	0ddd3082a4	vm_fault(): extract code to lock the vnode into a helper vm_fault_lock_vnode(). Tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D21883	2019-10-22 15:59:16 +00:00
Mark Johnston	1de9724e55	Avoid reloading bucket pointers in uma_vm_zone_stats(). The correctness of per-CPU cache accounting in that function is dependent on reading per-CPU pointers exactly once. Ensure that the compiler does not emit multiple loads of those pointers. Reported and tested by: pho Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22081	2019-10-22 14:20:06 +00:00
Mark Johnston	d307bdcc2c	Further constrain the use of per-CPU caches for free pages. In low memory conditions a significant number of pages may end up stuck in the caches, and currently these caches cannot be reaped, leading to spurious memory allocation failures and OOM kills. So: - Take into account the fact that we may cache up to two full buckets of pages per CPU, not just one. - Increase the amount of RAM required per CPU to enable the caches. This is a temporary measure until the page cache management policy is improved. PR: 241048 Reported and tested by: Kevin Oberman <rkoberman@gmail.com> Reviewed by: alc, kib Discussed with: jeff MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22040	2019-10-18 17:36:42 +00:00
Mark Johnston	f822c9e287	Apply mapping protections to preloaded kernel modules on amd64. With an upcoming change the amd64 kernel will map preloaded files RW instead of RWX, so the kernel linker must adjust protections appropriately using pmap_change_prot(). Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21860	2019-10-18 13:56:45 +00:00
Konstantin Belousov	303fa05a1f	swapon_check_swzone(): use already calculated static variables. Submitted by: ota@j.email.ne.jp MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22065	2019-10-17 13:49:47 +00:00
Mark Johnston	01cef4caa7	Remove page locking from pmap_mincore(). After r352110 the page lock no longer protects a page's identity, so there is no purpose in locking the page in pmap_mincore(). Instead, if vm.mincore_mapped is set to the non-default value of 0, re-lookup the page after acquiring its object lock, which holds the page's identity stable. The change removes the last callers of vm_page_pa_tryrelock(), so remove it. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21823	2019-10-16 22:03:27 +00:00
Mark Johnston	d0c9294b81	Correct the range boundaries used by kern_mincore(). Reported by: alc Sponsored by: Netflix	2019-10-16 21:47:58 +00:00
Jeff Roberson	fff5403f84	(5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the object lock in vm_page_set_validclean(). Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21595	2019-10-15 03:48:22 +00:00
Jeff Roberson	0012f373e4	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594	2019-10-15 03:45:41 +00:00
Jeff Roberson	205be21d99	(3/6) Add a shared object busy synchronization mechanism that blocks new page busy acquires while held. This allows code that would need to acquire and release a very large number of page busy locks to use the old mechanism where busy is only checked and not held. This comes at the cost of false positives but never false negatives which the single consumer, vm_fault_soft_fast(), handles. Reviewed by: kib Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21592	2019-10-15 03:41:36 +00:00
Jeff Roberson	8da1c09853	(2/6) Don't release xbusy in vm_page_remove(), defer to vm_page_free_prep(). This persists busy state across operations like rename and replace. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21549	2019-10-15 03:38:02 +00:00
Jeff Roberson	63e9755548	(1/6) Replace busy checks with acquires where it is trivial to do so. This is the first in a series of patches that promotes the page busy field to a first class lock that no longer requires the object lock for consistency. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21548	2019-10-15 03:35:11 +00:00
Doug Moore	32731f2eb1	Correct a transcription error that broke GENERIC introduced in r353496.	2019-10-14 17:51:57 +00:00
Doug Moore	721899b1f1	Move the definition of _vm_map_assert_consistent so that it can use vm_map_free_{left,right} rather than re-implementing them. Use the VM_MAP_FOREACH macro where applicable. Fix some indentation. Suggested by: kib (in a comment on D21964) Tested by: pho (as part of D21964) Differential Revision: https://reviews.freebsd.org/D22011	2019-10-14 17:15:42 +00:00
Leandro Lupori	0ecc478b74	[PPC64] Initial kernel minidump implementation Based on POWER9BSD implementation, with all POWER9 specific code removed and addition of new methods in PPC64 MMU interface, to isolate platform specific code. Currently, the new methods are implemented on pseries and PowerNV (D21643). Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D21551	2019-10-14 13:04:04 +00:00
Konstantin Belousov	c31cec4552	Restore nofaulting operations after r352807 The TDP_NOFAULTING flag should be checked in vm_fault(), not in vm_fault_trap(). Otherwise kernel accesses to userspace, like vn_io_fault(), enter vm locking when it should not. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D21992	2019-10-13 06:56:45 +00:00
Conrad Meyer	0223790f8d	Fix braino in r353429 cy@ points out that I got parameter order backwards between definition and invocation of the helper function. He is totally correct. The earlier version of this patch predated the XFree column so this is one I introduced, rather than the original author. Submitted by: cy Reported by: cy X-MFC-With: r353429	2019-10-11 06:02:03 +00:00
Conrad Meyer	46d70077be	ddb: Add CSV option, sorting to 'show (malloc|uma)' Add /i option for machine-parseable CSV output. This allows ready copy/ pasting into more sophisticated tooling outside of DDB. Add total zone size ("Memory Use") as a new column for UMA. For both, sort the displayed list on size (print the largest zones/types first). This is handy for quickly diagnosing "where has my memory gone?" at a high level. Submitted by: Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version) Sponsored by: Dell EMC Isilon	2019-10-11 01:31:31 +00:00
Doug Moore	2288078c5e	Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map. In case the implementation ever changes from using a chain of next pointers, then changing the macro definition will be necessary, but changing all the files that iterate over vm_map entries will not. Drop a counter in vm_object.c that would have an effect only if the vm_map entry count was wrong. Discussed with: alc Reviewed by: markj Tested by: pho (earlier version) Differential Revision: https://reviews.freebsd.org/D21882	2019-10-08 07:14:21 +00:00
Mark Johnston	4090e2170d	Assert that the PGA_{WRITEABLE,EXECUTABLE} flags do not leak. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21783	2019-10-07 23:31:17 +00:00
Mateusz Guzik	7b1fbc424a	vm: stop trylocking page queues in vm_page_pqbatch_submit About 11 minutes of poudriere -s -j 104 and probing on return value of trylocks reveals that over 10% of attempts fail, which in turn means there are more atomics performed than necessary. Trylocking was there to try preventing migration, but it's not very likely to happen if the lock is uncontested. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21925	2019-10-07 23:19:09 +00:00
Konstantin Belousov	df08823d07	Improve MD page fault handlers. Centralize calculation of signal and ucode delivered on unhandled page fault in new function vm_fault_trap(). MD trap_pfault() now almost always uses the signal numbers and error codes calculated in consistent MI way. This introduces the protection fault compatibility sysctls to all non-x86 architectures which did not have that bug, but apparently they were already much more wrong in selecting delivered signals on protection violations. Change the delivered signal for accesses to mapped area after the backing object was truncated. According to POSIX description for mmap(2): The system shall always zero-fill any partial page at the end of an object. Further, the system shall never write out any modified portions of the last page of an object which are beyond its end. References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. An implementation may generate SIGBUS signals when a reference would cause an error in the mapped object, such as out-of-space condition. Adjust according to the description, keeping the existing compatibility code for SIGSEGV/SIGBUS on protection failures. For situations where kernel cannot handle page fault due to resource limit enforcement, SIGBUS with a new error code BUS_OBJERR is delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on amd64 due to protection key access violation. vm_fault_hold() is renamed to vm_fault(). Fixed some nits in trap_pfault()s like mis-interpreting Mach errors as errnos. Removed unneeded truncations of the fault addresses reported by hardware. PR: 211924 Reviewed by: alc Discussed with: jilles, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21566	2019-09-27 18:43:36 +00:00
Mark Johnston	7cc833c598	Fix a race in vm_page_swapqueue(). vm_page_swapqueue() atomically transitions a page between queues. To do so, it must hold the page queue lock for the old queue. However, once the queue index has been updated, the queue lock no longer protects the page's queue state. Thus, we must speculatively remove the page from the old queue before committing the queue state update, and roll back if the update fails. Reported and tested by: pho Reviewed by: kib Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21791	2019-09-27 16:46:08 +00:00
Mark Johnston	87e93ea6c3	Fix object locking in vm_object_unwire() after r352174. Now, vm_page_busy_sleep() expects the page's object to be locked. vm_object_unwire() does some unusual lazy locking of the object chain and keeps objects locked until a busy page is encountered or the loop terminates. When a busy page is encountered, rather than unlocking all but the "bottom-level" object, we must instead skip the object to which "tm" belongs. Reported and tested by: pho Reviewed by: kib Discussed with: jeff Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21790	2019-09-27 16:41:34 +00:00
Mark Johnston	2b93f779d2	Add some counters for per-VM page events. For now, just count batched page queue state operations. vm.stats.page.queue_ops counts the number of batch entries that successfully completed, while queue_nops counts entries that had no effect, which occurs when the queue operation had been completed before the batch entry was processed. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21782	2019-09-25 17:08:35 +00:00
Mark Johnston	b119329d81	Complete the removal of the "wire_count" field from struct vm_page. Convert all remaining references to that field to "ref_count" and update comments accordingly. No functional change intended. Reviewed by: alc, kib Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21768	2019-09-25 16:11:35 +00:00
Allan Jude	4a9c211af5	sys/vm/vm_glue.c: Incorrect function name in panic string Use __func__ to avoid this issue in the future. Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> Reviewed by: markj, emaste Obtained from: https://github.com/freebsd/freebsd/pull/410	2019-09-19 07:28:24 +00:00
Doug Moore	1399b98ebd	Remove dead code from vm_map_unlink_entry made dead by r351476, and also a no-longer-used enumerant. Reviewed by: alc Approved by: markj (mentor, implicit) Tested by: pho (as part of a larger change) Differential Revision: https://reviews.freebsd.org/D21668	2019-09-17 02:53:59 +00:00
Mateusz Guzik	a8c8e44bf0	vfs: manage mnt_ref with atomics. A new primitive is introduced to denote sections which can operate locklessly on aspects of struct mount, but which can also be disabled if necessary. This provides an opportunity to start scaling common case modifications while providing stable state of the struct when facing unmount, write suspension or other events. mnt_ref is the first counter to start being managed in this manner with the intent to make it per-cpu. Reviewed by: kib, jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21425	2019-09-16 21:31:02 +00:00
Mark Johnston	38547d5980	Assert that the refcount value is not VPRC_BLOCKED in vm_page_drop(). VPRC_BLOCKED is a refcount flag used to indicate that a thread is tearing down mappings of a page. When set, it causes attempts to wire a page via a pmap lookup to fail. It should never represent the last reference to a page, so assert this. Suggested by: kib Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639	2019-09-16 15:16:48 +00:00
Mark Johnston	923da43e7c	Fix a race in vm_page_dequeue_deferred_free() after r352110. This function loaded the page's queue index before setting PGA_DEQUEUE. In this window the page daemon may have deactivated the page, updating its queue index. Make the operation atomic using vm_page_pqstate_cmpset(); the page daemon will not modify the page once it observes that PGA_DEQUEUE is set. Reported and tested by: pho Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639	2019-09-16 15:12:49 +00:00
Mark Johnston	47aef898ea	Fix a page leak in vm_page_reclaim_run(). After r352110 the attempt to remove mappings of the page being replaced may fail if the page is wired. In this case we must free the replacement page. Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639	2019-09-16 15:09:31 +00:00
Mark Johnston	e8bcf6966b	Revert r352406, which contained changes I didn't intend to commit.	2019-09-16 15:04:45 +00:00
Mark Johnston	41fd4b9422	Fix a couple of nits in r352110. - Remove a dead variable from the amd64 pmap_extract_and_hold(). - Fix grammar in the vm_page_wire man page. Reported by: alc Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639	2019-09-16 15:03:12 +00:00
Hans Petter Selasky	11b57401e6	Use REFCOUNT_COUNT() to obtain refcount where appropriate. Refcount waiting will set some flag bits in the refcount value. Make sure these bits get cleared by using the REFCOUNT_COUNT() macro to obtain the actual refcount. Differential Revision: https://reviews.freebsd.org/D21620 Reviewed by: kib@, markj@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-09-12 16:26:59 +00:00
Jeff Roberson	c75757481f	Replace redundant code with a few new vm_page_grab facilities: - VM_ALLOC_NOCREAT will grab without creating a page. - vm_page_grab_valid() will grab and page in if necessary. - vm_page_busy_acquire() automates some busy acquire loops. Discussed with: alc, kib, markj Tested by: pho (part of larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21546	2019-09-10 19:08:01 +00:00
Jeff Roberson	4cdea4a853	Use the sleepq lock rather than the page lock to protect against wakeup races with page busy state. The object lock is still used as an interlock to ensure that the identity stays valid. Most callers should use vm_page_sleep_if_busy() to handle the locking particulars. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21255	2019-09-10 18:27:45 +00:00
Mark Johnston	fee2a2fa39	Change synchronization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficient as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accommodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Konstantin Belousov	0e79619e1e	vm_object_deallocate(): Remove no longer needed code. We track text mappings explicitly, there is no removal of the text refs on the object deallocate any more, so tmpfs objects should not be treated specially. Doing so causes excess deref. Reported and tested by: gallatin Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 16:01:45 +00:00
Konstantin Belousov	4f49b435c8	vm_object_coalesce(): avoid extending any nosplit objects, not only that which back tmpfs nodes. Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21560	2019-09-07 15:58:48 +00:00
Konstantin Belousov	bf5661f4a1	madvise(MADV_FREE): Quick fix to time rewind. Don't free pages in a shadowing object. While this degrades MADV_FREE to a no-op (and we could, instead, choose to fall back to MADV_DONTNEED, at the cost of changing pmap_madvise), this is presently considered a temporary fix. We may prefer to risk a little fragmentation of the map by creating a zero/OBJT_DEFAULT entry over top of the existing object and, simultaneously, revert to the existing marking any pages in the former shadowing object in the advised region as reclaimable. At least one consumer of MADV_FREE (snmalloc) may use mmap() to construct zeroed pages "eventually" here anyway, so the fragmentation may be coming anyway. Submitted by: Nathaniel Filardo <nwf20@cl.cam.ac.uk> PR: 240061 Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21517	2019-09-04 20:28:16 +00:00
Kyle Evans	fe7bcbaf50	vm pager: writemapping accounting for OBJT_SWAP Currently writemapping accounting is only done for vnode_pager which does some accounting on the underlying vnode. Extend this to allow accounting to be possible for any of the pager types. New pageops are added to update/release writecount that need to be implemented for any pager wishing to do said accounting, and we implement these methods now for both vnode_pager (unchanged) and swap_pager. The primary motivation for this is to allow other systems with OBJT_SWAP objects to check if their objects have any write mappings and reject operations with EBUSY if so. posixshm will be the first to do so in order to reject adding write seals to the shmfd if any writable mappings exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456	2019-09-03 20:31:48 +00:00
Konstantin Belousov	fe69291ff4	Add procctl(PROC_STACKGAP_CTL) It allows a process to request that stack gap was not applied to its stacks, retroactively. Also it is possible to control the gaps in the process after exec. PR: 239894 Reviewed by: alc Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21352	2019-09-03 18:56:25 +00:00
Mark Johnston	7cdeaf3309	Add preliminary support for atomic updates of per-page queue state. Queue operations on a page use the page lock when updating the page to reflect the desired queue state, and the page queue lock when physically enqueuing or dequeuing a page. Multiple pages share a given page lock, but queue state is per-page; this false sharing results in heavy lock contention. Take a small step towards the use of atomic_cmpset to synchronize updates to per-page queue state by introducing vm_page_pqstate_cmpset() and using it in the page daemon. In the longer term the plan is to stop using the page lock to protect page identity and rely only on the object and page busy locks. However, since the page daemon avoids acquiring the object lock except when necessary, some synchronization with a concurrent free of the page is required. vm_page_pqstate_cmpset() can be used to ensure that queue state updates are successful only if the page is not scheduled for a dequeue, which is sufficient for the page daemon. Add vm_page_swapqueue(), which moves a page from one queue to another using vm_page_pqstate_cmpset(). Use it in the active queue scan, which does not use the object lock. Modify vm_page_dequeue_deferred() to use vm_page_pqstate_cmpset() as well. Reviewed by: kib Discussed with: jeff Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21257	2019-09-03 14:29:58 +00:00
Mark Johnston	9d75f0dc75	Map the vm_page array into KVA on amd64. r351198 allows the kernel to use domain-local memory to back the vm_page array (up to 2MB boundaries) and reserves a separate PML4 entry for that purpose. One consequence of that change is that the vm_page array is no longer present in minidumps, which only adds pages mapped above VM_MIN_KERNEL_ADDRESS. To avoid the friction caused by having kernel data structures mapped below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry. Reviewed by: kib Discussed with: jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21491	2019-09-03 13:18:51 +00:00
Mark Johnston	08cfa56ea3	Extend uma_reclaim() to permit different reclamation targets. The page daemon periodically invokes uma_reclaim() to reclaim cached items from each zone when the system is under memory pressure. This is important since the size of these caches is unbounded by default. However it also results in bursts of high latency when allocating from heavily used zones as threads miss in the per-CPU caches and must access the keg in order to allocate new items. With r340405 we maintain an estimate of each zone's usage of its (per-NUMA domain) cache of full buckets. Start making use of this estimate to avoid reclaiming the entire cache when under memory pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only items in excess of the estimate are reclaimed. Draining a zone reclaims all of the cached full buckets (the previous behaviour of uma_reclaim()), and may further drain the per-CPU caches in extreme cases. Now, when under memory pressure, the page daemon will trim zones rather than draining them. As a result, heavily used zones do not incur bursts of bucket cache misses following reclamation, but large, unused caches will be reclaimed as before. Reviewed by: jeff Tested by: pho (an earlier version) MFC after: 2 months Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16667	2019-09-01 22:22:43 +00:00
Konstantin Belousov	6470c8d3db	Rework v_object lifecycle for vnodes. The current implementation of vnode_create_vobject() and vnode_destroy_vobject() is written so that it is prepared to handle the vm object destruction for a live vnode. Practically, no filesystems use this, except for some remnants that were present in UFS till today. One of the consequences of that model is that each filesystem must call vnode_destroy_vobject() in VOP_RECLAIM() or earlier; as a result, all of them get rid of the v_object in reclaim. Move the call to vnode_destroy_vobject() to vgonel() before VOP_RECLAIM(). This makes v_object stable: either the object is NULL, or it is a valid vm object till the vnode reclamation. Remove code from vnode_create_vobject() to handle races with the parallel destruction. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21412	2019-08-29 07:50:25 +00:00
Mateusz Guzik	2614bd9699	vm: only lock tmpfs vnode shared in vm_object_deallocate Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21455	2019-08-28 19:28:27 +00:00
Mark Johnston	b5d239cb97	Wire pages in vm_page_grab() when appropriate. uiomove_object_page() and exec_map_first_page() would previously wire a page after having grabbed it. Ask vm_page_grab() to perform the wiring instead: this removes some redundant code, and is cheaper in the case where the requested page is not resident since the page allocator can be asked to initialize the page as wired, whereas a separate vm_page_wire() call requires the page lock. In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of vm_page_unwire(PQ_NONE). The latter ensures that the page is dequeued before returning, but this is unnecessary since vm_page_free() will trigger a batched dequeue of the page. Reviewed by: alc, kib Tested by: pho (part of a larger patch) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21440	2019-08-28 16:08:06 +00:00
Mark Johnston	a70e17eeca	Fix a few nits in vm_pqbatch_process_page(). - Don't bother masking off non-queue state flags when loading the page's atomic state, since it is only required for one of the function's assertions. Update the assertion instead. - Remove an incorrect comment regarding synchronization with the page daemon. The page daemon only ever checks for PGA_ENQUEUED with the page queue lock held. - When clearing requeue flags, only clear the flags that have been acted upon. Reviewed by: kib (previous version) Discussed with: alc Tested by: pho (part of a larger patch) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21368	2019-08-26 20:20:10 +00:00
Mark Johnston	b48d4efe75	Handle UMA_ANYDOMAIN in kstack_import(). The kernel thread stack zone performs first-touch allocations by default, and must handle the case where the local memory domain is empty. For most UMA zones this is handled in the keg layer, but cache zones currently must implement a policy for this case. Simply use a round-robin policy if UMA_ANYDOMAIN is passed. Reported and tested by: bcran Reviewed by: kib Sponsored by: The FreeBSD Foundation	2019-08-25 21:14:46 +00:00
Konstantin Belousov	783a68aa33	Move OBJT_VNODE specific code from vm_object_terminate() to vnode_destroy_vobject(). Reviewed by: alc, jeff (previous version), markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21357	2019-08-25 13:26:06 +00:00
Doug Moore	83ea714f4f	vm_map_simplify_entry considers merging an entry with its two neighbors, and is used in a way so that if entries a and b cannot be merged, we consider them twice, first not-merging a with its successor b, and then not-merging b with its predecessor a. This change replaces vm_map_simplify_entry with vm_map_try_merge_entries, which compares two adjacent entries only, and uses it to avoid duplicated merge-checks. Tested by: pho Reviewed by: alc Approved by: markj (implicit) Differential Revision: https://reviews.freebsd.org/D20814	2019-08-25 07:06:51 +00:00
Konstantin Belousov	a7751d328a	Make stack grow use the same gap as stack create. Store stack_guard_page * PAGE_SIZE into the gap->next_read field at the time of the stack creation. This makes the used guard size consistent between stack creation and stack grow time. Suggested by: alc Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21384	2019-08-24 14:29:13 +00:00
Mateusz Guzik	5b596b9fa5	Remove the obsolete pcpu_zone_ptr zone. It was only used by flowtable (removed in r321618). Sponsored by: The FreeBSD Foundation	2019-08-24 00:01:19 +00:00
Mark Johnston	f93670b7b9	Stop clearing page flags in vm_page_pqbatch_submit(). All existing callers guarantee that the page does not have a pre-existing dequeue pending. Thus, if the page is dequeued before pqbatch_submit() acquires the page queue lock, we do not need to do anything since vm_page_dequeue_complete() takes care of clearing all page queue state flags for us. With this change, vm_page_pqbatch_submit() has the nice property that it does not directly modify any fields in the page structure. Reviewed by: alc, kib Tested by: pho (part of a larger change) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21372	2019-08-23 19:53:11 +00:00
Mark Johnston	386eba08bd	Make vm_pqbatch_submit_page() externally visible. It will become useful for the page daemon to be able to directly create a batch queue entry for a page, and without modifying the page structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit() to keep the namespace consistent. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21369	2019-08-23 19:49:29 +00:00
Mark Johnston	8b90607f20	Simplify vm_page_dequeue() and fix an assertion. - Add a vm_pagequeue_remove() function to physically remove a page from its queue and update the queue length. - Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue() return NULL for dequeued pages. - Avoid unnecessarily reloading the queue index if vm_page_dequeue() loses a race with a concurrent queue operation. - Correct an always-true assertion: vm_page_dequeue() may be called from the page allocator with the page unlocked. The assertion m->order == VM_NFREEORDER simply tests whether the page has been removed from the vm_phys free lists; instead, check whether the page belongs to an object. Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21341	2019-08-21 16:11:12 +00:00
Mark Johnston	acad79e66f	Unconditionally enable debug.vm_lowmem. It is useful for testing purposes to be able to drain UMA caches, so do not limit the sysctl to DIAGNOSTIC kernels. MFC after: 1 week Sponsored by: Netflix	2019-08-21 16:01:17 +00:00
Mark Johnston	930b195263	Don't requeue active pages in vm_swapout_object_deactivate_pages(). As of r332974 the page daemon does not requeue pages during a scan of the active queue, so there is not much value in doing so here either. Reviewed by: alc, dougm, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21343	2019-08-21 15:52:10 +00:00
Jeff Roberson	cf27e0d125	Use an atomic reference count for paging in progress so that callers do not require the object lock. Reviewed by: markj Tested by: pho (as part of a larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21311	2019-08-19 23:09:38 +00:00
Jeff Roberson	4153054a7c	Permit vm_pager_has_page() to run with a shared lock. Introduce VM_OBJECT_DROP/VM_OBJECT_PICKUP to handle functions that are called with uncertain lock state. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21310	2019-08-19 22:25:28 +00:00
Jeff Roberson	3e5e1b5135	Allocate amd64's page array using pages and page directory pages from the NUMA domain that the pages describe. Patch original from gallatin. Reviewed by: kib Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21252	2019-08-18 23:07:56 +00:00
Konstantin Belousov	bb9e2184f0	Change locking requirements for VOP_UNSET_TEXT(). Require the vnode to be locked for the VOP_UNSET_TEXT() call. This will be used by the following bug fix for a tmpfs issue. Tested by: sbruno, pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-18 20:24:52 +00:00
Jeff Roberson	3921068f1e	Remove unnecessary debugging from r351181 that caused powerpc build to fail. Tested by: make universe TARGETS=powerpc	2019-08-18 08:07:31 +00:00
Jeff Roberson	be3f5f298b	vm_phys_avail_find is only used on NUMA kernels. Fix a build error.	2019-08-18 07:43:15 +00:00
Jeff Roberson	b7565d44df	Encapsulate phys_avail manipulation in a set of simple routines. Add a NUMA aware boot time memory allocator that will be used to allocate early domain correct structures. Code partially submitted by gallatin. Reviewed by: gallatin, kib Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21251	2019-08-18 07:06:31 +00:00
Aleksandr Rybalko	6b821a7455	Check paddr for overflow. Fix a panic when initializing the "vm reserv" per-superpage lock in the case where RAM ends at the upper boundary of the address space. Observed on ARM32 board BPI-R2 (2GB RAM 0x80000000-0xffffffff). PR: 235362 Reviewed by: kib, markj, alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D21272	2019-08-16 19:27:05 +00:00
Konstantin Belousov	245139c69d	Fix OOM handling of some corner cases. In addition to pagedaemon initiating OOM, also do it from the vm_fault() internals. Namely, if the thread waits for a free page to satisfy page fault some preconfigured amount of time, trigger OOM. These triggers are rate-limited, due to a usual case of several threads of the same multi-threaded process to enter fault handler simultaneously. The faults from pagedaemon threads participate in the calculation of OOM rate, but are not under the limit. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13671	2019-08-16 09:43:49 +00:00
Jeff Roberson	2194393787	Move phys_avail definition into MI code. It is consumed in the MI layer and doing so adds more flexibility with less redundant code. Reviewed by: jhb, markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21250	2019-08-16 00:45:14 +00:00
Doug Moore	504f5e294e	swap_pager.c reserves 2 blocks for a bsd label. Change that 2 to the expression howmany(BBSIZE, PAGE_SIZE), where BBSIZE is the size of the boot block area. That can be less than 2 if PAGE_SIZE is big. swapon(8) has an option to trim (delete) all the blocks of a device at startup. However, if the first of those blocks is a bsd label, then trimming those blocks is destructive. Change swapon to leave the first BBSIZE bytes untrimmed. Update manual pages to reflect changes in swapon and how it may be used, especially in association with savecore. Reviewed by: alc Approved by: markj (mentor) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D21191	2019-08-15 02:30:44 +00:00
Konstantin Belousov	10ae16c7fe	Fix stack grow for init. During early stages of kern_exec(), including strings copyout, p_textvp for init is NULL. This prevented stack grow from working for init execution. Without stack gap enabled, initial stack segment size is enough for strings passed by kernel to init. With the gap enabled, the used address might fall out of the initial segment, which kills init. Exclude initproc from the check for contexts which should not cause stack grow in the target map. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-08 16:48:19 +00:00
Jeff Roberson	0b26119b21	Cache kernel stacks in UMA. This gives us NUMA support, better concurrency, and more statistics. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20931	2019-08-06 23:15:34 +00:00
Jeff Roberson	eda1b01647	Implement a MINBUCKET zone flag so we can use minimal caching on zones that may be expensive to cache. Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20930	2019-08-06 23:04:59 +00:00
Jeff Roberson	c168508655	Add two new kernel options to control memory locality on NUMA hardware. - UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that was freed on a different domain from where it was allocated. This is only used for UMA_ZONE_NUMA (first-touch) zones. - UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all zones. This tries to maintain locality for kernel memory. Reviewed by: gallatin, alc, kib Tested by: pho, gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20929	2019-08-06 21:50:34 +00:00
Mark Johnston	98549e2dc6	Centralize the logic in vfs_vmio_unwire() and sendfile_free_page(). Both of these functions atomically unwire a page, optionally attempt to free the page, and enqueue or requeue the page. Add functions vm_page_release() and vm_page_release_locked() to perform the same task. The latter must be called with the page's object lock held. As a side effect of this refactoring, the buffer cache will no longer attempt to free mapped pages when completing direct I/O. This is consistent with the handling of pages by sendfile(SF_NOCACHE). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20986	2019-07-29 22:01:28 +00:00
Doug Moore	23612f0df3	In swap_pager_putpages, move the initialization of a free-blocks counter, and the final freeing of freed swap blocks, outside the region where an object lock is held. Correct some style(9) and spelling errors. Change a panic() to a KASSERT(). Change a boolean_t to a bool. Suggested by: alc Reviewed by: alc Approved by: kib, markj (mentors) Differential Revision: https://reviews.freebsd.org/D21093	2019-07-28 19:32:23 +00:00
Mark Johnston	b16e57a6c9	Rename vm_page_{import,release}() to vm_page_zone_{import,release}(). I would like to use the name vm_page_release() for a different purpose, and vm_page_{import,release}() are local to vm_page.c. Reviewed by: kib MFC after: 1 week	2019-07-20 18:25:41 +00:00
Doug Moore	312df2c1dd	Define vm_map_entry_in_transition to handle an in-transition map entry, combining code currently in vm_map_unwire and vm_map_wire_locked into a single function, called by each of them for entries in transition. Discussed with: kib, markj Reviewed by: alc Approved by: kib, markj (mentors, implicit) Tested by: pho Differential Revision: https://reviews.freebsd.org/D20833	2019-07-19 20:47:35 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
Mark Johnston	46736e306c	Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set. Pages with PG_PCPU_CACHE set cannot have been allocated from a reservation, so as an optimization, skip the call to vm_reserv_free_page() in this case. Otherwise, the access of the corresponding reservation structure often results in a cache miss. Reviewed by: alc, kib Discussed with: jeff MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20859	2019-07-08 19:02:40 +00:00
Mark Johnston	d9a73522e3	Add a per-CPU page cache per VM free pool. Some workloads benefit from having a per-CPU cache for VM_FREEPOOL_DIRECT pages. Reviewed by: dougm, kib Discussed with: alc, jeff MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20858	2019-07-08 18:56:30 +00:00
Doug Moore	7b9bcad939	A style-related change, r349791, made unclear the meaning of a comment. Rewrite that comment to improve its clarity. Reported by: cem Reviewed by: alc, cem Approved by: kib, markj (mentors, implicit) Differential Revision: https://reviews.freebsd.org/D20871	2019-07-07 06:57:04 +00:00
Doug Moore	0cab71bcee	Fix style(9) violations involving division by PAGE_SIZE. Reviewed by: alc Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20847	2019-07-06 15:55:16 +00:00
Doug Moore	31c82722c1	Change blist_next_leaf_alloc so that it can examine more than one leaf after the one where the possible block allocation begins, and allocate a larger number of blocks than the current limit. This does not affect the limit on minimum allocation size, which still cannot exceed BLIST_MAX_ALLOC. Use this change to modify swp_pager_getswapspace and its callers, so that they can allocate more than BLIST_MAX_ALLOC blocks if they are available. Tested by: pho Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20579	2019-07-06 06:15:03 +00:00
Doug Moore	56948d177e	Based on work posted at https://reviews.freebsd.org/D13484 , change swap_pager_swapoff_object and swp_pager_force_pagein so that they can page in multiple pages at a time to a swap device, rather than doing one I/O operation for each page. Tested by: pho Submitted by: ota_j.email.ne.jp (Yoshihiro Ota) Reviewed by: alc, markj, kib Approved by: kib, markj (mentors) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20635	2019-07-05 16:49:34 +00:00
Doug Moore	d2860f22a4	Move an assignment, drop a label, and change gotos to break statements in vm_map_unwire. The code generated on amd64 is unchanged. Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20850	2019-07-04 19:25:30 +00:00
Doug Moore	b71f9b0de6	Replace a 'goto' with an 'else' in vm_map_wire_locked. Reviewed by: alc Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20855	2019-07-04 19:17:55 +00:00
Doug Moore	9a0cdf9440	Change boolean_t variables in vm_map_unwire and vm_map_wire_locked to bool. Drop result variable. Add holes_ok bool to replace repeated masking of flags parameter. Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20846	2019-07-04 19:12:13 +00:00
Doug Moore	723413be0c	Drop a temp variable from vm_map_insert, with no effect on the resulting amd64 machine code. Reviewed by: alc Approved by: kib, markj (mentors, implicit) Differential Revision: https://reviews.freebsd.org/D20849	2019-07-04 18:28:49 +00:00
Doug Moore	38e220e8df	Eliminate a goto and a label in vm_map_wire_locked by inserting an 'else'. Reviewed by: alc Approved by: kib, markj (mentors, implicit) Differential Revision: https://reviews.freebsd.org/D20845	2019-07-03 22:41:54 +00:00
Ed Maste	b93a053ca2	correct pmap_ts_referenced return type pmap_ts_referenced returns a count, not a boolean, and is supposed to have int as the return type not boolean_t. This worked previously because boolean_t is an int typedef. Discussed with: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-07-03 19:59:56 +00:00
Mark Johnston	d70f0ab38d	Cache the next queue element when traversing a page queue. When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element invalidates its queue linkage pointers. vm_pageout_collect_batch() was relying on these pointers remaining valid after a removal, so modify it to fetch the next queued page before dequeuing the current page. Submitted by: Don Morris <dgmorris@earthlink.net> Reviewed by: cem, vangyzen MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20842	2019-07-03 18:46:39 +00:00
Mark Johnston	9f74cdbf78	Mark pages allocated from the per-CPU cache. Only free pages to the cache when they were allocated from that cache. This mitigates rapid fragmentation of physical memory seen during poudriere's dependency calculation phase. In particular, pages belonging to broken reservations are no longer freed to the per-CPU cache, so they get a chance to coalesce with freed pages during the break. Otherwise, the optimized CoW handler may create object chains in which multiple objects contain pages from the same reservation, and the order in which we do object termination means that the reservation is broken before all of those pages are freed, so some of them end up in the per-CPU cache and thus permanently fragment physical memory. The flag may also be useful for eliding calls to vm_reserv_free_page(), thus avoiding memory accesses for data that is likely not present in the CPU caches. Reviewed by: alc Discussed with: jeff MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20763	2019-07-02 19:51:40 +00:00
Konstantin Belousov	5dc7e31a09	Control implicit PROT_MAX() using procctl(2) and the FreeBSD note feature bit. In particular, allocate the bit to opt-out the image from implicit PROTMAX enablement. Provide procctl(2) verbs to set and query implicit PROTMAX handling. The knobs mimic the same per-image flag and per-process controls for ASLR. Reviewed by: emaste, markj (previous version) Discussed with: brooks Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D20795	2019-07-02 19:07:17 +00:00
Konstantin Belousov	3730695151	Use traditional 'p' local to designate td->td_proc in kern_mmap. Reviewed by: emaste, markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D20795	2019-07-02 19:01:14 +00:00


4507 Commits