freebsd-skq

Author	SHA1	Message	Date
Andrew Gallatin	2052680238	pcpu_page_alloc: guard against empty NUMA domains Some systems, such as higher end Threadripper, may have NUMA domains with no physical memory. Don't allocate from these domains. This fixes a "panic: vm_wait in early boot" on my 2990WX desktop. Reviewed by: jeff Sponsored by: Netflix	2020-01-18 18:25:37 +00:00
Jeff Roberson	5844774900	Fix a long standing bug that was made worse in r355765. When we are cowing a page that was previously mapped read-only it exists in pmap until pmap_enter() returns. However, we held no reference to the original page after the copy was complete. This allowed vm_object_scan_all_shadowed() to collapse an object that still had pages mapped. To resolve this, add another page pointer to the faultstate so we can keep the page xbusy until we're done with pmap_enter(). Handle busy pages in scan_all_shadowed. This is already done in vm_object_collapse_scan(). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23155	2020-01-17 03:44:04 +00:00
Jeff Roberson	a81c400e75	Simplify VM and UMA startup by eliminating boot pages. Instead use careful ordering to allocate early pages in the same way boot pages were but only as needed. After the KVA allocator has started up we allocate the KVA that we consumed during boot. This also makes the boot pages freeable since they have vm_page structures allocated with the rest of memory. Parts of this patch were written and tested by markj. Reviewed by: glebius, markj Differential Revision: https://reviews.freebsd.org/D23102	2020-01-16 05:01:21 +00:00
Alexander Motin	ace409ce9c	Restore loop break in vm_pageout_lowmem(). r355004 removed return statement from this loop with intention to also call uma_reclaim_wakeup(). But in case of vm.lowmem_period=0 it causes infinite loop. Reviewed by: markj Sponsored by: iXsystems, Inc.	2020-01-14 03:27:57 +00:00
Ryan Libby	9b8db4d0a0	uma: split slabzone into two sizes By allowing more items per slab, we can improve memory efficiency for small allocs. If we were just to increase the bitmap size of the slabzone, we would then waste slabzone memory. So, split slabzone into two zones, one especially for 8-byte allocs (512 per slab). The practical effect should be reduced memory usage for counter(9). Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23149	2020-01-14 02:14:15 +00:00
Ryan Libby	e63a1c2f52	uma: fixup some ktr messages Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23148	2020-01-14 02:13:46 +00:00
Mateusz Guzik	a314aba874	vm: add missing CTLFLAG_MPSAFE annotations This covers all vm/* files.	2020-01-12 05:08:57 +00:00
Gleb Smirnoff	9328cbc047	Always multiply vm.pgcache_zone_max by the number of CPUs, and rename it accordingly. The tunable controls the size of the per-cpu vm page cache. Previously the value was split among all CPUs in the system, so configuring the same value on machines with different CPU counts yielded a different cache size available to a particular CPU. Reviewed by: markj Obtained from: Netflix	2020-01-10 19:32:08 +00:00
Mark Johnston	860bb7a04c	UMA: Don't destroy zones after the system shutdown process starts. Some kernel subsystems, notably ZFS, will destroy UMA zones from a shutdown eventhandler. This causes the zone to be drained. For slabs that are mapped into KVA this can be very expensive and so it needlessly delays the shutdown process. Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy() into a no-op, provided that the zone does not have a custom finalization routine. PR: 242427 Reviewed by: jeff, kib, rlibby MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23066	2020-01-09 19:17:42 +00:00
Ryan Libby	4a8b575c6b	uma: unify layout paths and improve efficiency Unify the keg layout selection paths (keg_small_init, keg_large_init, keg_cachespread_init), and slightly improve memory efficiency by: - using the padding of the final item to store the slab header, - not going OFFPAGE if we have a choice unless it improves efficiency. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23048	2020-01-09 02:03:17 +00:00
Ryan Libby	54c5ae804f	uma: reorganize flags - Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC. - Move flag VTOSLAB from public to private. - Introduce public NOTPAGE flag and make HASH private. - Introduce public NOTOUCH flag and make OFFPAGE private. - Update man page. The net effect of this should be to make the contract with clients more clear. Clients should choose constraints, UMA will figure out how to implement them. This also breaks the confusing double meaning of OFFPAGE. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23016	2020-01-09 02:03:03 +00:00
Jeff Roberson	79c9f9429a	Fix uma boot pages calculations on NUMA machines that also don't have UMA_MD_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment of zones while here. This was already correct because uz_cpu strongly aligned the zone structure but the specified alignment did not match reality and involved redundant defines. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23046	2020-01-06 02:51:19 +00:00
Jeff Roberson	bfb6b7a121	The fix in r356353 was insufficient. Not every architecture returns 0 for EARLY_COUNTER. Only amd64 seems to. Suggested by: markj Reported by: lwhsu Reviewed by: markj PR: 243117	2020-01-05 22:54:25 +00:00
Kyle Evans	2180f6c6f1	kern_mmap: restore character deleted in transit Pointy hat to: kevans X-MFC-With: r356359	2020-01-04 23:51:44 +00:00
Kyle Evans	18348a2369	kern_mmap: add a variant that allows caller to inspect fp Linux mmap rejects mmap() on a write-only file with EACCES. linux_mmap_common currently does a fun dance to grab the fp associated with the passed in fd, validates it, then drops the reference and calls into kern_mmap(). Doing so is perhaps both fragile and premature; there's still plenty of chance for the request to get rejected with a more appropriate error, and it's prone to a race where the file we ultimately mmap has changed after it drops its reference. This change alleviates the need to do this by providing a kern_mmap variant that allows the caller to inspect the fp just before calling into the fileop layer. The callback takes flags, prot, and maxprot as one could imagine scenarios where any of these, in conjunction with the file itself, may influence a caller's decision. The file type check in the linux compat layer has been removed; EINVAL is seemingly not an appropriate response to the file not being a vnode or device. The fileop layer will reject the operation with ENODEV if it's not supported, which more closely matches the common linux description of mmap(2) return values. If we discover that we're allowing an mmap() on a file type that Linux normally wouldn't, we should restrict those explicitly. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22977	2020-01-04 23:39:58 +00:00
Jeff Roberson	31c251a046	Fix an assertion introduced in r356348. On architectures without UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that violated the new assert. Resolve this by rewriting the COLD asserts to look at the per-cpu allocation counts for evidence of api activity. Discussed with: rlibby Reviewed by: markj Reported by: lwhsu	2020-01-04 19:29:25 +00:00
Jeff Roberson	dfe13344f5	UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and UMA_ZONE_ROUNDROBIN. The system will now pick a default depending on kernel configuration. API users need only specify one if they want to override the default. Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA is. Reviewed by: markj Discussed with: rlibby Differential Revision: https://reviews.freebsd.org/D22831	2020-01-04 18:48:13 +00:00
Jeff Roberson	91d947bfbe	Sort cross-domain frees into per-domain buckets before inserting these onto their respective bucket lists. This is a several orders of magnitude improvement in contention on the keg lock under heavy free traffic while requiring only an additional bucket per-domain worth of memory. Discussed with: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22830	2020-01-04 07:56:28 +00:00
Jeff Roberson	8b987a7769	Use per-domain keg locks. This provides both a lock and separate space accounting for each NUMA domain. Independent keg domain locks are important with cross-domain frees. Hashed zones are non-numa and use a single keg lock to protect the hash table. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22829	2020-01-04 03:30:08 +00:00
Jeff Roberson	727c691857	Use a separate lock for the zone and keg. This provides concurrency between populating buckets from the slab layer and fetching full buckets from the zone layer. Eliminate some nonsense locking patterns where we lock to fetch a single variable. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22828	2020-01-04 03:15:34 +00:00
Jeff Roberson	4bd61e19a2	Use atomics for the zone limit and sleeper count. This relies on the sleepq to serialize sleepers. This patch retains the existing sleep/wakeup paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup in one case but otherwise should be bug for bug compatible. In particular, there are still various races surrounding adjusting the limit via sysctl that are now documented. Discussed with: markj Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22827	2020-01-04 03:04:46 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Mark Johnston	f7607c300b	Clear queue operation flags when migrating a page to another queue. The page daemon loops may move pages back to the active queue if references are detected. In this case we must take care to clear existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be set, and that flag is only valid if the page belongs to the inactive queue. Also fix a bug in the active queue scan where we were updating "old" instead of "new". This would only have been hit in rare cases where the page moved out of the active queue after the beginning of the scan. Reported by: Bob Prohaska, Idwer Vollering Tested by: Idwer Vollering Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D23001	2020-01-02 19:26:04 +00:00
Doug Moore	668a8aa83b	The map-entry clipping functions modify start and end entries of an entry in the vm_map, making invariants related to the max_free entry field invalid. Move the clipping work into vm_map_entry_link, so that linking is okay when the new entry clips a current entry, and the vm_map doesn't have to be briefly corrupted. Change assertions and conditions in SPLAY_{LEFT,RIGHT}_STEP since the max_free invariants can now be trusted in all cases. Tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D22897	2019-12-31 22:20:54 +00:00
Mark Johnston	758b2c02bb	Restore a vm_page_wired() check in vm_page_mvqueue() after r356156. We now set PGA_DEQUEUE on a managed page when it is wired after allocation, and vm_page_mvqueue() ignores pages with this flag set, ensuring that they do not end up in the page queues. However, this is not sufficient for managed fictitious pages or pages managed by the TTM. In particular, the TTM makes use of the plinks.q queue linkage fields for its own purposes. PR: 242961 Reported and tested by: Greg V <greg@unrelenting.technology>	2019-12-29 20:01:03 +00:00
Mark Johnston	9b888dd9bd	Clear queue op flags in vm_page_mvqueue(). This fixes a regression in r356155, introduced at the last minute. In particular, we must clear PGA_REQUEUE_HEAD before inserting into any queue besides PQ_INACTIVE since that operation is implemented only for PQ_INACTIVE. Reported by: pho, Jenkins via lwhsu	2019-12-29 15:39:43 +00:00
Mark Johnston	727150ff03	Remove some unused functions. The previous series of patches orphaned some vm_page functions, so remove them. Reviewed by: dougm, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22886	2019-12-28 19:04:29 +00:00
Mark Johnston	dc71caa037	Update the vm_page.h block comment to reflect recent changes. Explain the new locking rules for per-page queue state updates. Reviewed by: jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22884	2019-12-28 19:04:15 +00:00
Mark Johnston	9f5632e6c8	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885	2019-12-28 19:04:00 +00:00
Mark Johnston	b7f30bff2f	Generalize lazy dequeue logic for wired pages. Some recent work aims to remove the use of the page lock for synchronizing updates to page queue state. This change adds a mechanism to preserve the existing behaviour of lazily dequeuing wired pages, which was previously synchronized using the page lock. Handle this by setting PGA_DEQUEUE when a managed page's wire count transitions from 0 to 1. When the page daemon encounters a page with a flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that page, but in so doing it does not modify the page itself and thus racing with a concurrent free of the page is harmless. The flag is advisory; the page daemon still checks for wirings after acquiring the object and page xbusy locks. vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition. It must do this before dropping the reference to avoid a use-after-free but also handles races with concurrent wirings to ensure that PGA_DEQUEUE is not left unset on a wired page. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22882	2019-12-28 19:03:46 +00:00
Mark Johnston	f3f38e2580	Start implementing queue state updates using fcmpset loops. This is in preparation for eliminating the use of the vm_page lock for protecting queue state operations. Introduce the vm_page_pqstate_commit_*() functions. These functions act as helpers around vm_page_astate_fcmpset() and are specialized for specific types of operations. vm_page_pqstate_commit() wraps these functions. Convert a number of routines to use these new helpers. Use vm_page_release_toq() in vm_page_unwire() and vm_page_release() to atomically release a wiring reference and release the page into a queue. This has the side effect that vm_page_unwire() will leave the page in the active queue if it is already present there. Convert the page queue scans to use the new helpers. Simplify vm_pageout_reinsert_inactive(), which requeues pages that were found to be busy during an inactive queue scan, to avoid duplicating the work of vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or PGA_REQUEUE_HEAD is set, let that be handled during batch processing. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22770 Differential Revision: https://reviews.freebsd.org/D22771 Differential Revision: https://reviews.freebsd.org/D22772 Differential Revision: https://reviews.freebsd.org/D22773 Differential Revision: https://reviews.freebsd.org/D22776	2019-12-28 19:03:32 +00:00
Mark Johnston	3c01c56b0e	Don't update per-page activation counts in the swapout code. This avoids duplicating the work of the page daemon's active queue scan. Moreover, this duplication was inconsistent: - PGA_REFERENCED is not counted in act_count unless pmap_ts_referenced() returned 0, but the page daemon always counts PGA_REFERENCED towards the activation count. - The swapout daemon always activates a referenced page, but the page daemon only does so when the containing object is mapped at least once. The main purpose of swapout_deactivate_pages() is to shrink the number of pages mapped into a given pmap. To do this without unmapping active pages, use the non-destructive pmap_is_referenced() instead of the destructive pmap_ts_referenced() and deactivate pages accordingly. This simplifies some future changes to the locking protocol for page queue state. Reviewed by: kib Discussed with: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22674	2019-12-28 19:03:17 +00:00
Konstantin Belousov	df8db6ddb9	vm_object_shadow(): fix object reference leak. In r355270 by me, vm_object_shadow() was changed to handle the reference counting for the shared case, but the extra reference that was done in vmspace_fork() for the shared/need_copy case was not removed. Submitted by: jeff	2019-12-28 16:40:44 +00:00
Mark Johnston	5541eb27d6	Remove some stale comments from the page allocator. Since r352110 the page lock is not required to wire pages in any context.	2019-12-27 23:19:21 +00:00
Jeff Roberson	ff5ce8a7a5	Fix a pair of bugs introduced in r356002. When we reclaim physical pages we allocate them with VM_ALLOC_NOOBJ which means they are not busy. For now move the busy assert for the new page in vm_page_replace into the public api and out of the private api used by contig reclaim. Fix another issue where we would leak busy if the page could not be removed from pmap. Reported by: pho Discussed with: markj	2019-12-27 01:50:16 +00:00
Jeff Roberson	cc7ce83ae0	Further reduce the cacheline footprint of fast allocations by duplicating the zone size and flags fields in the per-cpu caches. This allows fast allocations to proceed only touching the single per-cpu cacheline and simplifies the common case when no ctor/dtor is specified. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22826	2019-12-25 20:57:24 +00:00
Jeff Roberson	376b1ba394	Optimize fast path allocations by storing bucket headers in the per-cpu cache area. This allows us to check on bucket space for all per-cpu buckets with a single cacheline access and fewer branches. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22825	2019-12-25 20:50:53 +00:00
Jeff Roberson	3639ac42e5	Fix a bug with _NUMA domains introduced in r339686. When M_NOWAIT is specified there was no loop termination condition in keg_fetch_slab(). Reported by: pho Reviewed by: markj	2019-12-25 19:26:35 +00:00
Jeff Roberson	7e1b379e1e	Don't unnecessarily relock the vm object after sleeps. This results in a surprising amount of object contention on loop restarts in fault. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22821	2019-12-24 18:38:06 +00:00
Doug Moore	b649c2ac34	Fix typo using RB_INITIALIZER. The macro RB_INITIALIZER ignores its argument, but is documented to require "&head" as argument to initialize "head". So using "_vm_phys_fictitious_tree" as the argument to initialize "vm_phys_fictitious_tree" is an inconsequential error, corrected here. Discussed with: alc	2019-12-22 21:53:05 +00:00
Jeff Roberson	419f0b1f95	Fix a bug introduced in r356002. Prior versions of this patchset had vm_page_remove() rather than !vm_page_wired() as the condition for free. When this changed back to wired the busy lock was leaked. Reported by: pho Reviewed by: markj	2019-12-22 20:35:50 +00:00
Jeff Roberson	3cf3b4e641	Make page busy state deterministic on free. Pages must be xbusy when removed from objects including calls to free. Pages must not be xbusy when freed and not on an object. Strengthen assertions to match these expectations. In practice very little code had to change busy handling to meet these rules but we can now make stronger guarantees to busy holders and avoid conditionally dropping busy in free. Refine vm_page_remove() and vm_page_replace() semantics now that we have stronger guarantees about busy state. This removes redundant and potentially problematic code that has proliferated. Discussed with: markj Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22822	2019-12-22 06:56:44 +00:00
Jeff Roberson	bef91632da	Move vm_fault busy logic into its own function for clarity and re-use by later changes. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22820	2019-12-22 04:21:16 +00:00
Mark Johnston	d07c571806	Fix VPO_UNMANAGED handling in vm_page_reclaim_run() after r353540. When allocating a replacement page we must clear VPO_UNMANAGED since we only ever reclaim pages from managed objects. vm_page_replace() does not handle this for us. Sprinkle some assertions to help catch this sort of issue. Reported by: pho Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22868	2019-12-21 19:04:05 +00:00
Mark Johnston	c2f22e9790	Fix the aflag shift on big-endian platforms after r355672. The structure offset is zero regardless of endianness. Reported by: brooks Pointy hat: markj	2019-12-18 01:56:38 +00:00
Jeff Roberson	61a74c5ccd	schedlock 1/4 Eliminate recursion from most thread_lock consumers. Return from sched_add() without the thread_lock held. This eliminates unnecessary atomics and lock word loads as well as reducing the hold time for scheduler locks. This will eventually allow for lockless remote adds. Discussed with: kib Reviewed by: jhb Tested by: pho Differential Revision: https://reviews.freebsd.org/D22626	2019-12-15 21:11:15 +00:00
Jeff Roberson	4bf95d00ce	Previously we did not support invalid pages in default objects. This means that if fault fails to progress and needs to restart the loop it must free the page it is working on and allocate again on restart. Resolve the few places that need to be modified to support this condition and simply deactivate the page. Presently, we only permit this when fault restarts for busy contention. This has an added benefit of removing some object trylocking in this case. While here consolidate some page cleanup logic into fault_page_free() and fault_page_release() to reduce redundant code and automate some teardown. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22653	2019-12-15 04:08:24 +00:00
Jeff Roberson	a808177864	Add a deferred free mechanism for freeing swap space that does not require an exclusive object lock. Previously swap space was freed on a best effort basis when a page that had valid swap was dirtied, thus invalidating the swap copy. This may be done inconsistently and requires the object lock which is not always convenient. Instead, track when swap space is present. The first dirty is responsible for deleting space or setting PGA_SWAP_FREE which will trigger background scans to free the swap space. Simplify the locking in vm_fault_dirty() now that we can reliably identify the first dirty. Discussed with: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22654	2019-12-15 03:15:06 +00:00
Jeff Roberson	d966c7615f	Slightly optimize locking in vm_map_copy_swap_entry(). Anonymous objects require the object lock to synchronize collapse. Other swap objects such as tmpfs do not. Reported by: mjg Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22747	2019-12-15 02:02:27 +00:00
Jeff Roberson	af00971419	Handle pagein clustering in vm_page_grab_valid() so that it can be used by exec_map_first_page(). This will also enable pagein clustering for other interested consumers (tmpfs, md, etc). Discussed with: alc Approved by: kib Differential Revision: https://reviews.freebsd.org/D22731	2019-12-15 02:00:32 +00:00

4290 Commits