freebsd-dev

Author	SHA1	Message	Date
Jeff Roberson	c49be4f1c6	Add unlocked grab* function variants that use lockless radix code to lookup pages. These variants will fall back to their locked counterparts if the page is not present. Discussed with: kib, markj Differential Revision: https://reviews.freebsd.org/D23449	2020-02-27 02:37:27 +00:00
Ed Maste	acb8858f05	Return ENOTSUP for mmap/mprotect if prot not subset of prot_max From POSIX, [ENOTSUP] The implementation does not support the combination of accesses requested in the prot argument. This fits the case that prot contains permissions which are not a subset of prot_max. Reviewed by: brooks, cem Relnotes: Yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23843	2020-02-26 20:03:43 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Doug Moore	36b01270d1	The last argument to swp_pager_getswapspace is always 1. Remove that argument. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23810	2020-02-24 04:01:09 +00:00
Mark Johnston	7ca5539285	Allow swap_pager_putpages() to allocate one block at a time. The minimum allocation size of 4 blocks is an old policy that came with the "new" swap pager in r42957. Since then the blist allocator has gotten better at reducing fragmentation; for example, with r349777 it can return a range that spans multiple leaves. When swap space is close to being exhausted, the minimum of 4 blocks most likely exacerbates memory pressure, so reduce it to 1. Reported by: alc Tested by: pho Reviewed by: alc, dougm, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23763	2020-02-23 17:59:51 +00:00
Ryan Libby	eaa17d4291	sys/vm: quiet -Wwrite-strings Discussed with: kib Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23796	2020-02-23 03:32:04 +00:00
Mark Johnston	0464f16e91	Constify uma_zcache_create() and uma_zsecond_create()'s "name" argument. It is already internally handled as a pointer to a const string, in particular by uma_zcreate(). Fix indentation while here. MFC after: 1 week	2020-02-22 17:44:28 +00:00
Kyle Evans	cef81f8f01	vm_radix: prefer __builtin_unreachable() to an unreachable panic() This provides the needed hint to GCC and offers an annotation for readers to observe that it's in fact impossible to hit this point. We'll get hit with a -Wswitch error if the enum applicable to the switch above were to get expanded without the new value(s) being handled.	2020-02-22 16:20:04 +00:00
Jeff Roberson	226dd6db47	Add an atomic-free tick moderated lazy update variant of SMR. This enables very cheap read sections with free-to-use latencies and memory overhead similar to epoch. On a recent AMD platform a read section cost 1ns vs 5ns for the default SMR. On Xeon the numbers should be more like 1 ns vs 11. The memory consumption should be proportional to the product of the free rate and 2*1/hz while normal SMR consumption is proportional to the product of free rate and maximum read section time. While here refactor the code to make future additions more straightforward. Name the overall technique Global Unbound Sequences (GUS) and adjust some comments accordingly. This helps distinguish discussions of the general technique (SMR) vs this specific implementation (GUS). Discussed with: rlibby, markj	2020-02-22 03:44:10 +00:00
Warner Losh	cafbf0c664	Don't convert all lower-layer errors to EIO. Don't convert all lower layer errors to EIO. Instead, pass the actual error up the stack. This will allow the upper layers that look for ENXIO to react properly to that signal from the lower layers and, for UFS, unmount the filesystem. Reviewed by: kib@ Differential Revision: https://reviews.freebsd.org/D23755	2020-02-20 01:33:01 +00:00
Warner Losh	65252dc903	Don't spam the console with an additional, and useless, error message. There's no need to spam the console with this error message. If there's an I/O error, the disk/cam driver will report it at the lower levels. If that's an actual problem, the upper layers will report that. Reviewed by: kib@ Differential Revision: https://reviews.freebsd.org/D23756	2020-02-20 00:34:46 +00:00
Jeff Roberson	4b3dac72b3	Silence a gcc warning about no return from a function that handles every possible enum in a switch statement. I verified that this emits nothing as expected on clang. radix relies on constant propagation to eliminate any branching from these access routines. Reported by: lwhsu/tinderbox	2020-02-19 22:34:22 +00:00
Jeff Roberson	1ddda2eb24	Use SMR to provide a safe unlocked lookup for vm_radix. The tree is kept correct for readers with store barriers and careful ordering. The existing object lock serializes writers. Consumers will be introduced in later commits. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23446	2020-02-19 19:58:31 +00:00
Jeff Roberson	c6fd3e23f7	Use per-domain locks for the bucket cache. This gives much better concurrency when there are a large number of cores per-domain and multiple domains. Avoid taking the lock entirely if it will not be productive. ROUNDROBIN domains will have mixed memory in each domain and will load balance to all domains. While here refactor the zone/domain separation and bucket limits to simplify callers. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23673	2020-02-19 18:48:46 +00:00
Jeff Roberson	e9ceb9dd11	Don't release xbusy on kmem pages. After lockless page lookup we will not be able to guarantee that they can be reacquired without blocking. Reviewed by: kib Discussed with: markj Differential Revision: https://reviews.freebsd.org/D23506	2020-02-19 09:10:11 +00:00
Jeff Roberson	6c5f36ff30	Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in virtual address or physical page allocation need to be marked with this flag. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D23712	2020-02-19 08:17:27 +00:00
Mark Johnston	34e2051faf	Remove swblk_t. It was used only to store the bounds of each swap device. However, since swblk_t is a signed 32-bit int and daddr_t is a signed 64-bit int, swp_pager_isondev() may return an invalid result if swap devices are repeatedly added and removed and sw_end for a device ends up becoming a negative number. Note that the removed comment about maximum swap size still applies. Reviewed by: jeff, kib Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23666	2020-02-17 15:11:07 +00:00
Mark Johnston	725b4ff001	Fix a swap block allocation race. putpages' allocation of swap blocks is done under the global sw_dev lock. Previously it would drop that lock before inserting the allocated blocks into the object's trie, creating a window in which swap blocks are allocated but are not visible to swapoff. This can cause swp_pager_strategy() to fail and panic the system. Fix the problem bluntly, by allocating swap blocks under the object lock. Reviewed by: jeff, kib Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23665	2020-02-17 15:10:41 +00:00
Mark Johnston	c90d075be4	Fix object locking races in swapoff(2). swap_pager_swapoff_object()'s goal is to allocate pages for all valid swap blocks belonging to the object, for which there is no resident page. If the page corresponding to a block is already resident and valid, the block can simply be discarded. The existing implementation tries to minimize the number of I/Os used. For each cluster of swap blocks, it finds maximal runs of valid swap blocks not resident in memory, and valid resident pages. During this processing, the object lock may be dropped in several places: when calling getpages, or when blocking on a busy page in vm_page_grab_pages(). While the lock is dropped, another thread may free swap blocks, causing getpages to page in stale data. Fix the problem following a suggestion from Jeff: use getpages' readahead capability to perform clustering rather than doing it ourselves. This simplifies the code a bit without reintroducing the old behaviour of performing one I/O per page. Reviewed by: jeff Reported by: dhw, gallatin Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23664	2020-02-17 15:09:40 +00:00
Jeff Roberson	ed581bf68f	Add a simple accessor that returns the bytes of memory consumed by a zone.	2020-02-17 01:59:55 +00:00
Jeff Roberson	f212367b42	Refactor _vm_page_busy_sleep to reduce the delta between the various sleep routines and introduce a variant that supports lockless sleep. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23612	2020-02-17 01:08:00 +00:00
Jeff Roberson	70260874ac	UMA has become more particular about zone types. Use the right allocator calls in uma_zwait().	2020-02-17 01:06:18 +00:00
Jeff Roberson	6d88d784f8	Slightly restructure uma_zalloc* to generate better code from clang and reduce duplication among zalloc functions. Reviewed by: markj Discussed with: mjg Differential Revision: https://reviews.freebsd.org/D23672	2020-02-16 01:07:19 +00:00
Mateusz Guzik	3379d2f926	vm: use new capsicum helpers	2020-02-15 01:29:07 +00:00
Mateusz Guzik	23ed568caa	vm: remove no longer needed atomic_load_ptr casts	2020-02-14 23:16:29 +00:00
Mark Johnston	06ef60525f	Fix handling of WAITFAIL in vm_page_grab() and vm_page_grab_pages(). After sleeping through a memory shortage, we must return NULL rather than retry. Discussed with: jeff Reported by: pho Sponsored by: The FreeBSD Foundation	2020-02-13 23:18:35 +00:00
Mark Johnston	cefc92e1a2	Update the zone-global count of cached items in bucket_cache_reclaim(). This was missed in r351673. The count is used to enforce cache limits, which are rarely used. Discussed with: jeff Sponsored by: The FreeBSD Foundation	2020-02-13 23:15:21 +00:00
Jeff Roberson	543117bed8	Fix a case where ub_seq would fail to be set if the cross bucket was flushed due to memory pressure. Reviewed by: markj Differential Revision: http://reviews.freebsd.org/D23614	2020-02-13 20:58:51 +00:00
Mateusz Guzik	3acb6572fc	Store offset into zpcpu allocations in the per-cpu area. This shortens zpcpu_get and allows more optimizations. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23570	2020-02-12 11:11:22 +00:00
Mark Johnston	4ab3aee8fb	Reduce lock hold time in keg_drain(). Maintain a count of free slabs in the per-domain keg structure and use that to clear the free slab list in constant time for most cases. This helps minimize lock contention induced by reclamation, in preparation for proactive trimming of excesses of free memory. Reviewed by: jeff, rlibby Tested by: pho Differential Revision: https://reviews.freebsd.org/D23532	2020-02-11 20:06:33 +00:00
Jonathan T. Looney	3c200db9d2	Modify the vm.panic_on_oom sysctl to take a count of events. Currently, the vm.panic_on_oom sysctl is a boolean which controls the behavior of the VM system when it encounters an out-of-memory situation. If set to 0, the VM system kills the largest process. If set to any other value, the VM system will initiate a panic. This change makes the sysctl a count of events. If set to 0, the VM system kills the largest process. If set to any other value, the VM system will kill the largest process until it has seen the specified number of out-of-memory events. Once it reaches the specified number of events, it will initiate a panic. This change is helpful in capturing cores when the system is in a perpetual cycle of out-of-memory events (as opposed to just hitting one or two sporadic out-of-memory events). Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23601	2020-02-10 18:06:38 +00:00
Ryan Libby	bae55c4aec	uma: remove UMA_ZFLAG_CACHEONLY flag UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but with a more confusing name. Remove the flag, make UMA_ZONE_VM an inherit flag, and replace all references. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23516	2020-02-06 08:32:25 +00:00
Ryan Libby	33e5a1ea3b	uma: multipage chicken switch Add a switch to allow disabling multipage slabs, in order to facilitate measuring memory usage and performance effects. The tunable vm.debug.uma_multipage_slabs defaults to 1 and can be set to 0 to disable. The name may change soon. Reviewed by: markj (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23487	2020-02-04 22:40:45 +00:00
Ryan Libby	27ca37acb7	uma: grow slabs to enforce minimum memory efficiency Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1 page size + epsilon). In order to achieve a minimum memory efficiency, select a slab size with a potentially larger number of pages if it yields a lower portion of waste. This may mean using page_alloc instead of uma_small_alloc, which could be more costly. Discussed with: jeff, mckusick Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23239	2020-02-04 22:40:34 +00:00
Ryan Libby	ec0d828071	uma: add UMA_ZONE_CONTIG, and a default contig_alloc For now, copy the mbuf allocator. Reviewed by: jeff, markj (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23237	2020-02-04 22:40:11 +00:00
Ryan Libby	5ba16cf3d7	uma: pcpu_page_free needs to startup_free pages from startup_alloc After r357392, it is apparent that we do have some early-boot PCPU zones. Make it so we can safely free pages from them if they are actually used during early boot. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23496	2020-02-04 22:39:58 +00:00
Jeff Roberson	ee9e43f8dd	Add an explicit busy state for free pages. This improves behavior with potential bugs that access freed pages as well as providing a path towards lockless page lookup. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23444	2020-02-04 20:33:01 +00:00
Jeff Roberson	e84130a0c0	Use literal bucket sizes for smaller buckets rather than the rounding system. Small bucket sizes already pack well even if they are an odd number of words. This prevents any potential new instances of the problem fixed in r357463 as well as making the system easier to understand. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23494	2020-02-04 20:28:06 +00:00
Konstantin Belousov	8d34a3bf7d	Enable vm_object_mightbedirty() and vm_object_page_clean() for swap objects backing tmpfs vnodes data. The clean scan is limited to only remove write permissions from the mapped pages of the objects. This fixes the issue that tmpfs vnode mtime is not updated from writes to the mmaped area after the initial page-in. Noted by: mjg Reviewed by: markj Discussed with: jeff Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23432	2020-02-04 19:03:37 +00:00
Jeff Roberson	dc3915c8c6	Use STAILQ instead of TAILQ for bucket lists. We only need FIFO behavior and this is more space efficient. Stop queueing recently used buckets to the head of the list. If the bucket goes to a different processor the cache coherency will be more expensive. We already try to encourage cache-hot behavior in the per-cpu layer. Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D23493	2020-02-04 02:41:24 +00:00
Mark Johnston	36cb95c736	Disable the smallest UMA bucket size on 32-bit platforms. With r357314, sizeof(struct uma_bucket) grew to 16 bytes on 32-bit platforms, so BUCKET_SIZE(4) is 0. This resulted in the creation of a bucket zone for buckets with zero capacity. A more general fix is planned, but for now this bandaid allows 32-bit platforms to boot again. PR: 243837 Discussed with: jeff Reported by: pho, Jenkins via lwhsu Tested by: pho Sponsored by: The FreeBSD Foundation	2020-02-03 19:29:02 +00:00
Warner Losh	58aa35d429	Remove sparc64 kernel support Remove all sparc64 specific files Remove all sparc64 ifdefs Remove indirect sparc64 ifdefs	2020-02-03 17:35:11 +00:00
Mateusz Guzik	f1fa1ba3d0	Fix up various vnode-related asserts which did not dump the used vnode	2020-02-03 14:25:32 +00:00
Jeff Roberson	f96d4157a7	Fix a bug in r356776 where the page allocator was not properly restored to the percpu page allocator after it had been temporarily overridden by startup_alloc. Reported by: pho, bdragon	2020-02-01 23:46:30 +00:00
Mark Johnston	f0a273c00f	Remove a couple of lingering usages of the page lock. Update vm_page_scan_contig() and vm_page_reclaim_run() to stop using vm_page_change_lock(). It has no use after r356157. Remove vm_page_change_lock() now that it has no users. Remove an unnecessary check for wirings in vm_page_scan_contig(), which was previously checking twice. The check is racy until vm_page_reclaim_run() ensures that the page is unmapped, so one check is sufficient. Reviewed by: jeff, kib (previous versions) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23279	2020-02-01 18:23:51 +00:00
Mateusz Guzik	643656cfaf	vfs: replace VOP_MARKATIME with VOP_MMAPPED The routine is only provided by ufs and is only used on mmap and exec. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23422	2020-02-01 06:46:55 +00:00
Jeff Roberson	9e47b34110	Fix LINT build with MEMGUARD.	2020-01-31 02:03:22 +00:00
Jeff Roberson	d4665eaa66	Implement a safe memory reclamation feature that is tightly coupled with UMA. This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is a unique algorithm. This has 3x the performance of epoch in a write heavy workload with less than half of the read side cost. The memory overhead is significantly lessened by limiting the free-to-use latency. A synthetic test uses 1/20th of the memory vs Epoch. There is significant further discussion in the comments and code review. This code should be considered experimental. I will write a man page after it has settled. After further validation the VM will begin using this feature to permit lockless page lookups. Both markj and cperciva tested on arm64 at large core counts to verify fences on weaker ordering architectures. I will commit a stress testing tool in a follow-up. Reviewed by: mmacy, markj, rlibby, hselasky Discussed with: sbahara Differential Revision: https://reviews.freebsd.org/D22586	2020-01-31 00:49:51 +00:00
Konstantin Belousov	b70f6e1513	Restore OOM logic on page fault after r357026. Right now OOM is initiated unconditionally on the page allocation failure, after the wait. Reported by: Mark Millard <marklmi@yahoo.com> Reviewed by: cy, markj Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D23409	2020-01-29 12:02:47 +00:00
Konstantin Belousov	cd0047f3a9	Handle a race of collapse with a retrying fault. Both vm_object_scan_all_shadowed() and vm_object_collapse_scan() might observe an invalid page left in the default backing object by the fault handler that retried. Check for the condition and refuse to collapse. Reported and tested by: pho Reviewed by: jeff Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D23331	2020-01-24 19:42:53 +00:00
Doug Moore	c7b23459b2	Most uses of vm_map_clip_start follow a call to vm_map_lookup. Define an inline function vm_map_lookup_clip_start that invokes them both and use it in places that invoke both. Drop a couple of local variables made unnecessary by this function. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D22987	2020-01-24 07:48:11 +00:00
Mark Johnston	e6bd3a812d	vm_map_submap(): Avoid unnecessary clipping. A submap can only be created from an entry spanning the entire request range. In particular, if vm_map_lookup_entry() returns false or the returned entry contains "end". Since the only use of submaps in FreeBSD is for the static pipe and execve argument KVA maps, this has no functional effect. Github PR: https://github.com/freebsd/freebsd/pull/420 Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> (original) Reviewed by: dougm, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23299	2020-01-23 16:45:10 +00:00
Jeff Roberson	fb4d37eac1	(fault 9/9) Move zero fill into a dedicated function to make the object lock state more clear. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23326	2020-01-23 05:23:37 +00:00
Jeff Roberson	be9d4fd6b4	(fault 8/9) Restructure some code to reduce duplication and simplify flow control. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23321	2020-01-23 05:22:02 +00:00
Jeff Roberson	df794f5caf	(fault 7/9) Move fault population and allocation into a dedicated function Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23320	2020-01-23 05:19:39 +00:00
Jeff Roberson	5909dafea9	(fault 6/9) Move getpages and associated logic into a dedicated function. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23311	2020-01-23 05:18:00 +00:00
Jeff Roberson	91eb2e908f	(fault 5/9) Move the backing_object traversal into a dedicated function. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23310	2020-01-23 05:14:41 +00:00
Jeff Roberson	5936b6a8f1	(fault 4/9) Move copy-on-write into a dedicated function. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23304	2020-01-23 05:11:01 +00:00
Jeff Roberson	fcb0475833	(fault 3/9) Move map relookup into a dedicated function. Add a new VM return code KERN_RESTART which means, deallocate and restart in fault. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23303	2020-01-23 05:07:01 +00:00
Jeff Roberson	c308a3a6c9	(fault 2/9) Move map lookup into a dedicated function. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23302	2020-01-23 05:05:39 +00:00
Jeff Roberson	2c2f4413cc	(fault 1/9) Move a handful of stack variables into the faultstate. This additionally fixes a potential bug/pessimization where we could fail to reload the original fault_type on restart. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23301	2020-01-23 05:03:34 +00:00
Ryan Libby	8d1c459ae5	uma: fix zone domain overlaying pcpu cache with disabled cpus UMA zone structures have two arrays at the end which are sized according to the machine: an array of CPU count length, and an array of NUMA domain count length. The CPU counting was wrong in the case where some CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the second array to be overlaid with the first. Reported by: olivier Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23318	2020-01-23 04:56:38 +00:00
Ryan Libby	7e2406774e	uma: report leaks more accurately Previously UMA had some false negatives in the leak report at keg destruction time, where it only reported leaks if there were free items in the slab layer (rather than allocated items), which notably would not be true for single-item slabs (large items). Now, report a leak if there are any allocated pages, and calculate and report the number of allocated items rather than free items. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23275	2020-01-23 04:56:34 +00:00
Jeff Roberson	91e31c3c08	Consistently use busy and vm_page_valid() rather than touching page bits directly. This improves API compliance, asserts, etc. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23283	2020-01-23 04:54:49 +00:00
Jeff Roberson	530cc6a25d	Some architectures with DMAP still consume boot kva. Simplify the test for claiming kva in uma_startup2() to handle this. Reported by: bdragon	2020-01-23 03:37:35 +00:00
Jeff Roberson	5949b1ca8c	Move readahead and dropbehind fault functionality into a helper routine for clarity. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23282	2020-01-21 00:12:57 +00:00
Jeff Roberson	1e40fe41c5	Reduce object locking in vm_fault. Once we have an exclusively busied page we no longer need an object lock. This reduces the longest hold times and eliminates some trylock code blocks. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23034	2020-01-20 22:49:52 +00:00
Jeff Roberson	d6e13f3b4d	Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033	2020-01-19 23:47:32 +00:00
Jeff Roberson	9c83ff2d86	It has not been possible to recursively terminate a vnode object for some time now. Eliminate the dead code that supports it. Approved by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:36:03 +00:00
Jeff Roberson	98087a066f	Make collapse synchronization more explicit and allow it to complete during paging. Shadow objects are marked with a COLLAPSING flag while they are collapsing with their backing object. This gives us an explicit test rather than overloading paging-in-progress. While split is on-going we mark an object with SPLIT. These two operations will modify the swap tree so they must be serialized and swap_pager_getpages() can now directly detect these conditions and page more conservatively. Callers to vm_object_collapse() now will reliably wait for a collapse to finish so that the backing chain is as short as possible before other decisions are made that may inflate the object chain. For example, split, coalesce, etc. It is now safe to run fault concurrently with collapse. It is safe to increase or decrease paging in progress with no lock so long as there is another valid ref on increase. This change makes collapse more reliable as a secondary benefit. The primary benefit is making it safe to drop the object lock much earlier in fault or never acquire it at all. This was tested with a new shadow chain test script that uncovered long standing bugs and will be integrated with stress2. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:30:23 +00:00
Andrew Gallatin	2052680238	pcpu_page_alloc: guard against empty NUMA domains Some systems, such as higher end Threadripper, may have NUMA domains with no physical memory. Don't allocate from these domains. This fixes a "panic: vm_wait in early boot" on my 2990WX desktop Reviewed by: jeff Sponsored by: Netflix	2020-01-18 18:25:37 +00:00
Jeff Roberson	5844774900	Fix a long standing bug that was made worse in r355765. When we are cowing a page that was previously mapped read-only it exists in pmap until pmap_enter() returns. However, we held no reference to the original page after the copy was complete. This allowed vm_object_scan_all_shadowed() to collapse an object that still had pages mapped. To resolve this, add another page pointer to the faultstate so we can keep the page xbusy until we're done with pmap_enter(). Handle busy pages in scan_all_shadowed. This is already done in vm_object_collapse_scan(). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23155	2020-01-17 03:44:04 +00:00
Jeff Roberson	a81c400e75	Simplify VM and UMA startup by eliminating boot pages. Instead use careful ordering to allocate early pages in the same way boot pages were but only as needed. After the KVA allocator has started up we allocate the KVA that we consumed during boot. This also makes the boot pages freeable since they have vm_page structures allocated with the rest of memory. Parts of this patch were written and tested by markj. Reviewed by: glebius, markj Differential Revision: https://reviews.freebsd.org/D23102	2020-01-16 05:01:21 +00:00
Alexander Motin	ace409ce9c	Restore loop break in vm_pageout_lowmem(). r355004 removed return statement from this loop with intention to also call uma_reclaim_wakeup(). But in case of vm.lowmem_period=0 it causes infinite loop. Reviewed by: markj Sponsored by: iXsystems, Inc.	2020-01-14 03:27:57 +00:00
Ryan Libby	9b8db4d0a0	uma: split slabzone into two sizes By allowing more items per slab, we can improve memory efficiency for small allocs. If we were just to increase the bitmap size of the slabzone, we would then waste slabzone memory. So, split slabzone into two zones, one especially for 8-byte allocs (512 per slab). The practical effect should be reduced memory usage for counter(9). Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23149	2020-01-14 02:14:15 +00:00
Ryan Libby	e63a1c2f52	uma: fixup some ktr messages Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23148	2020-01-14 02:13:46 +00:00
Mateusz Guzik	a314aba874	vm: add missing CTLFLAG_MPSAFE annotations This covers all vm/* files.	2020-01-12 05:08:57 +00:00
Gleb Smirnoff	9328cbc047	Always multiply vm.pgcache_zone_max by the number of CPUs, and rename it accordingly. The tunable controls the size of the per-cpu vm page cache. Previously the value was split among all CPUs in the system, so configuring the same value on machines with different CPU counts yielded a different cache size available to a particular CPU. Reviewed by: markj Obtained from: Netflix	2020-01-10 19:32:08 +00:00
Mark Johnston	860bb7a04c	UMA: Don't destroy zones after the system shutdown process starts. Some kernel subsystems, notably ZFS, will destroy UMA zones from a shutdown eventhandler. This causes the zone to be drained. For slabs that are mapped into KVA this can be very expensive and so it needlessly delays the shutdown process. Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy() into a no-op, provided that the zone does not have a custom finalization routine. PR: 242427 Reviewed by: jeff, kib, rlibby MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23066	2020-01-09 19:17:42 +00:00
Ryan Libby	4a8b575c6b	uma: unify layout paths and improve efficiency Unify the keg layout selection paths (keg_small_init, keg_large_init, keg_cachespread_init), and slightly improve memory efficiency by: - using the padding of the final item to store the slab header, - not going OFFPAGE if we have a choice unless it improves efficiency. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23048	2020-01-09 02:03:17 +00:00
Ryan Libby	54c5ae804f	uma: reorganize flags - Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC. - Move flag VTOSLAB from public to private. - Introduce public NOTPAGE flag and make HASH private. - Introduce public NOTOUCH flag and make OFFPAGE private. - Update man page. The net effect of this should be to make the contract with clients more clear. Clients should choose constraints, UMA will figure out how to implement them. This also breaks the confusing double meaning of OFFPAGE. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23016	2020-01-09 02:03:03 +00:00
Jeff Roberson	79c9f9429a	Fix uma boot pages calculations on NUMA machines that also don't have MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment of zones while here. This was already correct because uz_cpu strongly aligned the zone structure but the specified alignment did not match reality and involved redundant defines. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23046	2020-01-06 02:51:19 +00:00
Jeff Roberson	bfb6b7a121	The fix in r356353 was insufficient. Not every architecture returns 0 for EARLY_COUNTER. Only amd64 seems to. Suggested by: markj Reported by: lwhsu Reviewed by: markj PR: 243117	2020-01-05 22:54:25 +00:00
Kyle Evans	2180f6c6f1	kern_mmap: restore character deleted in transit Pointy hat to: kevans X-MFC-With: r356359	2020-01-04 23:51:44 +00:00
Kyle Evans	18348a2369	kern_mmap: add a variant that allows caller to inspect fp Linux mmap rejects mmap() on a write-only file with EACCES. linux_mmap_common currently does a fun dance to grab the fp associated with the passed in fd, validates it, then drops the reference and calls into kern_mmap(). Doing so is perhaps both fragile and premature; there's still plenty of chance for the request to get rejected with a more appropriate error, and it's prone to a race where the file we ultimately mmap has changed after it drops its reference. This change alleviates the need to do this by providing a kern_mmap variant that allows the caller to inspect the fp just before calling into the fileop layer. The callback takes flags, prot, and maxprot as one could imagine scenarios where any of these, in conjunction with the file itself, may influence a caller's decision. The file type check in the linux compat layer has been removed; EINVAL is seemingly not an appropriate response to the file not being a vnode or device. The fileop layer will reject the operation with ENODEV if it's not supported, which more closely matches the common linux description of mmap(2) return values. If we discover that we're allowing an mmap() on a file type that Linux normally wouldn't, we should restrict those explicitly. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22977	2020-01-04 23:39:58 +00:00
Jeff Roberson	31c251a046	Fix an assertion introduced in r356348. On architectures without UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that violated the new assert. Resolve this by rewriting the COLD asserts to look at the per-cpu allocation counts for evidence of api activity. Discussed with: rlibby Reviewed by: markj Reported by: lwhsu	2020-01-04 19:29:25 +00:00
Jeff Roberson	dfe13344f5	UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and UMA_ZONE_ROUNDROBIN. The system will now select a default depending on kernel configuration. API users need only specify one if they want to override the default. Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA is. Reviewed by: markj Discussed with: rlibby Differential Revision: https://reviews.freebsd.org/D22831	2020-01-04 18:48:13 +00:00
Jeff Roberson	91d947bfbe	Sort cross-domain frees into per-domain buckets before inserting these onto their respective bucket lists. This improves contention on the keg lock by several orders of magnitude under heavy free traffic while requiring only an additional bucket per-domain worth of memory. Discussed with: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22830	2020-01-04 07:56:28 +00:00
Jeff Roberson	8b987a7769	Use per-domain keg locks. This provides both a lock and separate space accounting for each NUMA domain. Independent keg domain locks are important with cross-domain frees. Hashed zones are non-numa and use a single keg lock to protect the hash table. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22829	2020-01-04 03:30:08 +00:00
Jeff Roberson	727c691857	Use a separate lock for the zone and keg. This provides concurrency between populating buckets from the slab layer and fetching full buckets from the zone layer. Eliminate some nonsense locking patterns where we lock to fetch a single variable. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22828	2020-01-04 03:15:34 +00:00
Jeff Roberson	4bd61e19a2	Use atomics for the zone limit and sleeper count. This relies on the sleepq to serialize sleepers. This patch retains the existing sleep/wakeup paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup in one case but otherwise should be bug for bug compatible. In particular, there are still various races surrounding adjusting the limit via sysctl that are now documented. Discussed with: markj Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22827	2020-01-04 03:04:46 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Mark Johnston	f7607c300b	Clear queue operation flags when migrating a page to another queue. The page daemon loops may move pages back to the active queue if references are detected. In this case we must take care to clear existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be set, and that flag is only valid if the page belongs to the inactive queue. Also fix a bug in the active queue scan where we were updating "old" instead of "new". This would only have been hit in rare cases where the page moved out of the active queue after the beginning of the scan. Reported by: Bob Prohaska, Idwer Vollering Tested by: Idwer Vollering Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D23001	2020-01-02 19:26:04 +00:00
Doug Moore	668a8aa83b	The map-entry clipping functions modify start and end entries of an entry in the vm_map, making invariants related to the max_free entry field invalid. Move the clipping work into vm_map_entry_link, so that linking is okay when the new entry clips a current entry, and the vm_map doesn't have to be briefly corrupted. Change assertions and conditions in SPLAY_{LEFT,RIGHT}_STEP since the max_free invariants can now be trusted in all cases. Tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D22897	2019-12-31 22:20:54 +00:00
Mark Johnston	758b2c02bb	Restore a vm_page_wired() check in vm_page_mvqueue() after r356156. We now set PGA_DEQUEUE on a managed page when it is wired after allocation, and vm_page_mvqueue() ignores pages with this flag set, ensuring that they do not end up in the page queues. However, this is not sufficient for managed fictitious pages or pages managed by the TTM. In particular, the TTM makes use of the plinks.q queue linkage fields for its own purposes. PR: 242961 Reported and tested by: Greg V <greg@unrelenting.technology>	2019-12-29 20:01:03 +00:00
Mark Johnston	9b888dd9bd	Clear queue op flags in vm_page_mvqueue(). This fixes a regression in r356155, introduced at the last minute. In particular, we must clear PGA_REQUEUE_HEAD before inserting into any queue besides PQ_INACTIVE since that operation is implemented only for PQ_INACTIVE. Reported by: pho, Jenkins via lwhsu	2019-12-29 15:39:43 +00:00
Mark Johnston	727150ff03	Remove some unused functions. The previous series of patches orphaned some vm_page functions, so remove them. Reviewed by: dougm, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22886	2019-12-28 19:04:29 +00:00
Mark Johnston	dc71caa037	Update the vm_page.h block comment to reflect recent changes. Explain the new locking rules for per-page queue state updates. Reviewed by: jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22884	2019-12-28 19:04:15 +00:00
Mark Johnston	9f5632e6c8	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885	2019-12-28 19:04:00 +00:00
Mark Johnston	b7f30bff2f	Generalize lazy dequeue logic for wired pages. Some recent work aims to remove the use of the page lock for synchronizing updates to page queue state. This change adds a mechanism to preserve the existing behaviour of lazily dequeuing wired pages, which was previously synchronized using the page lock. Handle this by setting PGA_DEQUEUE when a managed page's wire count transitions from 0 to 1. When the page daemon encounters a page with a flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that page, but in so doing it does not modify the page itself and thus racing with a concurrent free of the page is harmless. The flag is advisory; the page daemon still checks for wirings after acquiring the object and page xbusy locks. vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition. It must do this before dropping the reference to avoid a use-after-free but also handles races with concurrent wirings to ensure that PGA_DEQUEUE is not left unset on a wired page. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22882	2019-12-28 19:03:46 +00:00

4410 Commits