freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	a7163bb962	Eliminate some vm object relocks in vm fault. For the vm_fault_prefault() call from vm_fault_soft_fast(), extend the scope of the object rlock to avoid re-taking it inside vm_fault_prefault(). It causes pmap_enter_quick() sometimes called with shadow object lock as well as the page lock, but this looks innocent. Noted and measured by: mjg Reviewed by: alc, markj (as part of the larger patch) Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15122	2018-04-29 12:43:08 +00:00
Mateusz Guzik	e825ab8d89	uma: whack main zone counter update in the slow path Cached counters are typically zero at this point so it performs avoidable atomics. Everything reading them also reads the cached ones, thus there is really no point. Reviewed by: jeff	2018-04-27 05:37:35 +00:00
Mateusz Guzik	23e17f83f1	vm: move vm_cnt to __read_mostly now that it is not written to While here whack unused locking keys for the struct. Discussed with: jeff	2018-04-27 05:36:02 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Mark Johnston	7e28037a09	Add a UMA zone flag to disable the use of buckets. This allows the creation of zones which don't do any caching in front of the keg. If the zone is a cache zone, this means that UMA will not attempt any memory allocations when allocating an item from the backend. This is intended for use after a panic by netdump, but likely has other applications. Reviewed by: kib MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15184	2018-04-24 20:05:45 +00:00
Mark Johnston	64b3893010	Initialize marker pages in vm_page_domain_init(). They were previously initialized by the corresponding page daemon threads, but for vmd_inacthead this may be too late if vm_page_deactivate_noreuse() is called during boot. Reported and tested by: cperciva Reviewed by: alc, kib MFC after: 1 week	2018-04-19 14:09:44 +00:00
Mark Johnston	9de8fcfddf	Ensure that m and skip_m belong to the same object. Pages allocated from a given reservation may belong to different objects. It is therefore possible for vm_page_ps_test() to be called with the base page's object unlocked. Check for this case before asserting that the object lock is held. Reported by: jhb Reviewed by: kib MFC after: 1 week	2018-04-17 18:49:17 +00:00
Konstantin Belousov	e55d32b7b3	Handle Skylake-X errata SKZ63. SKZ63 Processor May Hang When Executing Code In an HLE Transaction Region Problem: Under certain conditions, if the processor acquires an HLE (Hardware Lock Elision) lock via the XACQUIRE instruction in the Host Physical Address range between 40000000H and 403FFFFFH, it may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS. Move the pages from the range into the blacklist. Add a tunable to not waste 4M if local DoS is not the issue. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15001	2018-04-07 17:06:13 +00:00
Brooks Davis	6469bdcdb6	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
Mark Johnston	c098768e4d	Ensure the background laundering threshold is positive after a scan. The division added in r331732 meant that we wouldn't attempt a background laundering until at least v_free_target - v_free_min clean pages had been freed by the page daemon since the last laundering. If the inactive queue is depleted but not completely empty (e.g., because it contains busy pages), it can thus take a long time to meet this threshold. Restore the pre-r331732 behaviour of using a non-zero background laundering threshold if at least one inactive queue scan has elapsed since the last attempt at background laundering. Submitted by: tijl (original version)	2018-04-02 15:07:41 +00:00
Gleb Smirnoff	b92b26ad08	Use UMA_SLAB_SPACE macro. No functional change here.	2018-04-02 05:15:25 +00:00
Gleb Smirnoff	96a10340ce	In uma_startup_count() handle special case when zone will fit into single slab, but with alignment adjustment it won't. Again, when there is only one item in a slab alignment can be ignored. See previous revision of this file for more info. PR: 227116	2018-04-02 05:14:31 +00:00
Gleb Smirnoff	1ca6ed4589	Handle a special case when a slab can fit only one allocation, and zone has a large alignment. With alignment taken into account uk_rsize will be greater than space in a slab. However, since we have only one item per slab, it is always naturally aligned. Code that will panic before this change with 4k page: z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0); uma_zalloc(z, M_WAITOK); A practical scenario to hit the panic is a machine with 56 CPUs and 2 NUMA domains, which yields in zone size of 3984. PR: 227116 MFC after: 2 weeks	2018-04-02 05:11:59 +00:00
Jeff Roberson	c33e3a642b	Add a uma cache of free pages in the DEFAULT freepool. This gives us per-cpu alloc and free of pages. The cache is filled with as few trips to the phys allocator as possible by the use of a new vm_phys_alloc_npages() function which allocates as many as N pages. This code was originally by markj with the import function rewritten by me. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14905	2018-04-01 04:50:05 +00:00
Jeff Roberson	e8bb2dc7c9	Add the flag ZONE_NOBUCKETCACHE. This flag instructs UMA not to keep a cache of fully populated buckets. This will be used in a follow-on commit. The flag idea was originally from markj. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-04-01 04:47:05 +00:00
Konstantin Belousov	19ea042eb8	Make vm_map_max/min/pmap KBI stable. There are out of tree consumers of vm_map_min() and vm_map_max(), and I believe there are consumers of vm_map_pmap(), although the latter is arguably less in need of a KBI-stable interface. For the consumers' benefit, make modules using this KPI not dependent on the struct vm_map layout. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14902	2018-03-30 10:55:31 +00:00
Mark Johnston	6068486258	Fix the background laundering mechanism after r329882. Rather than using the number of inactive queue scans as a metric for how many clean pages are being freed by the page daemon, have the page daemon keep a running counter of the number of pages it has freed, and have the laundry thread use that when computing the background laundering threshold. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14884	2018-03-29 14:27:40 +00:00
Jeff Roberson	e5818a53db	Implement several enhancements to NUMA policies. Add a new "interleave" allocation policy which stripes pages across domains with a stride or width keeping contiguity within a multi-page region. Move the kernel to the dedicated numbered cpuset #2 making it possible to assign kernel threads and memory policy separately from user. This also eliminates the need for the complicated interrupt binding code. Add a sysctl API for viewing and manipulating domainsets. Refactor some of the cpuset_t manipulation code using the generic bitset type so that it can be used for both. This probably belongs in a dedicated subr file. Attempt to improve the include situation. Reviewed by: kib Discussed with: jhb (cpuset parts) Tested by: pho (before review feedback) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14839	2018-03-29 02:54:50 +00:00
Jeff Roberson	146bf2c66d	Move vm_ndomains to vm.h where it can be used with a single header include rather than requiring a half-dozen. Many non-vm files may want to know the number of valid domains. Sponsored by: Netflix, Dell/EMC Isilon	2018-03-27 03:27:02 +00:00
Konstantin Belousov	8ec533d336	Allow to specify for vm_fault_quick_hold_pages() that nofault mode should be honored. We must not sleep or acquire any MI VM locks if TDP_NOFAULTING is specified. On the other hand, there were some callers in the tree which set TDP_NOFAULTING for larger scope than needed, I fixed the code which I wrote, but I suspect that linuxkpi and out of tree drm drivers might abuse this still. So only enable the mode for vm_fault_quick_hold_pages() where vm_fault_hold() is not called when specifically asked by user. I decided to use vm_prot_t flag to not change KPI. Since number of flags in vm_prot_t is limited, I reused the same flag which was already consumed for vm_map_lookup(). Reported and tested by: pho (as part of the larger patch) Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14825	2018-03-26 16:31:12 +00:00
Konstantin Belousov	ed9e8bc468	Account the size of the vslock-ed memory by the thread. Assert that all such memory is unwired on return to usermode. The count of the wired memory will be used to detect the copyout mode. Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:51:27 +00:00
Konstantin Belousov	63b5d112b6	For vm_zone_stats() sysctl handler, do not drain sbuf calling copyout(9) while owning zone lock. Despite old value sysctl buffer is wired, spurious faults might still occur. Note that we still own the uma_rwlock there, but this lock does not participate in sensitive lock orders. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:48:53 +00:00
Jeff Roberson	2d3f4181de	Fix two compilation problems on non-amd64 architectures.	2018-03-23 18:24:02 +00:00
Mark Johnston	4046851367	Correct a couple of assertion messages in vm_page_reclaim_run(). MFC after: 3 days	2018-03-23 14:38:56 +00:00
Cy Schubert	72346b2232	Fix build on i386 without INVARIANTS following r331369. --- vm_reserv.o --- In file included from /opt/src/svn-current/sys/vm/vm_reserv.c:48: In file included from /opt/src/svn-current/sys/sys/counter.h:37: ./machine/counter.h:174:3: error: implicit declaration of function 'critical_enter' is invalid in C99 [-Werror,-Wimplicit-function-declaration] critical_enter(); Reviewed by: jeff@	2018-03-23 03:22:30 +00:00
Jeff Roberson	5c930c894d	Lock reservations with a dedicated lock in each reservation. Protect the vmd_free_count with atomics. This allows us to allocate and free from reservations without the free lock except where a superpage is allocated from the physical layer, which is roughly 1/512 of the operations on amd64. Use the counter api to eliminate cache contention on counters. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14707	2018-03-22 19:21:11 +00:00
Jeff Roberson	9a4b4cd3bc	Start witness much earlier in boot so that we can shrink the pend list and make it more immune to further change. Reviewed by: markj, imp (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon	2018-03-22 19:11:43 +00:00
Jeff Roberson	cdfeced8ff	Use read_mostly and alignment tags to eliminate or limit false sharing. Reviewed by: markj (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon	2018-03-22 19:06:50 +00:00
Konstantin Belousov	79e9552ebb	Check for wrap-around in vm_phys_alloc_seg_contig(). It is possible to provide insane values for size in contigmalloc(9) request, which usually not reaches the phys allocator due to failing KVA allocation. But with the forthcoming 4/4 i386, where 32bit architecture has almost 4G KVA, contigmalloc(1G) is not unreasonable outright and KVA might be available sometimes. Then, the calculation of pa_end could wrap around, depending on the physical address, and the checks in vm_phys_alloc_seg_contig() would pass while the iteration in the loop after the 'done' label goes out of the vm_page_array bounds. Fix it by detecting the wrap. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14767	2018-03-20 16:17:55 +00:00
Mark Johnston	c6a70eaea8	Avoid dequeuing the fault page during a soft fault. Such pages are re-enqueued at the end of the fault handler, preserving LRU. Rather than performing two separate operations per fault, simply requeue the page at the end of the fault (or bump its activation count if it resides in PQ_ACTIVE, avoiding the page queue lock entirely). This elides some page lock and page queue lock operations in common cases, e.g., CoW faults. Note that we must still dequeue the source page for "optimized" CoW faults since the page may not remain enqueued while it is moved to another object. Reviewed by: alc, kib Tested by: pho MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:49:30 +00:00
Mark Johnston	0eb50f9cd2	Have vm_page_{deactivate,launder}() requeue already-queued pages. In many cases the page is not enqueued so the change will have no effect. However, the change is needed to support an optimization in the fault handler and in some cases (sendfile, the buffer cache) it was being emulated by the caller anyway. Reviewed by: alc Tested by: pho MFC after: 2 weeks X-Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:40:56 +00:00
Mark Johnston	434862acb1	Have vm_page_replace() assert that the new page is not enqueued. The new page does not belong to a VM object, but the page daemon does not expect to encounter such pages. Reviewed by: alc, kib Tested by: pho MFC after: 1 week X-Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:35:40 +00:00
Conrad Meyer	5d3b36666b	Fix GCC build: Remove redundant pagedaemon_wakeup declaration Introduced in r331018. Reported by: kevans Sponsored by: Dell EMC Isilon	2018-03-16 07:05:09 +00:00
Jeff Roberson	30fbfdda6c	Eliminate pageout wakeup races. Take another step towards lockless vmd_free_count manipulation. Reduce the scope of the free lock by using a pageout lock to synchronize sleep and wakeup. Only trigger the pageout daemon on transitions between states. Drive all wakeup operations directly as side-effects from freeing memory rather than requiring an additional function call. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14612	2018-03-15 19:23:07 +00:00
Konstantin Belousov	741e1c9196	Revert the chunk from r330410 in vm_page_reclaim_run(). There, the pages freed might be managed but the page's lock is not owned. For KPI correctness, the page lock is required around the call to vm_page_free_prep(), which is asserted. Reclaim loop already did the work which could be done by vm_page_free_prep(), so the lock is not needed and the only consequence of not owning it is the assert trigger. Instead of adding the locking to satisfy the assert, revert to the code that calls vm_page_free_phys() directly. Reported by: pho Discussed with: jeff Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-13 18:27:23 +00:00
Jeff Roberson	f4af595964	Don't assert that the domain free lock is held until we're certain that there is a valid reservation. This can trip erroneously when memory falls within a domain but doesn't have the reservation initialized because it does not meet size or alignment requirements. Reported by: pho, mjg Sponsored by: Netflix, Dell/EMC Isilon	2018-03-07 22:04:27 +00:00
Konstantin Belousov	2a8e8f7892	Remove redundant test from r330410. If the input slist is non-empty, counter cannot be zero after freeing. Noted by: mjg MFC after: 2 weeks	2018-03-04 21:15:31 +00:00
Konstantin Belousov	8c8ee2ee1c	Unify bulk free operations in several pmaps. Submitted by: Yoshihiro Ota Reviewed by: markj MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13485	2018-03-04 20:53:20 +00:00
Mark Johnston	3b8cf4acf0	Give the 0th domain's page daemon thread a consistent name. Page daemon threads for other domains show up in ps(1) output as "pagedaemon/domN", so let that be the case for domain 0 as well. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14518	2018-02-27 16:51:09 +00:00
Mark Johnston	59d3150b58	Restore the pre-r329882 inactive page shortage computation. With r329882, in the absence of a free page shortage we would only take len(PQ_INACTIVE)+len(PQ_LAUNDRY) into account when deciding whether to aggressively scan PQ_ACTIVE. Previously we would also include the number of free pages in this computation, ensuring that we wouldn't scan PQ_ACTIVE with plenty of free memory available. The change in behaviour was most noticeable immediately after booting, when PQ_INACTIVE and PQ_LAUNDRY are nearly empty. Reviewed by: jeff	2018-02-24 20:47:22 +00:00
Konstantin Belousov	cd84455f91	Hide all vm/vm_pageout.h content under #ifdef _KERNEL. There are no parts useful for usermode applications in vm/vm_pageout.h. Even for the specific applications like fstat and lsof. In my opinion, this protection is redundant and instead userspace should not include the header at all. Since there are apparently broken third party codebases, give them a bit of slack by providing transitional period. Reported by: julian Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-24 10:26:26 +00:00
Mark Johnston	5f70fb1425	Correct some comments after r328954. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14486	2018-02-23 23:27:53 +00:00
Mark Johnston	9140bff7ed	Remove a bogus assertion from vm_page_launder(). After r328977, a wired page m may have m->queue != PQ_NONE. Reviewed by: kib X-MFC with: r328977 Differential Revision: https://reviews.freebsd.org/D14485	2018-02-23 23:25:22 +00:00
Jeff Roberson	5f8cd1c0bf	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402	2018-02-23 22:51:51 +00:00
Konstantin Belousov	2c0f13aa59	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread's policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct function calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Sketched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384	2018-02-20 10:13:13 +00:00
Mark Johnston	3f060b60b1	Use the conventional name for an array of pages. No functional change intended. Discussed with: kib MFC after: 3 days	2018-02-16 15:38:22 +00:00
Konstantin Belousov	ada27a3bb8	Cleanup unused page argument for vm_reserv_break(). Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14364	2018-02-14 00:34:02 +00:00
Konstantin Belousov	d929ad7f91	Ensure memory consistency on COW. From the submitter description: The process is forked transitioning a map entry to COW Thread A writes to a page on the map entry, faults, updates the pmap to writable at a new phys addr, and starts TLB invalidations... Thread B acquires a lock, writes to a location on the new phys addr, and releases the lock Thread C acquires the lock, reads from the location on the old phys addr... Thread A ...continues the TLB invalidations which are completed Thread C ...reads from the location on the new phys addr, and releases the lock In this example Thread B and C [lock, use and unlock] properly and neither own the lock at the same time. Thread A was writing somewhere else on the page and so never had/needed the lock. Thread C sees a location that is only ever read|modified under a lock change beneath it while it is the lock owner. To fix this, perform the two-stage update of the copied PTE. First, the PTE is updated with the address of the new physical page with copied content, but in read-only mode. The pmap locking and the page busy state during PTE update and TLB invalidation IPIs ensure that any writer to the page cannot upgrade the PTE to the writable state until all CPUs updated their TLB to not cache old mapping. Then, after the busy state of the page is lifted, the faults for write can proceed and do not violate the consistency of the reads. The change is done in vm_fault because most architectures do need IPIs to invalidate remote TLBs. More, I think that hardware guarantees of atomicity of the remote TLB invalidation are not enough to prevent the inconsistent reads of non-atomic reads, like multi-word accesses protected by a lock. So instead of modifying each pmap invalidation code, I did it there. Discovered and analyzed by: Elliott.Rabe@dell.com Reviewed by: markj PR: 225584 (appeared to have the same cause) Tested by: Elliott.Rabe@dell.com, emaste, Mike Tancsa <mike@sentex.net>, truckman Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14347	2018-02-14 00:31:45 +00:00
Konstantin Belousov	607970bc8e	Do not call pmap_enter() with invalid protection mode. If the map entry relookup was performed due to the mapping changes, we need to ensure that there is still some access permission bit requested which is compatible with the current vm_map_entry mode. If not, restart the handler from scratch instead of trying to save the current progress. Also adjust fault_type to not include cleared permission bits. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14347	2018-02-14 00:25:18 +00:00
Konstantin Belousov	c4be9169c0	Do not leak rv->psind in some specific situations. Suppose that we have an object with a mapped superpage, and that all pages in the superpages are held (by some driver). Additionally, suppose that the object is terminated, e.g. because the only process mapping it is exiting. Then the reservation is broken, but the pages cannot be freed until later, when they are unheld. In this situation, the reservation code cannot clean psind, since no pages are freed, and the page is freed and then reused with invalid psind. Clean psind on vm_reserv_break() to avoid the situation. Reported and tested by: Slava Shwartsman Reviewed by: markj Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14335	2018-02-13 15:36:28 +00:00
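
The wrap-around problem described in commit 79e9552ebb (computing pa_end from a physical address plus a very large size) can be illustrated with a small userland sketch. This is not the code from vm_phys_alloc_seg_contig(); the type name paddr_t and the helper range_end_checked() are hypothetical and exist only for this example, which assumes a 64-bit unsigned address type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a physical address type; not the kernel's. */
typedef uint64_t paddr_t;

/*
 * Compute the exclusive end of the range [pa, pa + size) and store it
 * in *pa_end.  Unsigned addition wraps modulo 2^64, so a wrapped sum is
 * always smaller than the starting address.  Return false in that case
 * and leave *pa_end untouched, so the caller never iterates past the
 * end of the address space.
 */
static bool
range_end_checked(paddr_t pa, paddr_t size, paddr_t *pa_end)
{
	paddr_t end = pa + size;

	if (end < pa)
		return (false);	/* pa + size wrapped around */
	*pa_end = end;
	return (true);
}
```

A caller would reject the request when the helper returns false instead of using a wrapped end address as a loop bound.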

3823 Commits