freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	d307bdcc2c	Further constrain the use of per-CPU caches for free pages. In low memory conditions a significant number of pages may end up stuck in the caches, and currently these caches cannot be reaped, leading to spurious memory allocation failures and OOM kills. So: - Take into account the fact that we may cache up to two full buckets of pages per CPU, not just one. - Increase the amount of RAM required per CPU to enable the caches. This is a temporary measure until the page cache management policy is improved. PR: 241048 Reported and tested by: Kevin Oberman <rkoberman@gmail.com> Reviewed by: alc, kib Discussed with: jeff MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22040	2019-10-18 17:36:42 +00:00
Mark Johnston	01cef4caa7	Remove page locking from pmap_mincore(). After r352110 the page lock no longer protects a page's identity, so there is no purpose in locking the page in pmap_mincore(). Instead, if vm.mincore_mapped is set to the non-default value of 0, re-lookup the page after acquiring its object lock, which holds the page's identity stable. The change removes the last callers of vm_page_pa_tryrelock(), so remove it. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21823	2019-10-16 22:03:27 +00:00
Jeff Roberson	fff5403f84	(5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the object lock in vm_page_set_validclean(). Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21595	2019-10-15 03:48:22 +00:00
Jeff Roberson	0012f373e4	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594	2019-10-15 03:45:41 +00:00
Jeff Roberson	205be21d99	(3/6) Add a shared object busy synchronization mechanism that blocks new page busy acquires while held. This allows code that would need to acquire and release a very large number of page busy locks to use the old mechanism where busy is only checked and not held. This comes at the cost of false positives but never false negatives which the single consumer, vm_fault_soft_fast(), handles. Reviewed by: kib Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21592	2019-10-15 03:41:36 +00:00
Jeff Roberson	8da1c09853	(2/6) Don't release xbusy in vm_page_remove(), defer to vm_page_free_prep(). This persists busy state across operations like rename and replace. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21549	2019-10-15 03:38:02 +00:00
Jeff Roberson	63e9755548	(1/6) Replace busy checks with acquires where it is trival to do so. This is the first in a series of patches that promotes the page busy field to a first class lock that no longer requires the object lock for consistency. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21548	2019-10-15 03:35:11 +00:00
Leandro Lupori	0ecc478b74	[PPC64] Initial kernel minidump implementation Based on POWER9BSD implementation, with all POWER9 specific code removed and addition of new methods in PPC64 MMU interface, to isolate platform specific code. Currently, the new methods are implemented on pseries and PowerNV (D21643). Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D21551	2019-10-14 13:04:04 +00:00
Mark Johnston	4090e2170d	Assert that the PGA_{WRITEABLE,EXECUTABLE} flags do not leak. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21783	2019-10-07 23:31:17 +00:00
Mateusz Guzik	7b1fbc424a	vm: stop trylocking page queues in vm_page_pqbatch_submit About 11 minutes of poudriere -s -j 104 and probing on return value of trylocks reveals that over 10% of attempts fail, which in turn means there are more atomics performed than necessary. Trylocking was there to try preventing migration, but it's not very likely to happen if the lock is uncontested. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21925	2019-10-07 23:19:09 +00:00
Mark Johnston	7cc833c598	Fix a race in vm_page_swapqueue(). vm_page_swapqueue() atomically transitions a page between queues. To do so, it must hold the page queue lock for the old queue. However, once the queue index has been updated, the queue lock no longer protects the page's queue state. Thus, we must speculatively remove the page from the old queue before committing the queue state update, and roll back if the update fails. Reported and tested by: pho Reviewed by: kib Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21791	2019-09-27 16:46:08 +00:00
Mark Johnston	2b93f779d2	Add some counters for per-VM page events. For now, just count batched page queue state operations. vm.stats.page.queue_ops counts the number of batch entries that successfully completed, while queue_nops counts entries that had no effect, which occurs when the queue operation had been completed before the batch entry was processed. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21782	2019-09-25 17:08:35 +00:00
Mark Johnston	923da43e7c	Fix a race in vm_page_dequeue_deferred_free() after r352110. This function loaded the page's queue index before setting PGA_DEQUEUE. In this window the page daemon may have deactivated the page, updating its queue index. Make the operation atomic using vm_page_pqstate_cmpset(); the page daemon will not modify the page once it observes that PGA_DEQUEUE is set. Reported and tested by: pho Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639	2019-09-16 15:12:49 +00:00
Mark Johnston	47aef898ea	Fix a page leak in vm_page_reclaim_run(). After r352110 the attempt to remove mappings of the page being replaced may fail if the page is wired. In this case we must free the replacement page. Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639	2019-09-16 15:09:31 +00:00
Mark Johnston	e8bcf6966b	Revert r352406, which contained changes I didn't intend to commit.	2019-09-16 15:04:45 +00:00
Mark Johnston	41fd4b9422	Fix a couple of nits in r352110. - Remove a dead variable from the amd64 pmap_extract_and_hold(). - Fix grammar in the vm_page_wire man page. Reported by: alc Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639	2019-09-16 15:03:12 +00:00
Jeff Roberson	c75757481f	Replace redundant code with a few new vm_page_grab facilities: - VM_ALLOC_NOCREAT will grab without creating a page. - vm_page_grab_valid() will grab and page in if necessary. - vm_page_busy_acquire() automates some busy acquire loops. Discussed with: alc, kib, markj Tested by: pho (part of larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21546	2019-09-10 19:08:01 +00:00
Jeff Roberson	4cdea4a853	Use the sleepq lock rather than the page lock to protect against wakeup races with page busy state. The object lock is still used as an interlock to ensure that the identity stays valid. Most callers should use vm_page_sleep_if_busy() to handle the locking particulars. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21255	2019-09-10 18:27:45 +00:00
Mark Johnston	fee2a2fa39	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Mark Johnston	7cdeaf3309	Add preliminary support for atomic updates of per-page queue state. Queue operations on a page use the page lock when updating the page to reflect the desired queue state, and the page queue lock when physically enqueuing or dequeuing a page. Multiple pages share a given page lock, but queue state is per-page; this false sharing results in heavy lock contention. Take a small step towards the use of atomic_cmpset to synchronize updates to per-page queue state by introducing vm_page_pqstate_cmpset() and using it in the page daemon. In the longer term the plan is to stop using the page lock to protect page identity and rely only on the object and page busy locks. However, since the page daemon avoids acquiring the object lock except when necessary, some synchronization with a concurrent free of the page is required. vm_page_pqstate_cmpset() can be used to ensure that queue state updates are successful only if the page is not scheduled for a dequeue, which is sufficient for the page daemon. Add vm_page_swapqueue(), which moves a page from one queue to another using vm_page_pqstate_cmpset(). Use it in the active queue scan, which does not use the object lock. Modify vm_page_dequeue_deferred() to use vm_page_pqstate_cmpset() as well. Reviewed by: kib Discussed with: jeff Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21257	2019-09-03 14:29:58 +00:00
Mark Johnston	9d75f0dc75	Map the vm_page array into KVA on amd64. r351198 allows the kernel to use domain-local memory to back the vm_page array (up to 2MB boundaries) and reserves a separate PML4 entry for that purpose. One consequence of that change is that the vm_page array is no longer present in minidumps, which only adds pages mapped above VM_MIN_KERNEL_ADDRESS. To avoid the friction caused by having kernel data structures mapped below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry. Reviewed by: kib Discussed with: jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21491	2019-09-03 13:18:51 +00:00
Mark Johnston	a70e17eeca	Fix a few nits in vm_pqbatch_process_page(). - Don't bother masking off non-queue state flags when loading the page's atomic state, since it is only required for one of the function's assertions. Update the assertion instead. - Remove an incorrect comment regarding synchronization with the page daemon. The page daemon only ever checks for PGA_ENQUEUED with the page queue lock held. - When clearing requeue flags, only clear the flags that have been acted upon. Reviewed by: kib (previous version) Discussed with: alc Tested by: pho (part of a larger patch) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21368	2019-08-26 20:20:10 +00:00
Mark Johnston	f93670b7b9	Stop clearing page flags in vm_page_pqbatch_submit(). All existing callers guarantee that the page does not have a pre-existing dequeue pending. Thus, if the page is dequeued before pqbatch_submit() acquires the page queue lock, we do not need to do anything since vm_page_dequeue_complete() takes care of clearing all page queue state flags for us. With this change, vm_page_pqbatch_submit() has the nice property that it does not directly modify any fields in the page structure. Reviewed by: alc, kib Tested by: pho (part of a larger change) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21372	2019-08-23 19:53:11 +00:00
Mark Johnston	386eba08bd	Make vm_pqbatch_submit_page() externally visible. It will become useful for the page daemon to be able to directly create a batch queue entry for a page, and without modifying the page structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit() to keep the namespace consistent. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21369	2019-08-23 19:49:29 +00:00
Mark Johnston	8b90607f20	Simplify vm_page_dequeue() and fix an assertion. - Add a vm_pagequeue_remove() function to physically remove a page from its queue and update the queue length. - Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue() return NULL for dequeued pages. - Avoid unnecessarily reloading the queue index if vm_page_dequeue() loses a race with a concurrent queue operation. - Correct an always-true assertion: vm_page_dequeue() may be called from the page allocator with the page unlocked. The assertion m->order == VM_NFREEORDER simply tests whether the page has been removed from the vm_phys free lists; instead, check whether the page belongs to an object. Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21341	2019-08-21 16:11:12 +00:00
Jeff Roberson	3e5e1b5135	Allocate amd64's page array using pages and page directory pages from the NUMA domain that the pages describe. Patch original from gallatin. Reviewed by: kib Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21252	2019-08-18 23:07:56 +00:00
Jeff Roberson	b7565d44df	Encapsulate phys_avail manipulation in a set of simple routines. Add a NUMA aware boot time memory allocator that will be used to allocate early domain correct structures. Code partially submitted by gallatin. Reviewed by: gallatin, kib Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21251	2019-08-18 07:06:31 +00:00
Konstantin Belousov	245139c69d	Fix OOM handling of some corner cases. In addition to pagedaemon initiating OOM, also do it from the vm_fault() internals. Namely, if the thread waits for a free page to satisfy page fault some preconfigured amount of time, trigger OOM. These triggers are rate-limited, due to a usual case of several threads of the same multi-threaded process to enter fault handler simultaneously. The faults from pagedaemon threads participate in the calculation of OOM rate, but are not under the limit. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13671	2019-08-16 09:43:49 +00:00
Mark Johnston	98549e2dc6	Centralize the logic in vfs_vmio_unwire() and sendfile_free_page(). Both of these functions atomically unwire a page, optionally attempt to free the page, and enqueue or requeue the page. Add functions vm_page_release() and vm_page_release_locked() to perform the same task. The latter must be called with the page's object lock held. As a side effect of this refactoring, the buffer cache will no longer attempt to free mapped pages when completing direct I/O. This is consistent with the handling of pages by sendfile(SF_NOCACHE). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20986	2019-07-29 22:01:28 +00:00
Mark Johnston	b16e57a6c9	Rename vm_page_{import,release}() to vm_page_zone_{import,release}(). I would like to use the name vm_page_release() for a different purpose, and vm_page_{import,release}() are local to vm_page.c. Reviewed by: kib MFC after: 1 week	2019-07-20 18:25:41 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
Mark Johnston	46736e306c	Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set. Pages with PG_PCPU_CACHE set cannot have been allocated from a reservation, so as an optimization, skip the call to vm_reserv_free_page() in this case. Otherwise, the access of the corresponding reservation structure often results in a cache miss. Reviewed by: alc, kib Discussed with: jeff MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20859	2019-07-08 19:02:40 +00:00
Mark Johnston	d9a73522e3	Add a per-CPU page cache per VM free pool. Some workloads benefit from having a per-CPU cache for VM_FREEPOOL_DIRECT pages. Reviewed by: dougm, kib Discussed with: alc, jeff MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20858	2019-07-08 18:56:30 +00:00
Mark Johnston	9f74cdbf78	Mark pages allocated from the per-CPU cache. Only free pages to the cache when they were allocated from that cache. This mitigates rapid fragmentation of physical memory seen during poudriere's dependency calculation phase. In particular, pages belonging to broken reservations are no longer freed to the per-CPU cache, so they get a chance to coalesce with freed pages during the break. Otherwise, the optimized CoW handler may create object chains in which multiple objects contain pages from the same reservation, and the order in which we do object termination means that the reservation is broken before all of those pages are freed, so some of them end up in the per-CPU cache and thus permanently fragment physical memory. The flag may also be useful for eliding calls to vm_reserv_free_page(), thus avoiding memory accesses for data that is likely not present in the CPU caches. Reviewed by: alc Discussed with: jeff MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20763	2019-07-02 19:51:40 +00:00
Mark Johnston	0fd977b3fa	Add a return value to vm_page_remove(). Use it to indicate whether the page may be safely freed following its removal from the object. Also change vm_page_remove() to assume that the page's object pointer is non-NULL, and have callers perform this check instead. This is a step towards an implementation of an atomic reference counter for each physical page structure. Reviewed by: alc, dougm, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20758	2019-06-26 17:37:51 +00:00
Mark Johnston	ee1f168540	Group vm_page_activate()'s definition with other related functions. No functional change intended. MFC after: 3 days	2019-06-19 21:36:00 +00:00
Mark Johnston	2d2748710a	Remove an outdated header comment for vm_page.c. The listed rules were incomplete and outdated. There is a much more comprehensive comment in vm_page.h. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20503	2019-06-04 18:38:27 +00:00
Alan Cox	2d5039db18	Retire vm_reserv_extend_{contig,page}(). These functions were introduced as part of a false start toward fine-grained reservation locking. In the end, they were not needed, so eliminate them. Order the parameters to vm_reserv_alloc_{contig,page}() consistently with the vm_page functions that call them. Update the comments about the locking requirements for vm_reserv_alloc_{contig,page}(). They no longer require a free page queues lock. Wrap several lines that became too long after the "req" and "domain" parameters were added to vm_reserv_alloc_{contig,page}(). Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20492	2019-06-03 05:15:36 +00:00
Mark Johnston	d842aa5114	Add a vm_page_wired() predicate. Use it instead of accessing the wire_count field directly. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20485	2019-06-02 01:00:17 +00:00
Doug Moore	b8590dae50	The function vm_phys_free_contig invokes vm_phys_free_pages for every power-of-two page block it frees, launching an unsuccessful search for a buddy to pair up with each time. The only possible buddy-up mergers are across the boundaries of the freed region, so change vm_phys_free_contig simply to enqueue the freed interior blocks, via a new function vm_phys_enqueue_contig, and then call vm_phys_free_pages on the bounding blocks to create as big a cross-boundary block as possible after buddy-merging. The only callers of vm_phys_free_contig at the moment call it in situations where merging blocks across the boundary is clearly impossible, so just call vm_phys_enqueue_contig in those places and avoid trying to buddy-up at all. One beneficiary of this change is in breaking reservations. For the case where memory is freed in breaking a reservation with only the first and last pages allocated, the number of cycles consumed by the operation drops about 11% with this change. Suggested by: alc Reviewed by: alc Approved by: kib, markj (mentors) Differential Revision: https://reviews.freebsd.org/D16901	2019-05-31 21:02:42 +00:00
Mark Johnston	42447bb506	Remove a redundant vm_page_remove() call. vm_page_free_prep() removes the page from its object. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20469	2019-05-31 14:59:40 +00:00
Mark Johnston	3b5b20292b	Implement minidump support for RISC-V. Submitted by: Mitchell Horne <mhorne063@gmail.com> Differential Revision: https://reviews.freebsd.org/D18320	2019-03-06 00:01:06 +00:00
Mark Johnston	1e2b3e6f92	Allow vm_page_free_prep() to dequeue pages without the page lock. This is a step towards being able to free pages without the page lock held. The approach is simply to add an implementation of vm_page_dequeue_deferred() which does not assert that the page lock is held. Formally, the page lock is required to set PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the same mutual exclusion for free by virtue of the fact that no other references to the page may exist. No functional change intended. Reviewed by: kib (previous version) MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19065	2019-02-03 18:43:20 +00:00
Mark Johnston	d0488e698f	Fix a race in vm_page_dequeue_deferred(). To detect the case where the page is already marked for a deferred dequeue, we must read the "queue" and "aflags" fields in a precise order. Otherwise, a race with a concurrent vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE set despite it already having been dequeued. Fix the problem by using vm_page_queue() to check the queue state, which correctly handles the race. Reviewed by: kib Tested by: pho MFC after: 3 days Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19039	2019-02-03 18:38:58 +00:00
Gleb Smirnoff	bb15d1c778	o Move zone limit from keg level up to zone level. This means that now two zones sharing a keg may have different limits. Now this is going to work: zone = uma_zcreate(); uma_zone_set_max(zone, limit); zone2 = uma_zsecond_create(zone); uma_zone_set_max(zone2, limit2); Kegs no longer have uk_maxpages field, but zones have uz_items. When set, it may be rounded up to minimum possible CPU bucket cache size. For small limits bucket cache can also be reconfigured to be smaller. Counter uz_items is updated whenever items transition from keg to a bucket cache or directly to a consumer. If zone has uz_maxitems set and it is reached, then we are going to sleep. o Since new limits don't play well with multi-keg zones, remove them. The idea of multi-keg zones was introduced exactly 10 years ago, and never have had a practical usage. In discussion with Jeff we came to a wild agreement that if we ever want to reintroduce the idea of a smart allocator that would be able to choose between two (or more) totally different backing stores, that choice should be made one level higher than UMA, e.g. in malloc(9) or in mget(), or whatever and choice should be controlled by the caller. o Sleeping code is improved to account number of sleepers and wake them one by one, to avoid thundering herd problem. o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache() KPI added. Having no bucket cache basically means setting maxcache to 0. o Now with many fields added and many removed (no multi-keg zones!) make sure that struct uma_zone is perfectly aligned. Reviewed by: markj, jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D17773	2019-01-15 00:02:06 +00:00
Gleb Smirnoff	9cc36b3dab	Fix regression in r331368, that broke dumping of UMA startup pages when WITNESS is present. Discussed with: markj	2019-01-07 23:17:09 +00:00
Konstantin Belousov	7af4985245	Add 'v' modifier to the ddb 'show pginfo' command to display vm_page backing the provided kernel virtual address. Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-30 15:58:18 +00:00
Mark Johnston	e31fc3ab13	Update the free page count when blacklisting pages. Otherwise the free page count will not accurately reflect the physical page allocator's state. On 11 this can trigger panics in vm_page_alloc() since the allocator state and free page count are updated atomically and we expect them to stay in sync. On 12 the bug would manifest as threads looping in vm_page_alloc(). PR: 231296 Reported by: mav, wollman, Rainer Duffner, Josh Gitlin Reviewed by: alc, kib, mav MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18374	2018-11-29 16:31:01 +00:00
Mark Johnston	920239efde	Fix some problems that manifest when NUMA domain 0 is empty. - In uma_prealloc(), we need to check for an empty domain before the first allocation attempt, not after. Fix this by switching uma_prealloc() to use a vm_domainset iterator, which addresses the secondary issue of using a signed domain identifier in round-robin iteration. - Don't automatically create a page daemon for domain 0. - In domainset_empty_vm(), recompute ds_cnt and ds_order after excluding empty domains; otherwise we may frequently specify an empty domain when calling in to the page allocator, wasting CPU time. Convert DOMAINSET_PREF() policies for empty domains to round-robin. - When freeing bootstrap pages, don't count them towards the per-domain total page counts for now: some vm_phys segments are created before the SRAT is parsed and are thus always identified as being in domain 0 even when they are not. Then, when bootstrap pages are freed, they are added to a domain that we had previously thought was empty. Until this is corrected, we simply exclude them from the per-domain page count. Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com> Reviewed by: gallatin MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17704	2018-10-30 17:57:40 +00:00
Mark Johnston	4c29d2de67	Refactor domainset iterators for use by malloc(9) and UMA. Before this change we had two flavours of vm_domainset iterators: "page" and "malloc". The latter was only used for kmem_() and hard-coded its behaviour based on kernel_object's policy. Moreover, its use contained a race similar to that fixed by r338755 since the kernel_object's iterator was being run without the object lock. In some cases it is useful to be able to explicitly specify a policy (domainset) or policy+iterator (domainset_ref) when performing memory allocations. To that end, refactor the vm_dominset_ KPI to permit this, and get rid of the "malloc" domainset_iter KPI in the process. Reviewed by: jeff (previous version) Tested by: pho (part of a larger patch) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17417	2018-10-23 16:35:58 +00:00
Mark Johnston	463406ac4a	Add more NUMA-specific low memory predicates. Use these predicates instead of inline references to vm_min_domains. Also add a global all_domains set, akin to all_cpus. Reviewed by: alc, jeff, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17278	2018-09-24 19:24:17 +00:00
Mark Johnston	7a364d458a	Split some checks in vm_page_activate() to make it easier to read. No functional change intended. Reviewed by: alc, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17028	2018-09-10 18:59:23 +00:00
Mark Johnston	5a7f993702	Relax an assertion in vm_pqbatch_process_page(). While executing vm_pqbatch_process_page(m), m->queue may change to PQ_NONE if the page daemon is concurrently freeing the page. In this case m's queue state flags must be clear, so vm_pqbatch_process_page() will be a no-op, but the race could cause spurious assertion failures. Correct the assertion which assumed that m->queue's value does not change while the page queue lock is held. Reviewed by: alc, kib Reported and tested by: pho Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17027	2018-09-08 21:49:43 +00:00
Mark Johnston	23984ce5cd	Avoid resource deadlocks when one domain has exhausted its memory. Attempt other allowed domains if the requested domain is below the minimum paging threshold. Block in fork only if all domains available to the forking thread are below the severe threshold rather than any. Submitted by: jeff Reported by: mjg Reviewed by: alc, kib, markj Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D16191	2018-09-06 19:28:52 +00:00
Mark Johnston	21f01f4584	Remove vm_page_remque(). Testing m->queue != PQ_NONE is not sufficient; see the commit log message for r338276. As of r332974 vm_page_dequeue() handles already-dequeued pages, so just replace vm_page_remque() calls with vm_page_dequeue() calls. Reviewed by: kib Tested by: pho Approved by: re (marius) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17025	2018-09-06 16:17:45 +00:00
Mark Johnston	899fe184c7	Add a per-pagequeue pdpages counter. Expose these counters under the vm.domain sysctl node. The existing vm.stats.vm.v_pdpages sysctl is preserved. Reviewed by: alc (previous version) Differential Revision: https://reviews.freebsd.org/D14666	2018-08-23 21:03:45 +00:00
Mark Johnston	99d92d732f	Ensure that queue state is cleared when vm_page_dequeue() returns. Per-page queue state is updated non-atomically, with either the page lock or the page queue lock held. When vm_page_dequeue() is called without the page lock, in rare cases a different thread may be concurrently dequeuing the page with the pagequeue lock held. Because of the non-atomic update, vm_page_dequeue() might return before queue state is completely updated, which can lead to race conditions. Restrict the vm_page_dequeue() interface so that it must be called either with the page lock held or on a free page, and busy wait when a different thread is concurrently updating queue state, which must happen in a critical section. While here, do some related cleanup: inline vm_page_dequeue_locked() into its only caller and delete a prototype for the unimplemented vm_page_requeue_locked(). Replace the volatile qualifier for "queue" added in r333703 with explicit uses of atomic_load_8() where required. Reported and tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D15980	2018-08-23 20:34:22 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possible other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Konstantin Belousov	a66d7a8ddc	Copyout(9) on 4/4 i386 needs correct vm_page_array[]. On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold() on arbitrary userspace mapping. If the mapping is backed by the non-managed cdev pager or by the sg pager, on dense configs we might access arbitrary element of vm_page_array[], in particular, not corresponding to a page from the memory segment. Initialize such pages as fictitious with the corresponding physical address. Reported by: bde Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D16085	2018-07-05 16:43:15 +00:00
Alan Cox	89ea39a727	Update the physical page selection strategy used by vm_page_import() so that it does not cause rapid fragmentation of the free physical memory. Reviewed by: jeff, markj (an earlier version) Differential Revision: https://reviews.freebsd.org/D15976	2018-06-26 18:29:56 +00:00
Jonathan T. Looney	16e05b3275	Fix a typo in vm_domain_set(). When a domain crosses into the severe range, we need to set the domain bit from the vm_severe_domains bitset (instead of clearing it). Reviewed by: jeff, markj Sponsored by: Netflix, Inc.	2018-06-07 13:29:54 +00:00
Mark Johnston	a99ee60b9a	Ensure that "m" is initialized in vm_page_alloc_freelist_domain(). While here, remove a superfluous comment. Coverity CID: 1383559 MFC after: 3 days	2018-05-22 16:19:48 +00:00
Mark Johnston	ba2b3349e1	Fix a race in vm_page_pagequeue_lockptr(). The value of m->queue must be cached after comparing it with PQ_NONE, since it may be concurrently changing. Reported by: glebius Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D15462	2018-05-17 04:27:08 +00:00
Mark Johnston	1b5c869d64	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Intoduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280	2018-05-04 17:17:30 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Mark Johnston	64b3893010	Initialize marker pages in vm_page_domain_init(). They were previously initialized by the corresponding page daemon threads, but for vmd_inacthead this may be too late if vm_page_deactivate_noreuse() is called during boot. Reported and tested by: cperciva Reviewed by: alc, kib MFC after: 1 week	2018-04-19 14:09:44 +00:00
Mark Johnston	9de8fcfddf	Ensure that m and skip_m belong to the same object. Pages allocated from a given reservation may belong to different objects. It is therefore possible for vm_page_ps_test() to be called with the base page's object unlocked. Check for this case before asserting that the object lock is held. Reported by: jhb Reviewed by: kib MFC after: 1 week	2018-04-17 18:49:17 +00:00
Konstantin Belousov	e55d32b7b3	Handle Skylake-X errata SKZ63. SKZ63 Processor May Hang When Executing Code In an HLE Transaction Region Problem: Under certain conditions, if the processor acquires an HLE (Hardware Lock Elision) lock via the XACQUIRE instruction in the Host Physical Address range between 40000000H and 403FFFFFH, it may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS. Move the pages from the range into the blacklist. Add a tunable to not waste 4M if local DoS is not the issue. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15001	2018-04-07 17:06:13 +00:00
Jeff Roberson	c33e3a642b	Add a uma cache of free pages in the DEFAULT freepool. This gives us per-cpu alloc and free of pages. The cache is filled with as few trips to the phys allocator as possible by the use of a new vm_phys_alloc_npages() function which allocates as many as N pages. This code was originally by markj with the import function rewritten by me. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14905	2018-04-01 04:50:05 +00:00
Jeff Roberson	e5818a53db	Implement several enhancements to NUMA policies. Add a new "interleave" allocation policy which stripes pages across domains with a stride or width keeping contiguity within a multi-page region. Move the kernel to the dedicated numbered cpuset #2 making it possible to assign kernel threads and memory policy separately from user. This also eliminates the need for the complicated interrupt binding code. Add a sysctl API for viewing and manipulating domainsets. Refactor some of the cpuset_t manipulation code using the generic bitset type so that it can be used for both. This probably belongs in a dedicated subr file. Attempt to improve the include situation. Reviewed by: kib Discussed with: jhb (cpuset parts) Tested by: pho (before review feedback) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14839	2018-03-29 02:54:50 +00:00
Jeff Roberson	2d3f4181de	Fix two compliation problems on non-amd64 architectures.	2018-03-23 18:24:02 +00:00
Mark Johnston	4046851367	Correct a couple of assertion messages in vm_page_reclaim_run(). MFC after: 3 days	2018-03-23 14:38:56 +00:00
Jeff Roberson	5c930c894d	Lock reservations with a dedicated lock in each reservation. Protect the vmd_free_count with atomics. This allows us to allocate and free from reservations without the free lock except where a superpage is allocated from the physical layer, which is roughly 1/512 of the operations on amd64. Use the counter api to eliminate cache conention on counters. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14707	2018-03-22 19:21:11 +00:00
Jeff Roberson	9a4b4cd3bc	Start witness much earlier in boot so that we can shrink the pend list and make it more immune to further change. Reviewed by: markj, imp (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon	2018-03-22 19:11:43 +00:00
Mark Johnston	0eb50f9cd2	Have vm_page_{deactivate,launder}() requeue already-queued pages. In many cases the page is not enqueued so the change will have no effect. However, the change is needed to support an optimization in the fault handler and in some cases (sendfile, the buffer cache) it was being emulated by the caller anyway. Reviewed by: alc Tested by: pho MFC after: 2 weeks X-Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:40:56 +00:00
Mark Johnston	434862acb1	Have vm_page_replace() assert that the new page is not enqueued. The new page does not belong to a VM object, but the page daemon does not expect to encounter such pages. Reviewed by: alc, kib Tested by: pho MFC after: 1 week X-Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:35:40 +00:00
Jeff Roberson	30fbfdda6c	Eliminate pageout wakeup races. Take another step towards lockless vmd_free_count manipulation. Reduce the scope of the free lock by using a pageout lock to synchronize sleep and wakeup. Only trigger the pageout daemon on transitions between states. Drive all wakeup operations directly as side-effects from freeing memory rather than requiring an additional function call. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14612	2018-03-15 19:23:07 +00:00
Konstantin Belousov	741e1c9196	Revert the chunk from r330410 in vm_page_reclaim_run(). There, the pages freed might be managed but the page's lock is not owned. For KPI correctness, the page lock is requried around the call to vm_page_free_prep(), which is asserted. Reclaim loop already did the work which could be done by vm_page_free_prep(), so the lock is not needed and the only consequence of not owning it is the assert trigger. Instead of adding the locking to satisfy the assert, revert to the code that calls vm_page_free_phys() directly. Reported by: pho Discussed with: jeff Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-13 18:27:23 +00:00
Konstantin Belousov	2a8e8f7892	Remove redundant test from r330410. If the input slist is non-empty, counter cannot be zero after freeing. Noted by: mjg MFC after: 2 weeks	2018-03-04 21:15:31 +00:00
Konstantin Belousov	8c8ee2ee1c	Unify bulk free operations in several pmaps. Submitted by: Yoshihiro Ota Reviewed by: markj MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13485	2018-03-04 20:53:20 +00:00
Mark Johnston	9140bff7ed	Remove a bogus assertion from vm_page_launder(). After r328977, a wired page m may have m->queue != PQ_NONE. Reviewed by: kib X-MFC with: r328977 Differential Revision: https://reviews.freebsd.org/D14485	2018-02-23 23:25:22 +00:00
Jeff Roberson	5f8cd1c0bf	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402	2018-02-23 22:51:51 +00:00
Konstantin Belousov	2c0f13aa59	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread' policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Scetched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384	2018-02-20 10:13:13 +00:00
Jeff Roberson	e958ad4cf3	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Gleb Smirnoff	5073a08328	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
Mark Johnston	1d3a1bcfac	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943	2018-02-07 16:57:10 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Gleb Smirnoff	ae941b1b4e	Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC. o Call uma_startup1() after initializing kmem, vmem and domains. o Include 8 eight VM startup pages into uma_startup_count() calculation. o Account for vmem_startup() and vm_map_startup() preallocating pages. o Account for extra two allocations done by kmem_init() and vmem_create(). o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT allowed several other SYSINITs to sneak in before it, thus bumping requirement for amount of boot pages.	2018-02-06 22:06:59 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
Jeff Roberson	3f289c3fcf	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
Mark Johnston	280d15cd0a	Fix two problems with the page daemon control loop. Both issues caused the page daemon to erroneously go to sleep when applications are consuming free pages at a high rate, leaving the application threads blocked in VM_WAIT. 1) After completing an inactive queue scan, concurrent allocations may have prevented the page daemon from meeting the v_free_min threshold. In this case, the page daemon was going to sleep even when the inactive queue contained plenty of clean pages. 2) pagedaemon_wakeup() may be called without the free queues lock held. This can lead to a lost wakeup if a call occurs after the page daemon clears vm_pageout_wanted but before going to sleep. Fix 1) by ensuring that we start a new inactive queue scan immediately if v_free_count < v_free_min after a prior scan. Fix 2) by adding a new subroutine, pagedaemon_wait(), called from vm_wait() and vm_waitpfault(). It wakes up the page daemon if either vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps on v_free_count. Reported by: jeff Reviewed by: alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13424	2017-12-24 19:45:16 +00:00
Michael Zhilin	0db2102aaa	[mips] [vm] restore translation of freelist to flind for page allocation Commit r326346 moved domain iterators from physical layer to vm_page one, but it also removed translation of freelist to flind for vm_page_alloc_freelist() call. Before it expects VM_FREELIST_ parameter, but after it expect freelist index. On small WiFi boxes with few megabytes of RAM, there is only one freelist VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file sys/mips/include/vmparam.h). It results in freelist 1 with flind 0. At first, this commit renames flind to freelist in vm_page_alloc_freelist to avoid misunderstanding about input parameters. Then on physical layer it restores translation for correct handling of freelist parameter. Reported by: landonf Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D13351	2017-12-04 08:08:55 +00:00
Pedro F. Giffuni	796df753f4	SPDX: Consider code from Carnegie-Mellon University. Interesting cases, most likely from CMU Mach sources.	2017-11-30 15:48:35 +00:00
Mark Johnston	1084894f80	Remove some comments that became incorrect with r325530.	2017-11-29 14:34:05 +00:00
Jeff Roberson	ef435ae7de	Move domain iterators into the page layer where domain selection should take place. This makes the majority of the phys layer explicitly domain specific. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix & Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13014	2017-11-28 23:18:35 +00:00
Mark Johnston	d2b677cef6	Avoid unnecessary lookups when initializing the vm_page array. This gives a marginal improvement in the vm_page_array initialization time. Also garbage-collect the now-unused vm_phys_paddr_to_segind(). Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13270	2017-11-27 17:46:38 +00:00
Mark Johnston	b20bf182e6	Move vm_phys_init_page() to vm_page.c. Suggested by: kib Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13250	2017-11-26 19:17:55 +00:00
Mark Johnston	5070d56d41	Allow for fictitious physical pages in vm_page_scan_contig(). Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter fictitous pages. For now, allow for this possibility; such pages will be skipped later in the scan since they are wired. Reported by: avg Reviewed by: kib MFC after: 1 week	2017-11-21 13:17:40 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Jeff Roberson	8d6fbbb867	Replace manyinstances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Alan Cox	3a757e5403	Micro-optimize the handling of fictitious pages in vm_page_free_prep(). A fictitious page is always wired, so there is no point in trying to remove one from the page queues. Completely remove one inaccurate comment from vm_page_free_prep() and correct another. Reviewed by: kib, markj MFC after: 1 week	2017-10-24 17:14:53 +00:00
Konstantin Belousov	422fe502b3	Check that the page which is freed as zeroed, indeed has all-zero content. This catches some rare mysterious failures at the source. The check is only performed on architectures which implement direct map, and only enabled with option DIAGNOSTIC, similar to other costly consistency checks. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-10-21 17:28:12 +00:00
Konstantin Belousov	4313989360	In vm_page_free_phys_pglist(), do not take vm_page_queue_free_mtx if there is nothing to do. Suggested by: mjg Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-20 08:25:49 +00:00
Mateusz Guzik	1dbf52e7d9	Reduce traffic on vm_cnt.v_free_count The variable is modified with the highly contended page free queue lock. It unnecessarily shares a cacheline with purely read-only fields and is re-read after the lock is dropped in the page allocation code making the hold time longer. Pad the variable just like the others and store the value as found with the lock held instead of re-reading. Provides a modest 1%-ish speed up in concurrent page faults. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D12665	2017-10-13 21:54:34 +00:00
Alan Cox	43cc906f40	Change vm_page_try_to_free() to require a managed page. Essentially, vm_page_try_to_free() is testing conditions, like clean versus dirty, that only vary in managed pages. Suggested by: kib Reviewed by: markj X-MFC after: never	2017-09-24 23:35:01 +00:00
Alan Cox	494c6e43d3	Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all() can be avoided when the page's containing object has a reference count of zero. (If the object has a reference count of zero, then none of its pages can possibly be mapped.) Address nearby style issues in vm_page_try_to_free(), and change its return type to "bool". Reviewed by: kib, markj MFC after: 1 week	2017-09-24 16:50:10 +00:00
Konstantin Belousov	e82e50e681	Remove inline specifier from vm_page_free_wakeup(), do not micro-manage compiler. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-13 19:30:09 +00:00
Konstantin Belousov	540ac3b310	Split vm_page_free_toq() into two parts, preparation vm_page_free_prep() and insertion into the phys allocator free queues vm_page_free_phys(). Also provide a wrapper vm_page_free_phys_pglist() for batched free. Reviewed by: alc, markj Tested by: mjg (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-13 19:11:52 +00:00
Mark Johnston	2934eb8a22	Fix a logic error in the item size calculation for internal UMA zones. Kegs for internal zones always keep the slab header in the slab itself. Therefore, when determining the allocation size, we need to take the slab header size into account. Reported and tested by: ae, rakuco Reviewed by: avg MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D12342	2017-09-13 15:44:54 +00:00
Konstantin Belousov	93c5d3a46a	Add a vm_page_change_lock() helper, the common code to not relock page lock if both old and new pages use the same underlying lock. Convert existing places to use the helper instead of inlining it. Use the optimization in vm_object_page_remove(). Suggested and reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-09 17:35:19 +00:00
Mark Johnston	f93f7cf199	Speed up vm_page_array initialization. We currently initialize the vm_page array in three passes: one to zero the array, one to initialize the "order" field of each page (necessary when inserting them into the vm_phys buddy allocator one-by-one), and one to initialize the remaining non-zero fields and individually insert each page into the allocator. Merge the three passes into one following a suggestion from alc: initialize vm_page fields in a single pass, and use vm_phys_free_contig() to efficiently insert physical memory segments into the buddy allocator. This reduces the initialization time to a third or a quarter of what it was before on most systems that I tested. Reviewed by: alc, kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D12248	2017-09-07 21:43:39 +00:00
Mateusz Guzik	fe933c1d88	Start annotating global _padalign locks with __exclusive_cache_line While these locks are guarnteed to not share their respective cache lines, their current placement leaves unnecessary holes in lines which preceeded them. For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour cachelines (previously separate by the lock) to be collapsed into 1. The annotation is only effective on architectures which have it implemented in their linker script (currently only amd64). Thus locks are not converted to their not-padaligned variants as to not affect the rest. MFC after: 1 week	2017-09-06 20:28:18 +00:00
Mark Johnston	33fff5d536	Add vm_page_alloc_after(). This is a variant of vm_page_alloc() which accepts an additional parameter: the page in the object with largest index that is smaller than the requested index. vm_page_alloc() finds this page using a lookup in the object's radix tree, but in some cases its identity is already known, allowing the lookup to be elided. Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after(). vm_page_alloc() is converted into a trivial wrapper of vm_page_alloc_after(). Suggested by: alc Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11984	2017-08-15 16:39:49 +00:00
Mark Johnston	9df950b35d	Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT. This will allow its use in sendfile_swapin(). Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11942	2017-08-11 16:29:22 +00:00
Mark Johnston	2c642ec1e7	Make vm_page_sunbusy() assert that the page is unlocked. Reviewed by: kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11946	2017-08-10 22:43:38 +00:00
Alan Cox	5471caf6f1	Introduce vm_page_grab_pages(), which is intended to replace loops calling vm_page_grab() on consecutive page indices. Besides simplifying the code in the caller, vm_page_grab_pages() allows for batching optimizations. For example, the current implementation replaces calls to vm_page_lookup() on consecutive page indices by cheaper calls to vm_page_next(). Reviewed by: kib, markj Tested by: pho (an earlier version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11926	2017-08-09 04:23:04 +00:00
Alan Cox	1d3b9818e7	In vm_page_ps_test(), always check that the base pages within the specified superpage all belong to the same object. To date, that check has not been needed, but upcoming changes require it. (See the Differential Revision.) Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11556	2017-07-23 05:54:56 +00:00
Alan Cox	8830260128	Generalize vm_page_ps_is_valid() to support testing other predicates on the (super)page, renaming the function to vm_page_ps_test(). Reviewed by: kib, markj MFC after: 1 week	2017-07-14 02:15:48 +00:00
John Baldwin	4bd7e351f1	Fix an off-by-one error in the VM page array on some systems. r31386 changed how the size of the VM page array was calculated to be less wasteful. For most systems, the amount of memory is divided by the overhead required by each page (a page of data plus a struct vm_page) to determine the maximum number of available pages. However, if the remainder for the first non-available page was at least a page of data (so that the only memory missing was a struct vm_page), this last page was left in phys_avail[] but was not allocated an entry in the VM page array. Handle this case by explicitly excluding the page from phys_avail[]. Reviewed by: alc Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D11000	2017-06-08 16:18:41 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Alan Cox	8a99f1cc59	Over the years, the code and comments in vm_page_startup() have diverged in one respect. When determining how many page structures to allocate, contrary to what the comments say, the code does not account for the overhead of a page structure per page of physical memory. This revision changes the code to match the comments. Reviewed by: kib, markj MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D9081	2017-02-04 05:23:10 +00:00
Mark Johnston	c2655a40a7	Avoid unnecessary page lookups in vm_object_madvise(). vm_object_madvise() is frequently used to apply advice to a contiguous set of pages in an object with no backing object. Optimize this case by skipping non-resident subranges in constant time, and by iterating over resident pages using the object memq, thus avoiding radix tree lookups on each page index in the specified range. While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the "advise" parameter to vm_object_madvise() to "advice." Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9098	2017-01-15 03:50:08 +00:00
Gleb Smirnoff	bfc8c24c73	Move bogus_page declaration to vm_page.h and initialization to vm_page.c. Reviewed by: kib	2017-01-04 22:27:19 +00:00
Mark Johnston	b1fd102ee7	Add a page queue for holding dirty anonymous unswappable pages. On systems without a configured swap device, an attempt to launder pages from a swap object will always fail and result in the page being reactivated. This means that the page daemon will continuously scan pages that can never be evicted. With this change, anonymous pages are instead moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device is configured, so unreferenced unswappable pages are excluded from the page daemon's workload. Reviewed by: alc	2017-01-03 00:05:44 +00:00
Konstantin Belousov	0c8bd6a7d8	Assert that the pages found on the object queue by vm_page_next() and vm_page_prev() have correct ownership. In collaboration with: alc Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2016-12-30 17:37:06 +00:00
Alan Cox	920da7e4d2	Relax the object type restrictions on vm_page_alloc_contig(). Specifically, add support for object types that were previously prohibited because they could contain PG_CACHED pages. Roughly halve the number of radix trie operations performed by vm_page_alloc_contig() using the same approach that is employed by vm_page_alloc(). Also, eliminate the radix trie lookup performed with the free page queues lock held. Tidy up the handling of radix trie insert failures in vm_page_alloc() and vm_page_alloc_contig(). Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8878	2016-12-28 18:32:13 +00:00
Alan Cox	3453bca864	Eliminate every mention of PG_CACHED pages from the comments in the machine- independent layer of the virtual memory system. Update some of the nearby comments to eliminate redundancy and improve clarity. In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per The Chicago Manual of Style. Update the comment in vm/vm_page.h defining the four types of page queues to reflect the elimination of PG_CACHED pages and the introduction of the laundry queue. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8752	2016-12-12 17:47:09 +00:00
Alan Cox	e94965d82e	Previously, vm_radix_remove() would panic if the radix trie didn't contain a vm_page_t at the specified index. However, with this change, vm_radix_remove() no longer panics. Instead, it returns NULL if there is no vm_page_t at the specified index. Otherwise, it returns the vm_page_t. The motivation for this change is that it simplifies the use of radix tries in the amd64, arm64, and i386 pmap implementations. Instead of performing a lookup before every remove, the pmap can simply perform the remove. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D8708	2016-12-08 04:29:29 +00:00
Alan Cox	ba67369628	Recursion on the free page queue mutex occurred when UMA needed to allocate a new page of radix trie nodes to complete a vm_radix_insert() operation that was requested by vm_page_cache(). Specifically, vm_page_cache() already held the free page queue lock when UMA tried to acquire it through a call to vm_page_alloc(). This code path no longer exists, so there is no longer any reason to allow recursion on the free page queue mutex. Improve nearby comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8628	2016-11-27 01:42:53 +00:00
Alan Cox	bba39b9ae3	Remove PG_CACHED-related fields from struct vmmeter, because they are no longer used. More precisely, they are always zero because the code that decremented and incremented them no longer exists. Bump __FreeBSD_version to mark this change. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8583	2016-11-22 18:13:46 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Alan Cox	ebcddc7217	Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specificially, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302	2016-11-09 18:48:37 +00:00
Konstantin Belousov	1771e987ca	Do not sleep in vm_wait() if pagedaemon did not yet started. Panic instead. Requests which cannot be satisfied by allocators at boot time often have unrealizable parameters. Waiting for the pagedaemon' start would hang the boot if done in the thread0 context and just never succeed if executed from another thread. In fact, for very early stages, sleep attempt panics with obscure diagnostic about the scheduler state, and explicit panic in vm_wait() makes the investigation much shorter by cut off the examination of the thread and scheduler. Theoretically, some subsystem might grab a resource to exhaustion, and free it later in the boot process. If this unlikely scenario does appear for real, the way to diagnose the trouble can be revisited. Reported by: emaste Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8421	2016-11-04 12:58:50 +00:00
Konstantin Belousov	bd9546a21c	Export vm_page_xunbusy_maybelocked(). Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8197	2016-10-17 08:14:23 +00:00
Konstantin Belousov	5975e53d40	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there is still no waiters recorded for the busy state, the xbusy owner did not acquired the page lock, so it proceeded. More, suppose that some other thread happen to share-busy the page after xbusy state was relinquished but before the m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs vm_object lock to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to the VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads, the only way to have more that one sbusy owner is right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196	2016-10-13 14:41:05 +00:00
Konstantin Belousov	267ed8e2f7	When downgrading exclusively busied page to shared-busy state, wakeup waiters. Otherwise, owners of the shared-busy state are left blocked and might get into a deadlock. Note that the vm_page_busy_downgrade() function is not used in the tree right now. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8195	2016-10-11 18:09:37 +00:00
Alan Cox	70cf3ced3c	Make the page daemon's notion of what kind of pass is being performed by vm_pageout_scan() local to vm_pageout_worker(). There is no reason to store the pass in the NUMA domain structure. Reviewed by: kib MFC after: 3 weeks	2016-10-05 17:32:06 +00:00
Mark Johnston	dbbaf04f1e	Remove support for idle page zeroing. Idle page zeroing has been disabled by default on all architectures since r170816 and has some bugs that make it seemingly unusable. Specifically, the idle-priority pagezero thread exacerbates contention for the free page lock, and yields the CPU without releasing it in non-preemptive kernels. The pagezero thread also does not behave correctly when superpage reservations are enabled: its target is a function of v_free_count, which includes reserved-but-free pages, but it is only able to zero pages belonging to the physical memory allocator. Reviewed by: alc, imp, kib Differential Revision: https://reviews.freebsd.org/D7714	2016-09-03 20:38:13 +00:00
Mark Johnston	915d1b71cd	Restore swap pager readahead after r292373. The removal of vm_fault_additional_pages() meant that a hard fault on a swap-backed page would result in only that page being read in. This change implements readahead and readbehind for the swap pager in swap_pager_getpages(). swap_pager_haspage() is modified to return the largest contiguous non-resident range of pages containing the requested range. Reviewed by: alc, kib Tested by: pho MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7677	2016-08-30 05:56:21 +00:00
Mark Johnston	842ee21e20	Strengthen assertions about the busy state of newly-allocated pages. Reviewed by: alc MFC after: 1 week	2016-08-13 19:49:32 +00:00
Mark Johnston	897d0c6617	Use vm_page_undirty() instead of manually setting a page field. Reviewed by: alc MFC after: 3 days	2016-07-29 21:05:37 +00:00
Mark Johnston	efe1ff4cf0	Update a comment in vm_page_advise() to match behaviour after r290529. Reviewed by: alc MFC after: 3 days	2016-07-23 21:02:36 +00:00
Colin Percival	34caa842a4	Autotune the number of pages set aside for UMA startup based on the number of CPUs present. On amd64 this unbreaks the boot for systems with 92 or more CPUs; the limit will vary on other systems depending on the size of their uma_zone and uma_cache structures. The major consumer of pages during UMA startup is the 19 zone structures which are set up before UMA has bootstrapped itself sufficiently to use the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 / 16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP, KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy more than one page, they will not share pages and the number of pages currently needed for startup is 19 * pages_per_zone + N, where N is the number of pages used for allocating other structures; on amd64 N = 3 at present (2 pages are allocated for UMA Kegs, and one page for UMA Hash). This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32, and if a zone structure does not fit into a single page sets boot_pages to UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which remains at 64). Consequently this patch has no effect on systems where the zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but increases boot_pages sufficiently on systems where the large number of CPUs makes this structure larger. It seems safe to assume that systems with 60+ CPUs can afford to set aside an additional 128kB of memory per 32 CPUs. The vm.boot_pages tunable continues to override this computation, but is unlikely to be necessary in the future. Tested on: EC2 x1.32xlarge Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring vm.boot_pages to be manually adjusted. Reviewed by: jeff, alc, adrian Approved by: re (kib)	2016-07-07 18:37:12 +00:00
Konstantin Belousov	35e8002c58	In vm_page_xunbusy_maybelocked(), add fast path for unbusy when no waiters exist, same as for vm_page_xunbusy(). If previous value of busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is not needed. Move common code from vm_page_xunbusy_maybelocked() and vm_page_xunbusy_hard() to vm_page_xunbusy_locked(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-23 08:28:13 +00:00
Mark Johnston	0a1dc6e23c	Reset the page busy lock state after failing to insert into the object. Freeing a shared-busy page is not permitted. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6670	2016-06-02 17:11:24 +00:00
Mark Johnston	e705296958	Don't preserve the page's object linkage in vm_page_insert_after(). Per the KASSERT at the beginning of the function, we expect that the page does not belong to any object, so its object and pindex fields are meaningless. Reset them in the rare case that vm_radix_insert() fails. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6669	2016-06-02 16:58:47 +00:00
Konstantin Belousov	e5f0191f20	If the fast path unbusy in vm_page_replace() fails, slow path needs to acquire the page lock, which recurses. Avoid the recursion by reusing the code from vm_page_remove() in a new helper vm_page_xunbusy_maybelocked(). Reviewed by: alc Sponsored by: The FreeBSD Foundation	2016-06-01 20:39:00 +00:00

1 2 3 4 5 ...

814 Commits