Commit Graph

4203 Commits

kib
b5cd5f8b75 Check for wrap-around in vm_phys_alloc_seg_contig().
It is possible to provide insane values for size in a contigmalloc(9)
request, which usually do not reach the phys allocator due to the
failing KVA allocation.  But with the forthcoming 4/4 i386, where the
32bit architecture has almost 4G of KVA, contigmalloc(1G) is not
outright unreasonable and KVA might sometimes be available.

Then, the calculation of pa_end could wrap around, depending on the
physical address, and the checks in vm_phys_alloc_seg_contig() would
pass while the iteration in the loop after the 'done' label goes out
of the vm_page_array bounds.

Fix it by detecting the wrap.

Reported and tested by:	pho
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14767
2018-03-20 16:17:55 +00:00
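
A minimal C sketch of the wrap detection described above; the type and
names are illustrative stand-ins for the kernel's vm_paddr_t
arithmetic, not the committed code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t vm_paddr_sk;    /* stand-in for vm_paddr_t */

    /*
     * With a huge size, pa_start + size can wrap around the physical
     * address space; unsigned wrap is detected by comparing the sum
     * against one of the addends.
     */
    static bool
    range_wraps(vm_paddr_sk pa_start, vm_paddr_sk size)
    {
        vm_paddr_sk pa_end = pa_start + size;

        return (pa_end < pa_start);    /* true iff the addition wrapped */
    }

A wrapped unsigned sum is always smaller than either addend, which is
why the single comparison suffices.
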
markj
2172a46042 Avoid dequeuing the fault page during a soft fault.
Such pages are re-enqueued at the end of the fault handler, preserving
LRU. Rather than performing two separate operations per fault, simply
requeue the page at the end of the fault (or bump its activation count
if it resides in PQ_ACTIVE, avoiding the page queue lock entirely).
This elides some page lock and page queue lock operations in common
cases, e.g., CoW faults.

Note that we must still dequeue the source page for "optimized" CoW
faults since the page may not remain enqueued while it is moved to
another object.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14625
2018-03-18 16:49:30 +00:00
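
A hedged sketch of the end-of-fault policy described above; the struct
and helper names are hypothetical, not the kernel's API:

    enum queue_sk { PQ_NONE_SK, PQ_ACTIVE_SK, PQ_INACTIVE_SK };

    struct page_sk {
        enum queue_sk queue;
        int act_count;
    };

    void requeue_tail(struct page_sk *m);    /* hypothetical helper */

    /*
     * One requeue at the end of the fault preserves LRU; a resident
     * of PQ_ACTIVE only gets an activation bump, which requires no
     * page queue lock.
     */
    static void
    fault_page_done(struct page_sk *m)
    {
        if (m->queue == PQ_ACTIVE_SK)
            m->act_count++;       /* cheap path: no queue lock */
        else
            requeue_tail(m);      /* one requeue, not dequeue+enqueue */
    }
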
markj
3394e82adc Have vm_page_{deactivate,launder}() requeue already-queued pages.
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by:	alc
Tested by:	pho
MFC after:	2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:40:56 +00:00
markj
74019f419c Have vm_page_replace() assert that the new page is not enqueued.
The new page does not belong to a VM object, but the page daemon does
not expect to encounter such pages.

Reviewed by:	alc, kib
Tested by:	pho
MFC after:	1 week
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:35:40 +00:00
cem
2344d24206 Fix GCC build: Remove redundant pagedaemon_wakeup declaration
Introduced in r331018.

Reported by:	kevans
Sponsored by:	Dell EMC Isilon
2018-03-16 07:05:09 +00:00
jeff
1dfd513751 Eliminate pageout wakeup races. Take another step towards lockless
vmd_free_count manipulation.  Reduce the scope of the free lock by
using a pageout lock to synchronize sleep and wakeup.  Only trigger
the pageout daemon on transitions between states.  Drive all wakeup
operations directly as side-effects from freeing memory rather than
requiring an additional function call.

Reviewed by:	markj, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14612
2018-03-15 19:23:07 +00:00
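
A rough sketch of the transition-only wakeup idea, using C11 atomics
as a stand-in for the kernel's primitives; the flag and helper are
hypothetical:

    #include <stdatomic.h>
    #include <stdbool.h>

    void wakeup_pagedaemon(void);    /* hypothetical */

    static _Atomic bool below_target;

    /*
     * Freeing or allocating memory updates the state; the daemon is
     * poked only when the state actually changes, not on every check
     * of the free count.
     */
    static void
    note_free_count(long free_count, long target)
    {
        bool now = free_count < target;
        bool was = atomic_exchange(&below_target, now);

        if (now && !was)
            wakeup_pagedaemon();    /* wake only on the transition */
    }
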
kib
0050a49011 Revert the chunk from r330410 in vm_page_reclaim_run().
There, the pages freed might be managed but the page's lock is not
owned.  For KPI correctness, the page lock is required around the call
to vm_page_free_prep(), which is asserted.  The reclaim loop already
did the work that vm_page_free_prep() would do, so the lock is not
needed and the only consequence of not owning it is the triggered
assert.

Instead of adding the locking to satisfy the assert, revert to the
code that calls vm_page_free_phys() directly.

Reported by:	pho
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-13 18:27:23 +00:00
jeff
ff01cfd694 Don't assert that the domain free lock is held until we're certain that
there is a valid reservation.  This can trip erroneously when memory
falls within a domain but doesn't have the reservation initialized because
it does not meet size or alignment requirements.

Reported by:	pho, mjg
Sponsored by:	Netflix, Dell/EMC Isilon
2018-03-07 22:04:27 +00:00
kib
3b2ac3ad1a Remove redundant test from r330410.
If the input slist is non-empty, counter cannot be zero after freeing.

Noted by:	mjg
MFC after:	2 weeks
2018-03-04 21:15:31 +00:00
kib
d27cb27779 Unify bulk free operations in several pmaps.
Submitted by:	Yoshihiro Ota
Reviewed by:	markj
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13485
2018-03-04 20:53:20 +00:00
markj
0c9122e5a7 Give the 0th domain's page daemon thread a consistent name.
Page daemon threads for other domains show up in ps(1) output as
"pagedaemon/domN", so let that be the case for domain 0 as well.

Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D14518
2018-02-27 16:51:09 +00:00
markj
473b1fb5c3 Restore the pre-r329882 inactive page shortage computation.
With r329882, in the absence of a free page shortage we would only take
len(PQ_INACTIVE)+len(PQ_LAUNDRY) into account when deciding whether to
aggressively scan PQ_ACTIVE. Previously we would also include the
number of free pages in this computation, ensuring that we wouldn't scan
PQ_ACTIVE with plenty of free memory available. The change in behaviour
was most noticeable immediately after booting, when PQ_INACTIVE and
PQ_LAUNDRY are nearly empty.

Reviewed by:	jeff
2018-02-24 20:47:22 +00:00
kib
6912c9b3af Hide all vm/vm_pageout.h content under #ifdef _KERNEL.
There are no parts of vm/vm_pageout.h useful for usermode
applications, even for specific applications like fstat and lsof.

In my opinion, this protection is redundant and instead userspace
should not include the header at all.  Since there are apparently
broken third-party codebases, give them a bit of slack by providing a
transitional period.

Reported by:	julian
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-02-24 10:26:26 +00:00
markj
8359115798 Correct some comments after r328954.
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14486
2018-02-23 23:27:53 +00:00
markj
ee8c4b29df Remove a bogus assertion from vm_page_launder().
After r328977, a wired page m may have m->queue != PQ_NONE.

Reviewed by:	kib
X-MFC with:	r328977
Differential Revision:	https://reviews.freebsd.org/D14485
2018-02-23 23:25:22 +00:00
jeff
6e1b76d26b Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload.  This is a reimplementation of work done by myself and mlaier at
Isilon.

Reviewed by:	bsdimp
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14402
2018-02-23 22:51:51 +00:00
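
A minimal sketch of one discrete PID step of the kind described, with
gains expressed as integer divisors; the struct and field names are
illustrative, not the committed interface:

    #include <stdint.h>

    struct pidctl_sk {
        int64_t setpoint;     /* e.g. the free-page target */
        int64_t integral;     /* accumulated error */
        int64_t prev_err;     /* error from the previous interval */
        int64_t pd, id, dd;   /* P/I/D divisors (inverse gains) */
    };

    /* Returns the controller output, e.g. how many pages to reclaim
     * during the next interval. */
    static int64_t
    pid_step(struct pidctl_sk *p, int64_t measured)
    {
        int64_t err = p->setpoint - measured;
        int64_t deriv = err - p->prev_err;

        p->integral += err;
        p->prev_err = err;
        return (err / p->pd + p->integral / p->id + deriv / p->dd);
    }

The integral term lets the controller anticipate sustained demand,
which is what smooths the output and avoids stalls.
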
kib
ee3d0fb8ef vm_wait() rework.
Make vm_wait() take a vm_object argument, which specifies the domain
set whose min condition to wait on.  If there is no object associated
with the wait, use curthread's policy domainset.  The mechanics of the
wait in vm_wait() and vm_wait_domain() are supplied by the new helper
vm_wait_doms(), which directly takes the bitmask of the domains to
wait on for the min condition to pass.

Eliminate pagedaemon_wait().  vm_domain_clear() handles the same
operations.

Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Sketched and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
Differential revision:	https://reviews.freebsd.org/D14384
2018-02-20 10:13:13 +00:00
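
A small sketch of the domain-bitmask walk at the heart of a
vm_wait_doms()-style helper; the predicate is hypothetical and the
builtin assumes GCC or Clang:

    #include <stdbool.h>
    #include <stdint.h>

    bool domain_passes_min(int domain);    /* hypothetical predicate */

    /* A sleeper is satisfied once any domain in the mask passes the
     * min free-page condition. */
    static bool
    any_domain_passes(uint64_t wdoms)
    {
        while (wdoms != 0) {
            int d = __builtin_ctzll(wdoms);    /* lowest set domain */

            if (domain_passes_min(d))
                return (true);
            wdoms &= wdoms - 1;    /* clear the bit just tested */
        }
        return (false);
    }
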
markj
d543f5d1ea Use the conventional name for an array of pages.
No functional change intended.

Discussed with:	kib
MFC after:	3 days
2018-02-16 15:38:22 +00:00
kib
b93d5395e3 Cleanup unused page argument for vm_reserv_break().
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14364
2018-02-14 00:34:02 +00:00
kib
c9f8f3e9be Ensure memory consistency on COW.
From the submitter description:
The process is forked, transitioning a map entry to COW
Thread A writes to a page on the map entry, faults, updates the pmap to
  writable at a new phys addr, and starts TLB invalidations...
Thread B acquires a lock, writes to a location on the new phys addr, and
  releases the lock
Thread C acquires the lock, reads from the location on the old phys addr...
Thread A ...continues the TLB invalidations which are completed
Thread C ...reads from the location on the new phys addr, and releases
  the lock

In this example Threads B and C lock, use, and unlock properly, and
neither owns the lock at the same time.  Thread A was writing
somewhere else on the page and so never had/needed the lock.  Thread
C, while it is the lock owner, sees a location that is only ever read
or modified under the lock change beneath it.

To fix this, perform a two-stage update of the copied PTE.  First,
the PTE is updated with the address of the new physical page with the
copied content, but in read-only mode.  The pmap locking and the page
busy state during the PTE update and TLB invalidation IPIs ensure that
any writer to the page cannot upgrade the PTE to the writable state
until all CPUs have updated their TLBs to no longer cache the old
mapping.  Then, after the busy state of the page is lifted, faults for
write can proceed and do not violate the consistency of the reads.

The change is done in vm_fault because most architectures do need
IPIs to invalidate remote TLBs.  Moreover, I think that hardware
guarantees of atomicity of the remote TLB invalidation are not enough
to prevent inconsistent reads of non-atomic data, like multi-word
accesses protected by a lock.  So instead of modifying each pmap's
invalidation code, I did it there.

Discovered and analyzed by: Elliott.Rabe@dell.com
Reviewed by:	markj
PR:	225584 (appeared to have the same cause)
Tested by:	Elliott.Rabe@dell.com, emaste, Mike Tancsa <mike@sentex.net>, truckman
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14347
2018-02-14 00:31:45 +00:00
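
A hedged sketch of the two-stage PTE update described above, with
made-up PTE bits and a hypothetical shootdown helper; real pmap code
differs per architecture:

    #include <stdint.h>

    #define PTE_VALID_SK 0x1u    /* illustrative PTE bits */
    #define PTE_WRITE_SK 0x2u

    void tlb_shootdown_all(void);    /* hypothetical: returns after IPIs */

    /* Stage 1: install the copied frame read-only while the page is
     * busied, so no writer can race the TLB shootdown. */
    static void
    cow_install_readonly(volatile uint64_t *pte, uint64_t new_pfn)
    {
        *pte = (new_pfn << 12) | PTE_VALID_SK;    /* no PTE_WRITE_SK */
        tlb_shootdown_all();    /* old mapping gone from every TLB */
    }

    /* Stage 2: the write-fault path upgrades the PTE, but only after
     * the page's busy state has been lifted. */
    static void
    cow_upgrade_writable(volatile uint64_t *pte)
    {
        *pte |= PTE_WRITE_SK;
    }
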
kib
f2d6aac90a Do not call pmap_enter() with invalid protection mode.
If the map entry relookup was performed due to mapping changes, we
need to ensure that there is still some requested access permission
bit which is compatible with the current vm_map_entry mode.  If not,
restart the handler from scratch instead of trying to save the
current progress.

Also adjust fault_type to not include cleared permission bits.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14347
2018-02-14 00:25:18 +00:00
kib
0c20f07bdd Do not leak rv->psind in some specific situations.
Suppose that we have an object with a mapped superpage, and that all
pages in the superpage are held (by some driver).  Additionally,
suppose that the object is terminated, e.g. because the only process
mapping it is exiting.  Then the reservation is broken, but the pages
cannot be freed until later, when they are unheld.  In this situation,
the reservation code cannot clear psind, since no pages are freed, so
a page is later freed and then reused with an invalid psind.

Clear psind in vm_reserv_break() to avoid this situation.

Reported and tested by:	Slava Shwartsman
Reviewed by:	markj
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14335
2018-02-13 15:36:28 +00:00
jeff
ba27b5187b Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
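
A rough model of the counter(9) idea with illustrative names; the real
implementation pads each slot to its own cache line and obtains the
current CPU id safely:

    #define NCPU_SK 64    /* illustrative bound */

    struct pcpu_counter_sk {
        long slot[NCPU_SK];    /* one slot per CPU */
    };

    /* Each CPU updates only its own slot, so a hot counter such as
     * v_wire_count stops bouncing between caches. */
    static void
    counter_add(struct pcpu_counter_sk *c, int curcpu, long v)
    {
        c->slot[curcpu] += v;
    }

    /* A read sums the slots and may be slightly stale. */
    static long
    counter_fetch(const struct pcpu_counter_sk *c)
    {
        long sum = 0;

        for (int i = 0; i < NCPU_SK; i++)
            sum += c->slot[i];
        return (sum);
    }
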
glebius
2c573b6df7 Fix boot_pages exhaustion on machines with many domains and cores, where
the size of a UMA zone allocation is greater than the page size.  In
this case the zone of zones cannot use UMA_MD_SMALL_ALLOC, and we need
to postpone switching this zone off of startup_alloc() until the VM is
fully launched.

o Always supply the number of VM zones to uma_startup_count().  On machines
  with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
  a page.  In the latter case, count the VM zones toward the number of
  allocations from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch away from
  itself any zone that is already capable of running the real alloc.
  In the worst-case scenario we may leak a single page here.  See the
  comment in uma_startup_count().
o Hardcode the call to uma_startup2() into vm_mem_init().  Otherwise some
  extra SYSINITs, e.g. vm_page_init(), may sneak in before it.
o While here, remove uma_boot_pages_mtx.  With recent changes to boot
  pages calculation, we are guaranteed to use all of the boot_pages
  in the early single-threaded stage.

Reported & tested by:	mav
2018-02-09 04:45:39 +00:00
glebius
ed8d237f4f Fix three miscalculations in the amount of boot pages:
o Most of the startup zones have struct uma_slab embedded into the slab,
  so provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
  when calculating how many pages a certain kind of allocation would
  require.  Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
  need +1 when calculating the amount of pages for kegs. [1]
o The zones of zones and zones of kegs have an arbitrary alignment of 32,
  and this also needs to be accounted for. [2]

While here, add more comments and improve diagnostic messages.

Reported by:	pho [1], jtl [2]
2018-02-07 18:32:51 +00:00
markj
d241ba38fa Dequeue wired pages lazily.
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by:	jeff
Discussed with:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D11943
2018-02-07 16:57:10 +00:00
glebius
8f926da325 Use correct arithmetic to calculate how many pages we need for kegs
and hashes.  There is no functional change with current sizes.
2018-02-06 22:13:40 +00:00
jeff
e67ec0d694 Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
glebius
7fbf23740f Improve DIAGNOSTIC printf. Report using a boot page every time regardless
of booted status.
2018-02-06 22:08:43 +00:00
glebius
59ed8012f8 Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC.
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include the eight VM startup pages into the uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for the extra two allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva().  Using SYSINIT
  allowed several other SYSINITs to sneak in before it, thus bumping
  the boot page requirement.
2018-02-06 22:06:59 +00:00
markj
f54f9b3947 Delete a declaration for a variable removed in r305362. 2018-02-06 17:26:11 +00:00
glebius
baecd1a4b3 Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
  vm_page_startup() finishes. After this stage we don't yet enable buckets,
  but we can ask VM for pages. Rename stages to meaningful names while here.
  New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
  BOOT_RUNNING.
  Enabling page alloc earlier allows us to dramatically reduce number of
  boot pages required. What is more important number of zones becomes
  consistent across different machines, as no MD allocations are done before
  the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
  startup_alloc(), however that may change, so vm_page_startup() provides
  its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
  functions calculates sizes of zones zone and kegs zone, and calculates how
  many pages UMA will need to bootstrap.
  It counts not only of zone structures, but also of kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
  mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
  but also as args.size.

Reviewed by:		imp, gallatin (earlier version)
Differential Revision:	https://reviews.freebsd.org/D14054
2018-02-06 04:16:00 +00:00
kib
e676dae295 On munlock(), unwire the correct page.
It is possible, for complex fork()/collapse situations, to have
sibling address spaces partially share shadow chains.  If one
sibling performs wiring, it can happen that a transient page, invalid
and busy, is installed into a shadow object which is visible to the
other sibling for the duration of vm_fault_hold().  When the backing
object contains the valid page, and the wiring is performed on a
read-only entry, the transient page is eventually removed.

But the sibling which observed the transient page might perform the
unwire, executing vm_object_unwire().  There, the first page found in
the shadow chain is considered the page that was wired for the
mapping, while it is really the page below it which is wired.  So we
unwire the wrong page, either triggering the asserts or breaking the
page's wire counter.

As the fix, wait for the busy state to finish if we find such a page
during the unwire, and restart the shadow chain walk after the sleep.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14184
2018-02-05 12:49:20 +00:00
kib
773ad4ba11 On pageout, in the generic vnode pager, for a partially dirty page, only
clear dirty bits for completely invalid blocks.

Otherwise we might not write out the last chunk that is shorter than
512 bytes, if the file end is not aligned on a disk block boundary.
This became important after r324794.

PR:	225586
Reported by:	tris_vern@hotmail.com
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-02-02 11:56:30 +00:00
kib
93e18e3197 Assign map->header values to avoid boundary checks.
In several places, an entry's start and end fields are checked after
excluding the possibility that the entry is map->header.  By assigning
max and min values to the start and end fields of map->header in
vm_map_init, the explicit map->header checks become unnecessary.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, kib, markj (previous version)
Tested by:	pho (previous version)
MFC after:	1 week
Differential Revision:  https://reviews.freebsd.org/D13735
2018-01-20 12:19:02 +00:00
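
A tiny sketch of the sentinel assignment described above (illustrative
types, not the vm_map code):

    #include <stdint.h>

    struct map_entry_sk {
        uint64_t start, end;
    };

    /* The header "ends" before the map's minimum address and "starts"
     * after its maximum, so ordinary range comparisons can never match
     * it and the explicit header tests go away. */
    static void
    map_header_init(struct map_entry_sk *header, uint64_t min_addr,
        uint64_t max_addr)
    {
        header->end = min_addr;
        header->start = max_addr;
    }
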
nwhitehorn
e79f2b9178 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
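
A minimal sketch of the PMAP_HAS_DMAP pattern; the helpers are
hypothetical and the macro's value is per-architecture:

    #include <stdint.h>

    #define PMAP_HAS_DMAP_SK 1    /* constant 0, constant 1, or runtime */

    void *phys_to_dmap(uint64_t pa);     /* hypothetical direct-map path */
    void *map_transiently(uint64_t pa);  /* hypothetical sfbuf-style path */

    /* When the macro is a compile-time 0 or 1 the dead branch is
     * removed entirely; otherwise this is an ordinary conditional. */
    static void *
    access_phys(uint64_t pa)
    {
        if (PMAP_HAS_DMAP_SK)
            return (phys_to_dmap(pa));
        return (map_transiently(pa));
    }
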
jeff
cc3d6a3370 Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.
Sponsored by:	Netflix, Dell/EMC Isilon
Discussed with:	jhb
2018-01-14 03:36:03 +00:00
jeff
bc9177f3a2 Add support for NUMA domains to bus dma tags. This causes all memory
allocated with a tag to come from the specified domain if it meets the
other constraints provided by the tag.  Automatically create a tag at
the root of each bus specifying the domain local to that bus if
available.

Reviewed by:	jhb, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13545
2018-01-12 23:34:16 +00:00
jeff
f375b4dd66 Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
jeff
e7c9f84113 Implement NUMA policy for kmem_*(9). This maintains compatibility with
reservations by giving each memory domain its own KVA space in vmem that
is naturally aligned on superpage boundaries.

Reviewed by:	alc, markj, kib  (some objections)
Sponsored by:	Netflix, Dell/EMC Isilon
Tested by;	pho
Differential Revision:	https://reviews.freebsd.org/D13289
2018-01-12 23:13:55 +00:00
jeff
01c8e28f80 Add files for r327895
Implement 'domainset', a cpuset based NUMA policy mechanism.  This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Clean up some header pollution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:57:57 +00:00
jeff
94c7af8ca2 Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Clean up some header pollution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
emaste
d5f26cfbaf ANSIfy function definitions in sys/vm/ 2018-01-12 03:50:44 +00:00
kib
565ce2a986 Restructure swapout tests after vm map locking was removed.
Consolidate the regions covered by the process lock.
Combine similar condition tests into one, e.g. all process flags can
be tested with one logical operation.
Add a check for the in-exec state, since p_vmspace is dereferenced.
Remove labels and goto by explicitly tracking state.
Update comments.

Reviewed by:	alc, markj (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13693
2018-01-04 18:14:58 +00:00
alc
21db82b824 Once we have decided to swap out a process, don't delay the laundering of
its per-thread kernel stack pages by making them pass through the inactive
queue first.  Instead, immediately place them in the laundry so that they
might be cleaned and made available for reclamation sooner.

Reviewed by:	kib, markj
MFC after:	1 week
2018-01-04 03:16:32 +00:00
jeff
c17fd15c00 Fix the ARC after r326347 broke various memory limit queries. Use UMA features
rather than kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR:		224330 (partial), 224080
Reviewed by:	markj, avg
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13494
2018-01-02 04:35:56 +00:00
kib
0312e7bc4d Do not let vm_daemon run unbounded.
On a load where a single anonymous object consumes almost all memory
on a large system, the swapout code executes the iteration over the
corresponding object's page queue for a long time, owning the map and
object locks.  This blocks the pagedaemon, which tries to lock the
object, and blocks other threads in the process in vm_fault() waiting
for the map lock.

Handle the issue by terminating the deactivation loop if we have
executed for too long and by yielding at the top level in vm_daemon.

Reported by:	peterj, pho
Reviewed by:	alc
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13671
2018-01-01 19:27:33 +00:00
alc
c4f7f60d06 The variable "minslptime" is pointless and always has been, ever since its
introduction in r83366.  (At that time, this code appeared in vm/vm_glue.c,
because vm/vm_swapout.c did not exist.)  When the FOREACH_THREAD loop
completes, we know that the sleep time for every thread is above whichever
threshold is being applied.

Reviewed by:	kib
X-MFC with:	r327354
2017-12-31 21:36:42 +00:00
alc
a29455420d Previously, swap_pager_copy() freed swap blocks one at a time, via
swp_pager_meta_ctl(), with no opportunity to recognize freeing of
consecutive blocks and free fewer block ranges.  To open that opportunity,
this change removes the SWM_FREE option from swp_pager_meta_ctl(), and
compels the caller to do the freeing when a valid block address is returned.
In swap_pager_copy(), these frees are aggregated, so that a sequence of them
can be done at one time.

The only other caller to swp_pager_meta_ctl() that passed SWM_FREE,
swp_pager_unswapped(), is also modified to handle its single free
explicitly.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13290
2017-12-31 04:01:47 +00:00
kib
4c4b40eeeb Do not lock vm map in swapout_procs().
Neither swapout_procs() nor swapout() access the map.  Since the
process' vmspace is referenced only to obtain the pointer to the
vm_map, the reference is not needed either.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13681
2017-12-29 20:33:56 +00:00
kib
a1b116566d Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13678
2017-12-29 19:05:07 +00:00
alc
51e141cad5 After r327168, the variable "vm_pageout_wanted" can be static.
MFC after:	2 weeks
2017-12-29 17:02:22 +00:00
kib
c8236b191f Clean up the comment.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13671
2017-12-28 23:50:21 +00:00
kib
d50d0b2935 In vm_swapout_map_deactivate_pages(), it is enough to lock the map for read.
Reviewed by:	alc, markj (as part of the larger patch)
Tested by:	pho (again, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13671
2017-12-28 22:56:30 +00:00
alc
e5d6749a39 Refactor vm_map_find(), creating a separate function, vm_map_alignspace(),
for finding aligned free space in the given map.  With this change, we
always return KERN_NO_SPACE when we fail to find free space.  Whereas,
previously, we might return KERN_INVALID_ADDRESS.  Also, with this change,
we explicitly check for address wrap, rather than relying upon the map's
min and max addresses to establish sentinel-like regions.

This refactoring was inspired by the problem that we addressed in r326098.

Reviewed by:	kib
Tested by:	pho
Discussed with:	markj
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D13346
2017-12-26 17:59:37 +00:00
markj
391d859963 Ensure that pass > 0 when starting a scan with vm_pages_needed == 1.
Otherwise the page daemon will not reclaim pages and thus will not
wake threads sleeping in VM_WAIT.

Reported and tested by:	pho
Reviewed by:	alc, kib
X-MFC with:	r327168
Differential Revision:	https://reviews.freebsd.org/D13640
2017-12-26 16:29:39 +00:00
alc
5037be343c Make the vm object bypass and collapse counters per CPU.
Requested by:	mjg
Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13611
2017-12-25 19:36:04 +00:00
markj
1bd48ad884 Fix two problems with the page daemon control loop.
Both issues caused the page daemon to erroneously go to sleep when
applications are consuming free pages at a high rate, leaving the
application threads blocked in VM_WAIT.

1) After completing an inactive queue scan, concurrent allocations may
   have prevented the page daemon from meeting the v_free_min threshold.
   In this case, the page daemon was going to sleep even when the
   inactive queue contained plenty of clean pages.
2) pagedaemon_wakeup() may be called without the free queues lock held.
   This can lead to a lost wakeup if a call occurs after the page daemon
   clears vm_pageout_wanted but before going to sleep.

Fix 1) by ensuring that we start a new inactive queue scan immediately
if v_free_count < v_free_min after a prior scan.

Fix 2) by adding a new subroutine, pagedaemon_wait(), called from
vm_wait() and vm_waitpfault(). It wakes up the page daemon if either
vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps
on v_free_count.

Reported by:	jeff
Reviewed by:	alc
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D13424
2017-12-24 19:45:16 +00:00
kib
d84b266e99 Perform all accesses to uma_reclaim_needed using atomic(9) KPI.
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 10:06:55 +00:00
markj
35e385af34 Use a dedicated counter for inactive queue scans.
The laundry thread keeps track of the number of inactive queue scans
performed by the page daemon, and was previously using the v_pdwakeups
counter to count them. However, in some cases the inactive queue may
be scanned multiple times after a single wakeup, so it's more accurate
to use a dedicated counter.

Reviewed by:	alc, kib (previous version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13422
2017-12-11 15:33:24 +00:00
markj
3479530814 Fix the act_scan_laundry_weight mechanism.
r292392 modified the active queue scan to weigh clean pages differently
from dirty pages when attempting to meet the inactive queue target. When
r306706 was merged into the PQ_LAUNDRY branch, this mechanism was
broken. Fix it by scaling the correct page shortage variable.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13423
2017-12-09 15:47:26 +00:00
markj
90ce7fadcd Fix the UMA reclaim worker after r326347.
atomic_set_*() sets a bit in the target memory location, so
atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like
it does.

PR:		224080
Reviewed by:	jeff, kib
Differential Revision:	https://reviews.freebsd.org/D13412
2017-12-07 19:38:09 +00:00
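
A small C11-atomics model of the bug class being fixed; FreeBSD's
atomic_set_int() is a bitwise OR, so "setting" a flag to zero with it
changes nothing:

    #include <stdatomic.h>

    static _Atomic unsigned int reclaim_needed;

    static void
    clear_flag_wrong(void)
    {
        atomic_fetch_or(&reclaim_needed, 0);    /* OR with 0: a no-op */
    }

    static void
    clear_flag_right(void)
    {
        atomic_store(&reclaim_needed, 0);    /* an actual store of zero */
    }
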
markj
01b67530d5 Use unique wait messages in the page daemon control loop.
Discussed with:	alc
MFC after:	1 week
2017-12-06 18:36:54 +00:00
andrew
c4fd89c6ba Print the correct value when freelist is out of range.
Sponsored by:	DARPA, AFRL
2017-12-04 11:16:51 +00:00
mizhka
20c3a101da [mips] [vm] restore translation of freelist to flind for page allocation
Commit r326346 moved domain iterators from the physical layer to the
vm_page one, but it also removed the translation of freelist to flind
for the vm_page_alloc_freelist() call.  Before, it expected a
VM_FREELIST_ parameter; after, it expects a freelist index.

On small WiFi boxes with a few megabytes of RAM, there is only one
freelist, VM_FREELIST_LOWMEM (1), and there is no VM_FREELIST_DEFAULT
(0) (see sys/mips/include/vmparam.h).  This results in freelist 1
having flind 0.

First, this commit renames flind to freelist in vm_page_alloc_freelist
to avoid misunderstanding about the input parameter.  Then, in the
physical layer, it restores the translation for correct handling of
the freelist parameter.

Reported by:	landonf
Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D13351
2017-12-04 08:08:55 +00:00
kib
2f5704ab57 Add comment for vm_map_find_min().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
X-Differential revision:	https://reviews.freebsd.org/D13155
2017-12-01 10:53:08 +00:00
pfg
155122ce53 SPDX: Consider code from Carnegie-Mellon University.
Interesting cases, most likely from CMU Mach sources.
2017-11-30 15:48:35 +00:00
pfg
8aa269d57f SPDX: wrong license. 2017-11-30 15:45:42 +00:00
markj
40d9c0da1e Verify the object/vnode association after vget() in vm_pageout_clean().
It's theoretically possible for the vnode and object to be disassociated
while locks are dropped around the vget() call, in which case we
shouldn't proceed with laundering.

Noted and reviewed by:	kib
MFC after:	1 week
2017-11-29 19:47:09 +00:00
markj
1643870966 Remove some comments that became incorrect with r325530. 2017-11-29 14:34:05 +00:00
jeff
990ca74cdc Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.
The arena argument to kmem_*() is now only used in an assert.  A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed
by UMA.  When the soft limit is exceeded we periodically wake up the
UMA reclaim thread to attempt to shrink KVA.  On 32bit architectures
this should behave much more gracefully as we exhaust KVA.  On 64bit
the limits are likely never hit.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13187
2017-11-28 23:40:54 +00:00
jeff
f93de233c6 Move domain iterators into the page layer where domain selection should take
place.  This makes the majority of the phys layer explicitly domain specific.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix & Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13014
2017-11-28 23:18:35 +00:00
alc
c123c7433b When the swap pager allocates space on disk, it requests contiguous
blocks in a single call to blist_alloc().  However, when it frees
that space, it previously called blist_free() on each block, one at a
time.  With this change, the swap pager identifies ranges of
contiguous blocks to be freed, and calls blist_free() once per
range.  In one extreme case, that is described in the review, the time
to perform an munmap(2) was reduced by 55%.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12397
2017-11-28 17:46:03 +00:00
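
A minimal sketch of the run coalescing described above, over a sorted
array of block numbers; free_range() is a hypothetical stand-in for
blist_free():

    #include <stddef.h>
    #include <stdint.h>

    void free_range(uint64_t blk, uint64_t count);    /* hypothetical */

    /* Coalesce runs of consecutive blocks and issue one free per run
     * instead of one per block. */
    static void
    free_coalesced(const uint64_t *blk, size_t n)
    {
        size_t i = 0;

        while (i < n) {
            size_t j = i + 1;

            while (j < n && blk[j] == blk[j - 1] + 1)
                j++;                          /* extend the run */
            free_range(blk[i], j - i);        /* one call per run */
            i = j;
        }
    }
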
markj
8d4225b0be Avoid unnecessary lookups when initializing the vm_page array.
This gives a marginal improvement in the vm_page_array initialization
time. Also garbage-collect the now-unused vm_phys_paddr_to_segind().

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13270
2017-11-27 17:46:38 +00:00
pfg
78a6b08618 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use the BSD 2-Clause license; however, the
tool I was using misidentified many licenses, so this was mostly a
manual, error-prone task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses.  We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
markj
9bf667025b Move vm_phys_init_page() to vm_page.c.
Suggested by:	kib
Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13250
2017-11-26 19:17:55 +00:00
markj
963bf8102f Remove unneeded initializations from vm_phys_init_page().
The page allocator always initializes the aflags and oflags fields.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13242
2017-11-26 19:16:45 +00:00
kib
1c96478ead Return a different error code for the guard page layout violation.
On a KERN_NO_SPACE error, as it is returned now, vm_map_find()
continues the loop searching for a suitable range for the requested
mapping with the specified alignment.  Since vm_map_findspace()
successfully finds the same place, the loop never ends.

The errors returned from vm_map_stack() now completely repeat the
behavior of vm_map_insert(), as suggested by Alan.

Reported by:	Arto Pekkanen <aksyom@gmail.com>
PR:	223732
Reviewed by:	alc, markj
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D13186
2017-11-22 16:45:27 +00:00
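
A sketch of why the error code matters here; the constants and helpers
are illustrative, not the vm_map implementation:

    #include <stdint.h>

    #define KERN_SUCCESS_SK  0
    #define KERN_NO_SPACE_SK 3    /* illustrative values */

    int findspace(uint64_t *addr);    /* hypothetical */
    int try_insert(uint64_t addr);    /* hypothetical: may hit a guard */

    /* The loop retries only on KERN_NO_SPACE.  If a guard violation
     * also returns KERN_NO_SPACE while findspace() keeps proposing
     * the same address, the loop never terminates; any distinct error
     * code breaks it. */
    static int
    map_find_sk(uint64_t *addr)
    {
        int rv;

        for (;;) {
            if (findspace(addr) != KERN_SUCCESS_SK)
                return (KERN_NO_SPACE_SK);
            rv = try_insert(*addr);
            if (rv != KERN_NO_SPACE_SK)
                return (rv);    /* success or a distinct failure */
        }
    }
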
alc
d21a0d7e26 When vm_map_find(find_space = VMFS_OPTIMAL_SPACE) fails to find space, a
second scan of the address space with find_space = VMFS_ANY_SPACE is
performed.  Previously, vm_map_find() released and reacquired the map lock
between the first and second scans.  However, there is no compelling
reason to do so.  This revision modifies vm_map_find() to retain the map
lock.

Reviewed by:	jhb, kib, markj
MFC after:	1 week
X-Differential Revision:	https://reviews.freebsd.org/D13155
2017-11-22 16:39:24 +00:00
markj
f761f23093 Allow for fictitious physical pages in vm_page_scan_contig().
Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to
satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter
fictitious pages.  For now, allow for this possibility; such pages will
be skipped later in the scan since they are wired.

Reported by:	avg
Reviewed by:	kib
MFC after:	1 week
2017-11-21 13:17:40 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses.  We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
pfg
9da7bdde06 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses.  We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
kib
ea07ce4ed0 vmtotal: extend memory counters to accommodate current and future
hardware sizes.

32bit counters already overflow on approachable virtual memory page
counts, and soon would overflow on the physical pages counts as well.
Bump sizes to 64bit types.  Bump __FreeBSD_version.

It is impossible to provide perfect backward ABI compat for this
change.  If a program requests an old structure, it can be detected by
size.  But if it queries the size first by passing a NULL old req
pointer, there is almost nothing we can do to detect the desired ABI.
As a partial solution, check p_osrel of the querying process when
selecting the size to report.

Submitted by:	Pawel Biernacki <pawel.biernacki@gmail.com>
Differential revision:	https://reviews.freebsd.org/D13018
2017-11-15 13:41:03 +00:00
kib
c05cd25802 Fix operator precedence.
Sponsored by:	The FreeBSD Foundation
2017-11-08 23:25:05 +00:00
markj
32541ba3eb Allow various page daemon parameters to be set from loader.conf.
MFC after:	1 week
2017-11-08 19:55:17 +00:00
jeff
3c355d849c Replace many instances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated.  This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons.  This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by:	alc, kib, markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2017-11-08 02:39:37 +00:00
markj
59bc046828 Correct the type of foff.
No functional change intended.

Github PR:	124
Submitted by:	Wuyang Chung <wuyang.m.chung@outlook.com>
MFC after:	1 week
2017-11-08 01:53:03 +00:00
alc
1728bf4480 Micro-optimize the handling of fictitious pages in vm_page_free_prep().
A fictitious page is always wired, so there is no point in trying to
remove one from the page queues.

Completely remove one inaccurate comment from vm_page_free_prep() and
correct another.

Reviewed by:	kib, markj
MFC after:	1 week
2017-10-24 17:14:53 +00:00
trasz
b1c2085f71 Add OID for the vm.overcommit sysctl. This makes it possible to remove
one call to sysctl(2) from jemalloc startup code. (That also requires
changes to jemalloc, but I plan to push those to upstream first.)

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D12745
2017-10-22 10:35:29 +00:00
kib
09c8e869a8 Check that a page which is freed as zeroed indeed has all-zero content.
This catches some rare mysterious failures at the source.  The check
is only performed on architectures which implement direct map, and
only enabled with option DIAGNOSTIC, similar to other costly
consistency checks.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-10-21 17:28:12 +00:00
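
A minimal sketch of such an all-zero verification; in the kernel the
page is inspected through the direct map:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Walk the page word by word and verify it really is zero. */
    static bool
    page_is_zeroed(const void *va, size_t page_size)
    {
        const uint64_t *p = va;

        for (size_t i = 0; i < page_size / sizeof(*p); i++)
            if (p[i] != 0)
                return (false);
        return (true);
    }
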
markj
042c6a8f6c Free the right address range if kmem_back() fails in memguard_alloc().
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-10-20 21:13:19 +00:00
kib
3b0b7e2626 Take the vm object lock in read mode in vnode_generic_putpages().
Only upgrade it to write mode if we need to clear dirty bits of the
partially valid page after EOF.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2017-10-20 18:40:29 +00:00
kib
20d037559d Move swapout code into vm/vm_swapout.c.
There is no NO_SWAPPING #ifdef left in the code.

Requested by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D12663
2017-10-20 09:10:49 +00:00
kib
73e703b999 Do not overwrite clean blocks on pageout.
If filesystem block size is less than the page size, it is possible
that the page-out run contains partially clean pages.  E.g., the chunk
of the page might be bdwrite()-ed, or some thread performed bwrite()
on a buffer which references a chunk of the paged-out page.  As a
result, the assertion added in r319975, which checked that all pages
in the run are dirty, does not hold on such filesystems.

One solution is to remove the assert, but it is undesirable, because
we would overwrite valid on-disk content.  I cannot provide a scenario
where such a write would corrupt the file data, but I do not like it
on principle.  Another, in my opinion proper, solution is to only
write the parts of the page still marked dirty.  The patch implements
this; it skips clean blocks and only writes the dirty block runs.

Note that due to clustering, writing one page might clean other pages
in the run, so the next write range must be calculated only after the
current range is written out.

Moreover, due to a possible invalidation, and the fact that the
object lock is dropped and reacquired before the checks, it is
possible that the whole page-out run appears to consist of only clean
pages.  For this reason, it is impossible to assert that there is some
work for the pageout method to do (i.e. assert that there is at least
one dirty page in the run).  But such clearing can only occur due to
an invalidation, and not due to a parallel write, because we own the
vnode lock exclusively.

Reported by:	fsu
In collaboration with:	pho
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D12668
2017-10-20 08:32:37 +00:00
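
A small sketch of the dirty-run splitting, with one dirty bit per disk
block and at most 32 blocks per page; write_blocks() is a hypothetical
stand-in, and as the message notes, the real code recomputes the next
range only after the current one is written out:

    #include <stdint.h>

    void write_blocks(int first, int count);    /* hypothetical writer */

    static void
    write_dirty_runs(uint32_t dirty, int nblocks)    /* nblocks <= 32 */
    {
        int i = 0;

        while (i < nblocks) {
            while (i < nblocks && (dirty & (1u << i)) == 0)
                i++;    /* skip a clean run */
            int first = i;
            while (i < nblocks && (dirty & (1u << i)) != 0)
                i++;    /* extend the dirty run */
            if (i > first)
                write_blocks(first, i - first);
        }
    }
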
kib
9a4db25ceb In vm_page_free_phys_pglist(), do not take vm_page_queue_free_mtx if
there is nothing to do.

Suggested by:	mjg
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-20 08:25:49 +00:00
alc
6dd81762df Batch atomic updates to the number of active, inactive, and laundry
pages by vm_object_terminate_pages().  For example, for a "buildworld"
workload, this batching reduces vm_object_terminate_pages()'s average
execution time by 12%.  (The total savings were about 11.7 billion
processor cycles.)

Reviewed by:	kib
MFC after:	1 week
2017-10-19 04:13:47 +00:00
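
A rough model of the batching with stand-in queue codes and C11
atomics; the point is a single atomic update per counter rather than
one per page:

    #include <stdatomic.h>

    static _Atomic long v_active_sk, v_inactive_sk, v_laundry_sk;

    /* Tally queue membership locally while walking the terminating
     * object's pages, then publish each total once. */
    static void
    account_queues(const int *queue, int npages)
    {
        long act = 0, inact = 0, laund = 0;

        for (int i = 0; i < npages; i++) {
            if (queue[i] == 0)          /* 0: PQ_ACTIVE stand-in */
                act++;
            else if (queue[i] == 1)     /* 1: PQ_INACTIVE stand-in */
                inact++;
            else if (queue[i] == 2)     /* 2: PQ_LAUNDRY stand-in */
                laund++;
        }
        atomic_fetch_sub(&v_active_sk, act);
        atomic_fetch_sub(&v_inactive_sk, inact);
        atomic_fetch_sub(&v_laundry_sk, laund);
    }
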
kib
b90468ae70 Do not report a reduction of the swap zone if it was not reduced.
After r324600 we see the actual reservation.

Reported by:	jkim
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-18 07:27:43 +00:00
mjg
03b10745ce Reduce traffic on vm_cnt.v_free_count
The variable is modified with the highly contended page free queue lock.
It unnecessarily shares a cacheline with purely read-only fields and is
re-read after the lock is dropped in the page allocation code making the
hold time longer.

Pad the variable just like the others and store the value as found with
the lock held instead of re-reading.

Provides a modest 1%-ish speed up in concurrent page faults.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D12665
2017-10-13 21:54:34 +00:00
kib
7f82035277 Evaluate the real size of the sblk_zone.
Submitted by:	ota@j.email.ne.jp
PR:	221356
Reviewed by:	alc, markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D12660
2017-10-13 16:23:05 +00:00
emaste
951751c6c6 ANSIfy vm_kern.c
PR:		222673
Submitted by:	ota@j.email.ne.jp
MFC after:	1 week
2017-10-13 13:53:19 +00:00
alc
1615de1f68 Replace an unnecessary call to vm_page_activate() by an assertion that
the page is already wired or queued.  Prior to the elimination of PG_CACHED
pages, vm_page_grab() might have returned a valid, previously PG_CACHED
page, in which case enqueueing the page was necessary.  Now, that can't
happen.  Moreover, activating the page is a dubious choice, since the page
is not being accessed.

Reviewed by:	kib
MFC after:	1 week
2017-10-08 16:54:42 +00:00
alc
b88a843faf When an I/O error occurs on page out, there is no need to dirty the page,
because it is already dirty.  Instead, assert that the page is dirty.

Reviewed by:	kib, markj
MFC after:	1 week
2017-10-01 17:04:26 +00:00
alc
7ad59282da Optimize vm_object_page_remove() by eliminating pointless calls to
pmap_remove_all().  If the object to which a page belongs has no
references, then that page cannot possibly be mapped.

Reviewed by:	kib
MFC after:	1 week
2017-09-28 17:55:41 +00:00
jhb
13b1e2684d Add UMA_ALIGNOF().
This is a wrapper around _Alignof() that sets the alignment for a zone
to the alignment required by a given type.  This allows the compiler to
determine the proper alignment rather than having the programmer try to
guess.

Discussed on:	arch@
MFC after:	1 week
Sponsored by:	DARPA / AFRL
2017-09-27 23:15:33 +00:00
alc
ec44341cad Change vm_page_try_to_free() to require a managed page. Essentially,
vm_page_try_to_free() is testing conditions, like clean versus dirty,
that only vary in managed pages.

Suggested by:	kib
Reviewed by:	markj
X-MFC after:	never
2017-09-24 23:35:01 +00:00
alc
07fc017373 Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all()
can be avoided when the page's containing object has a reference count of
zero.  (If the object has a reference count of zero, then none of its pages
can possibly be mapped.)

Address nearby style issues in vm_page_try_to_free(), and change its
return type to "bool".

Reviewed by:	kib, markj
MFC after:	1 week
2017-09-24 16:50:10 +00:00
kib
23d65de60e For unlinked files, do not msync(2) or sync on the vnode deactivation.
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of dirty memory in the system, but I do
not think that there are users of msync(2) that utilize it for such
a side-effect.
side-effect.

Reported and tested by:	tjil
PR:	222356
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12411
2017-09-19 16:46:37 +00:00
kib
4c4cbd798b Batch freeing of the pages in vm_object_page_remove() under a single
free queue mutex acquisition, the same as was done for object
termination in r323561.

Reported and tested by:	mjg
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-15 16:07:09 +00:00
markj
d0e4a8e28e Include _bitset.h to get BITSET_DEFINE, used to define struct slabbits.
MFC after:	1 week
2017-09-15 14:59:35 +00:00
markj
7d8d6899fc Widen uk_pgoff, the slab header offset field.
16 bits is only wide enough for kegs with an item size of up to 64KB.
At that size or larger, slab headers are typically offpage because the
item size is a multiple of the page size, but there is no requirement
that this be the case.

We can widen the field without affecting the layout of struct uma_keg
since the removal of uk_slabsize in r315077 left an adjacent hole.

PR:		218911
MFC after:	2 weeks
2017-09-13 21:54:37 +00:00
kib
32b600317c Remove the inline specifier from vm_page_free_wakeup(); do not
micro-manage the compiler.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-13 19:30:09 +00:00
kib
9c0adbf36e Do not relock the free queue mutex for each page; free the whole
terminating object's page queue under a single mutex acquisition.

First, all pages on the queue are prepared for free by calls to
vm_page_free_prep(), and pages which should not be returned to the
physical allocator (e.g. wired or fictitious) are simply removed from
the queue.  On the second pass, vm_page_free_phys_pglist() inserts all
pages from the queue without relocking the mutex.

The change improves object termination, e.g. on process exit, where
large anonymous memory objects otherwise cause the free queue mutex
to be relocked for each page.  Moreover, if several such processes
are exiting or execing in parallel, the mutex was highly contended
during address space demolition.

Diagnosed and tested by:	mjg (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-13 19:22:07 +00:00
kib
b938027f01 Split vm_page_free_toq() into two parts, preparation vm_page_free_prep()
and insertion into the phys allocator free queues vm_page_free_phys().
Also provide a wrapper vm_page_free_phys_pglist() for batched free.

Reviewed by:	alc, markj
Tested by:	mjg (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-13 19:11:52 +00:00
kib
7e40933eb7 Use the existing tag name for the vm_object's memq.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-13 19:03:59 +00:00
markj
60fea4eba0 Fix a logic error in the item size calculation for internal UMA zones.
Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.

Reported and tested by:	ae, rakuco
Reviewed by:	avg
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12342
2017-09-13 15:44:54 +00:00
mjg
7495dfa6d7 Move vmmeter atomic counters into dedicated cache lines
Prior to the change they were subject to extreme false sharing.
In particular, this change shaves about 3 seconds of real time off a
-j 80 buildkernel.

Reviewed by:	alc, markj
Differential Revision:	https://reviews.freebsd.org/D12281
2017-09-10 19:00:38 +00:00
alc
e7430120e9 To analyze the allocation of swap blocks by blist functions, add a method
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks.  The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers.  The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.

The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11906
2017-09-10 17:46:03 +00:00
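
A minimal sketch of the lone-bit computation mentioned above; where
the compiler provides a count-trailing-zeros builtin it replaces the
binary search:

    #include <stdint.h>

    /* Position of the single set bit in a one-hot word (mask != 0). */
    static int
    lone_bit_index(uint64_t mask)
    {
    #if defined(__GNUC__) || defined(__clang__)
        return (__builtin_ctzll(mask));    /* usually one instruction */
    #else
        int i = 0;

        while ((mask & 1) == 0) {    /* portable fallback */
            mask >>= 1;
            i++;
        }
        return (i);
    #endif
    }
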
kib
d27f2b11c2 Add a vm_page_change_lock() helper, the common code to avoid relocking
the page lock if both the old and the new page use the same underlying
lock.  Convert existing places to use the helper instead of inlining
it.  Use the optimization in vm_object_page_remove().

Suggested and reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-09 17:35:19 +00:00
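
A small model of the helper's logic, using pthreads as a stand-in for
the kernel's pa-indexed page locks:

    #include <pthread.h>

    /* Hold at most one page lock at a time, and skip the unlock/lock
     * pair when the next page maps to the same underlying lock. */
    static void
    change_lock(pthread_mutex_t **heldp, pthread_mutex_t *new_lock)
    {
        if (*heldp == new_lock)
            return;    /* same lock: nothing to do */
        if (*heldp != NULL)
            pthread_mutex_unlock(*heldp);
        if (new_lock != NULL)
            pthread_mutex_lock(new_lock);
        *heldp = new_lock;
    }
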
markj
1db0a22947 Speed up vm_page_array initialization.
We currently initialize the vm_page array in three passes: one to zero
the array, one to initialize the "order" field of each page (necessary
when inserting them into the vm_phys buddy allocator one-by-one), and
one to initialize the remaining non-zero fields and individually insert
each page into the allocator.

Merge the three passes into one following a suggestion from alc:
initialize vm_page fields in a single pass, and use vm_phys_free_contig()
to efficiently insert physical memory segments into the buddy allocator.
This reduces the initialization time to a third or a quarter of what it
was before on most systems that I tested.

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D12248
2017-09-07 21:43:39 +00:00
mjg
4cc87bd651 Start annotating global _padalign locks with __exclusive_cache_line
While these locks are guaranteed to not share their respective cache
lines, their current placement leaves unnecessary holes in the lines
which precede them.

For instance, the annotation of vm_page_queue_free_mtx allows two
neighbouring cache lines (previously separated by the lock) to be
collapsed into one.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.

MFC after:	1 week
2017-09-06 20:28:18 +00:00
kib
6fc3cacfc6 Do not leak empty swblk.
In swp_pager_meta_build(), if the requested operation results in
freeing the last swap pointer in the swblk, free the trie node.  Other
swap pager code does not expect to find a completely empty swblk.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-06 16:18:53 +00:00
kib
100162ab54 In swp_pager_meta_build(), handle a race with other thread allocating
swapblk for our index while we dropped the object lock.

Noted by:	jeff
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-06 16:16:11 +00:00
kib
13aa7b3b56 Adjust interface of swapon_check_swzone() to its actual usage.
The function's return value is not used.  Its argument is always
swap_total/PAGE_SIZE, so make it take no arguments.

Submitted by:	ota@j.email.ne.jp
PR:	221356
MFC after:	1 week
2017-08-30 10:17:00 +00:00
kib
9fa4cbc3e4 Make the swap_pager_full variable static.
r290920 removed the use of the variable from vm/vm_pageout.c.

Submitted by:	ota@j.email.ne.jp
PR:	221356
MFC after:	1 week
2017-08-30 09:44:05 +00:00
markj
f3e8cdafb5 Synchronize page laundering with pmap_extract_and_hold().
Before r207410, the hold count of a page in a page queue was protected
by the queue lock, and, before laundering a page, the page daemon
removed managed writeable mappings of the page before releasing the
queue lock. This ensured that other threads could not concurrently
create transient writeable mappings using pmap_extract_and_hold() on a
user map, as is done for example by vmapbuf(). With that revision,
however, a race can allow the creation of such a mapping, meaning that
the page might be modified as it is being laundered, potentially
resulting in it being marked clean when its contents do not match
those given to the pager. Close the race by using the page lock to
synchronize the hold count check in vm_pageout_cluster() with the
removal of writeable managed mappings.

Reported by:	alc
Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12084
2017-08-28 22:10:15 +00:00
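In sketch form, the laundering path now performs the hold-count check and
the write revocation atomically with respect to pmap_extract_and_hold(),
which takes the same page lock:

	vm_page_lock(m);
	if (m->hold_count != 0) {
		/* A transient hold may carry a writeable mapping. */
		vm_page_unlock(m);
		return (0);		/* skip laundering this page */
	}
	/*
	 * No new hold can slip in: pmap_extract_and_hold()
	 * synchronizes on the same page lock.
	 */
	pmap_remove_write(m);
	vm_page_unlock(m);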
alc
a492a3f631 Update a couple vm_object lock assertions in the swap pager to reflect the
new use of the vm_object's lock to synchronize updates to a radix trie
mapping per-vm object page indices to on-disk swap blocks.

Fix a typo in a nearby comment.

Reviewed by:	kib, markj
X-MFC with:	r322913
Differential Revision:	https://reviews.freebsd.org/D12134
2017-08-28 17:02:25 +00:00
alc
4ab94b03c3 Switching from a global hash table to per-vm_object radix tries for mapping
vm_object page indices to on-disk swap space (r322913) has changed the
synchronization requirements for a couple swap pager functions.  Whereas
before a read lock on the vm object sufficed because of the global mutex
on the hash table, a write lock on the vm object may now be required.  In
particular, calls to vm_pager_page_unswapped() now require a write lock on
the vm_object.  Consequently, vm_fault()'s fast path cannot call
vm_pager_page_unswapped().  The swap space will have to be released at a
later point.

Reviewed by:	kib, markj
X-MFC with:	r322913
Differential Revision:	https://reviews.freebsd.org/D12134
2017-08-28 16:55:43 +00:00
kib
5ed2d3f017 Replace global swhash in swap pager with per-object trie to track swap
blocks assigned to the object pages.

- The global swhash_mtx is removed, trie is synchronized by the
  corresponding object lock.
- The swp_pager_meta_free_all() function used during object
  termination is optimized by only looking at the trie instead of
  having to search whole hash for the swap blocks owned by the object.
- On swap_pager_swapoff(), instead of iterating over the swhash,
  the global object list has to be inspected.  There, we have to ensure
  that we do see valid trie content if we see that the object type is
  swap.
Sizing of the swblk zone is the same as for the swblock zone; each swblk
maps SWAP_META_PAGES pages.

Proposed by:	alc
Reviewed by:	alc, markj (previous version)
Tested by:	alc, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D11435
2017-08-25 23:13:21 +00:00
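The per-object bookkeeping boils down to a small fixed-size trie node keyed
by page index, roughly (field names are illustrative):

	struct swblk {
		vm_pindex_t	p;	/* base page index of this run */
		/* Swap block per page, or SWAPBLK_NONE if unassigned. */
		daddr_t		d[SWAP_META_PAGES];
	};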
br
bc66f23e16 Add OBJ_PG_DTOR flag to VM object.
Setting this flag allows us to skip page removal from the VM object queue
during object termination and to leave that to the cdev_pg_dtor function.

Move the page removal code to a separate function, vm_object_terminate_pages(),
as the comments would not survive indentation.

This will be required for Intel SGX support where we will have to remove
pages from VM object manually.

Reviewed by:	kib, alc
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11688
2017-08-16 08:49:11 +00:00
markj
91777b5ef2 Add vm_page_alloc_after().
This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with the largest index that is smaller than the
requested index.  vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.

Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().

Suggested by:	alc
Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11984
2017-08-15 16:39:49 +00:00
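A sketch of the intended calling pattern: look up the predecessor once, then
carry it through the loop so every subsequent allocation skips the radix
lookup (assuming no page is resident in the allocated range):

	mpred = vm_radix_lookup_le(&object->rtree, pindex);
	for (i = 0; i < count; i++) {
		m = vm_page_alloc_after(object, pindex + i, req, mpred);
		if (m == NULL)
			break;		/* allocation failed; caller recovers */
		mpred = m;		/* the new page is the next predecessor */
	}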
markj
6f4724899b Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.
This will allow its use in sendfile_swapin().

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11942
2017-08-11 16:29:22 +00:00
markj
865cb1c60b Micro-optimize kmem_unback().
We can remove some unnecessary object radix tree lookups by using the
object memq to iterate over pages in the specified range. This does not,
however, eliminate the lookup needed in vm_page_free_toq() to remove each
tree entry.

Reviewed by:	alc, kib (previous revision)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11945
2017-08-11 03:09:11 +00:00
markj
a65dc108f4 Make vm_page_sunbusy() assert that the page is unlocked.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11946
2017-08-10 22:43:38 +00:00
alc
318304a5b7 Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices.  Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by:	kib, markj
Tested by:	pho (an earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11926
2017-08-09 04:23:04 +00:00
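For illustration, the transformation in a typical caller (sketch):

	/* Before: one grab, and thus one radix lookup, per index. */
	for (i = 0; i < count; i++)
		ma[i] = vm_page_grab(object, pindex + i, allocflags);

	/* After: one batched call; interior lookups become vm_page_next(). */
	vm_page_grab_pages(object, pindex, allocflags, ma, count);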
kib
8fd2052d87 Mark pages after EOF as clean after pageout.
Suppose that a file on NFS has a partially filled last page, and this
page is dirty.  The NFS VOP_PAGEOUT() method only marks the page clean
up to the block of the last written byte, leaving other blocks dirty.
Also, any page which erroneously exists in the vnode vm_object past EOF
is left marked as dirty.

With the introduction of the buf-cache coherent pager, each pass of the
syncer over the object with such a page results in the creation of a
B_DELWRI buffer due to the VOP_WRITE() call.  This buffer is noted on
the next syncer pass, which results in, e.g., a visible manifestation
of shutdown never finishing the vnode sync.  Note that before the
buf-cache coherency commit, a dirty page might be left never synced to
the server if partial writes occurred.

Fix this by clearing dirty bits after EOF.  Only blocks of the partial
page which are completely after EOF are marked clean, to avoid
possible user data loss.

Reported by:	mav
Reviewed by:	alc, markj
Tested by:	mav, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11697
2017-07-26 20:07:05 +00:00
alc
65ce87ce40 Address a compilation warning on some architectures that was introduced
by the previous change, r321386.

Reported by:	ian
MFC after:	10 days
X-MFC after:	r321386
2017-07-23 19:35:14 +00:00
alc
1179a1717e Utilize pmap_enter(..., psind=1) in vm_fault_soft_fast() on amd64. (The
Differential Revision discusses the benefits of this change.)

Add a function, vm_reserv_to_superpage(), that returns the superpage
containing the specified base page.

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	10 days
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 16:28:13 +00:00
alc
621c14d3f9 Add support for pmap_enter(..., psind=1) to the amd64 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping.  (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.

Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).

Submitted by:	Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by:	kib, markj
Tested by:	pho
MFC after:	10 days
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 06:33:58 +00:00
alc
dd41c931cf In vm_page_ps_test(), always check that the base pages within the specified
superpage all belong to the same object.  To date, that check has not been
needed, but upcoming changes require it.  (See the Differential Revision.)

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11556
2017-07-23 05:54:56 +00:00
kib
e38ce8ac29 Do not allocate struct kinfo_vmobject on stack.
Its size is 1184 bytes.

Noted by:	eugen
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-22 13:33:06 +00:00
br
3eebf9cc69 Fix style: change spaces to tabs.
Sponsored by:	DARPA, AFRL
2017-07-21 14:14:47 +00:00
kib
f16b72a2cc Add pctrie_init() and vm_radix_init() to initialize generic pctrie and
vm_radix trie.

Existing vm_radix_init() function is renamed to vm_radix_zinit().
Inlines moved out of the _ headers.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11661
2017-07-19 20:52:47 +00:00
kib
34e529e98b Disable stack growth when accessed by AIO daemons.
The commit message for r321173 incorrectly stated that the change
disables automatic stack growth from the AIO daemon contexts, with the
explanation that this prevents applying wrong resource limits.  Fix
this by actually disabling the growth.

Noted by:	alc
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-19 19:00:32 +00:00
kib
b62be2202a Remove unused function swap_pager_isswapped().
Noted by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-19 17:28:46 +00:00
kib
65f611547e Convert the assertion that only the vmspace owner grows the stack into
a check blocking the grow on accesses from other processes.

A debugger may access the stack grow area with ptrace(2).  In this
case, the real state of the process is to not have the stack grown,
which provides more accurate inspection.  The technical reason to avoid
the grow is to avoid applying the wrong (debugger's) stack limit.

This change also has the consequence of turning aio worker accesses
past the bottom of stacks into EFAULT; arguably the situation is a
programmer's mistake.

Reported by:	jhb
Discussed with:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-07-18 20:26:41 +00:00
alc
80e6bc7371 Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().

Reviewed by:	kib, markj
MFC after:	1 week
2017-07-14 02:15:48 +00:00
kib
d2c5c68142 Fix loop termination in vm_map_find_min().
Reported by:	antoine
Tested by:	Stefan Ehmann <shoesoft@gmx.net>,
       Jan Kokemueller <jan.kokemueller@gmail.com>
PR:	220493
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-07-09 15:41:49 +00:00
alc
7c0d38a1e3 Modify vm_map_growstack() to protect itself from the possibility of the
gap entry in the vm map being smaller than the sysctl-derived stack guard
size.  Otherwise, the value of max_grow can suffer from overflow, and the
roundup(grow_amount, sgrowsiz) will not be properly capped, resulting in
an assertion failure.

In collaboration with:	kib
MFC after:	3 days
2017-07-01 23:39:49 +00:00
alc
0b28a56ef7 Clear the MAP_WIREFUTURE flag on the vm map in exec_new_vmspace() when it
recycles the current vm space.  Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).

It's pointless for vm_map_stack() to check the MEMLOCK limit.  It will
never be asked to wire the stack.  Moreover, it doesn't even implement
wiring of the stack.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11421
2017-06-30 15:49:36 +00:00
kib
91d820a4f9 Treat the addr argument for mmap(2) request without MAP_FIXED flag as
a hint.

Right now, for non-fixed mmap(2) calls, addr is de-facto interpreted
as the absolute minimal address of the range where the mapping is
created.  The VA allocator only allocates in the range [addr,
VM_MAXUSER_ADDRESS].  This is too restrictive; the mmap(2) call might
unduly fail if there are no free addresses above addr but a lot of
usable space below it.

Lift this implementation limitation by allocating VA in two passes.
First, try to allocate above addr, as before.  If that fails, do the
second pass with less restrictive constraints for the start of
allocation by specifying minimal allocation address at the max bss
end, if this limit is less than addr.

One important case where this change makes a difference is the
allocation of the stacks for new threads in libthr.  Under some
configuration conditions, libthr tries to hint the kernel to reuse the
main thread stack grow area for the new stacks.  This cannot work by
design now that the grow area is converted to a stack, and there is no
unallocated VA above the main stack.  Interpreting the requested stack
base address as a hint provides compatibility with old libthr and
with (mis-)configured current libthr.

Reviewed by:	alc
Tested by:	dim (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-28 04:02:36 +00:00
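A sketch of the two-pass allocation, wrapped in a hypothetical helper; the
exact vm_map_find_min() argument list is abbreviated from memory and may
differ from the committed code:

	static int
	mmap_find_space(vm_map_t map, vm_offset_t *addr, vm_size_t len,
	    vm_offset_t hint, vm_offset_t bss_end)
	{
		int error;

		/* First pass: as before, only allocate at or above the hint. */
		*addr = hint;
		error = vm_map_find_min(map, NULL, 0, addr, len, hint,
		    VM_MAXUSER_ADDRESS, VMFS_OPTIMAL_SPACE, VM_PROT_ALL,
		    VM_PROT_ALL, 0);
		if (error == 0 || bss_end >= hint)
			return (error);

		/* Second pass: relax the lower bound to the max bss end. */
		*addr = bss_end;
		return (vm_map_find_min(map, NULL, 0, addr, len, bss_end,
		    VM_MAXUSER_ADDRESS, VMFS_OPTIMAL_SPACE, VM_PROT_ALL,
		    VM_PROT_ALL, 0));
	}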
kib
aa808958eb For now, allow mprotect(2) over the guards to succeed regardless of
the requested protection.

The syscall returns success without changing the protection of the
guard.  This is consistent with the current mprotect(2) behaviour on
the unmapped ranges.  More important, the calls performed by libc and
libthr to allow execution of stacks, if requested by the loaded ELF
objects, do the expected change instead of failing on the grow space
guard.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 23:16:37 +00:00
kib
a34566f9cc Correctly handle small MAP_STACK requests.
If mmap(2) is called with the MAP_STACK flag and a size which is
less than or equal to the initial stack mapping size plus guard, the
calculation of the mapping layout created a zero-sized guard.  The
attempt to create such an entry failed in vm_map_insert(), causing the
whole mmap(2) call to fail.

Fix it by adjusting the initial mapping size to have space for a
non-empty guard.  Reject MAP_STACK requests which are shorter than or
equal to the configured guard pages size.

Reported and tested by:	Manfred Antar <null@pozo.com>
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 20:06:05 +00:00
kib
fd9763bf72 Remove stale part of the comment.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 19:59:39 +00:00
kib
20f226eed3 Style.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 18:40:59 +00:00
alc
63fd4c37fb Increase the pageout cluster size to 32 pages.
Decouple the pageout cluster size from the size of the hash table entry
used by the swap pager for mapping (object, pindex) to a block on the
swap device(s), and keep the size of a hash table entry at its current
size.

Eliminate a pointless macro.

Reviewed by:	kib, markj (an earlier version)
MFC after:	4 weeks
Differential Revision:	https://reviews.freebsd.org/D11305
2017-06-24 17:10:33 +00:00
kib
34de63e92d Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range.  It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).

Use guards to reimplement the stack grow code.  Explicitly track the
stack grow area with the guard, including the stack guard page.  On
stack grow, a trivial shift of the guard map entry and stack map entry
limits makes the stack expansion.  Move the code to detect stack grow
and call vm_map_growstack(), from vm_fault() into vm_map_lookup().

As a result, it is impossible for a random mapping to occur in the
stack grow area, or to overlap the stack guard page.

Enable stack guard page by default.

Reviewed by:	alc, markj
Man page update reviewed by:	alc, bjk, emaste, markj, pho
Tested by:	pho, Qualys
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
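From userland, a two-stage reserve-then-commit allocator can use the flag
like this (sketch; reserve_size and commit_size are illustrative):

	#include <sys/mman.h>

	/* Stage 1: reserve address space; no pages can be instantiated. */
	char *res = mmap(NULL, reserve_size, PROT_NONE, MAP_GUARD, -1, 0);

	/* Stage 2: explicitly commit a prefix of the reservation. */
	char *p = mmap(res, commit_size, PROT_READ | PROT_WRITE,
	    MAP_FIXED | MAP_ANON, -1, 0);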
kib
eb2cba616a Do not try to unmark MAP_ENTRY_IN_TRANSITION marked by other thread.
The issue is caught by the "vm_map_wire: alien wire" KASSERT at the end
of vm_map_wire().  We currently check for the MAP_ENTRY_WIRE_SKIPPED
flag before ensuring that the wiring_thread is curthread.  For HOLESOK
wiring, this means that we might see a WIRE_SKIPPED entry from a
different wiring.

Fix it by only checking WIRE_SKIPPED if the entry was put
IN_TRANSITION by us.  Also fix a typo in the comment explaining the
situation.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-24 16:47:41 +00:00
kib
3677635c7b Call pmap_copy() only for map entries which have the backing object
instantiated.

Calling pmap_copy() on non-faulted anonymous memory entries is useless.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:54:28 +00:00
kib
1892b89561 Assert that the protection of a new map entry is a subset of the max
protection.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:51:30 +00:00
alc
742947fba6 Eliminate an unused macro.
MFC after:	3 days
2017-06-21 03:55:45 +00:00
kib
bdbf5f6e59 Ignore the P_SYSTEM process flag, and do not request
VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack.

System maps do not create auto-grown stacks.  Any stack we handled,
even for P_SYSTEM, must be for user address space.  P_SYSTEM processes
with mapped user space are either init(8) or aio workers attached to
another user process with an aio buffer pointing into the stack area.
In either case, VM_MAP_WIRE_USER mode should be used.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-19 20:40:59 +00:00
alc
3f2176677b Pages that are passed to swap_pager_putpages() should already be fully
dirty.  Assert that they are fully dirty rather than redundantly calling
vm_page_dirty() on them.

Reviewed by:	kib, markj
MFC after:	1 week
X-MFC after:	r319932
2017-06-17 03:05:25 +00:00
kib
7dae9de7e8 Some minor improvements to vnode_pager_generic_putpages().
- Add asserts that the pages to write are dirty.  The last page, if
  partially written, is only required to be dirty, while completely
  written pages should have all dirty bits set.
- Use uintmax_t to print vm_page pindexes.
- Use NULL instead of casted zero.
- Remove if () test which duplicated the loop ending condition.
- Miscellaneous style fixes.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-15 14:34:33 +00:00
glebius
8e755a8e23 When we are in UMA_STARTUP use startup_alloc() for any zone, not for
internal zones only.  This allows creating new zones at early stages
of boot, without the need to mark them as internal to UMA, which isn't
always true.

Reviewed by:	alc
2017-06-08 21:33:19 +00:00
jhb
2a188f8a23 Fix an off-by-one error in the VM page array on some systems.
r31386 changed how the size of the VM page array was calculated to be
less wasteful.  For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages.  However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array.  Handle this case by explicitly excluding the page from
phys_avail[].

Reviewed by:	alc
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D11000
2017-06-08 16:18:41 +00:00
alc
662afbe1bf Starting in r118390, swaponsomething() began to reserve the blocks at the
beginning of a swap area for a disk label.  However, neither r118390 nor
r118544, which increased the reservation from one to two blocks, correctly
accounted for these blocks when updating the variable "swap_pager_avail".
This change corrects that error.

Reviewed by:	kib
MFC after:	5 days
2017-06-06 16:52:07 +00:00
alc
186046483d When the function blist_fill() was added to the kernel in r107913, the swap
pager used a different scheme for striping the allocation of swap space
across multiple devices.  And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size.  Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected.  Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme.  This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.

Reviewed by:	kib
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11043
2017-06-06 03:32:17 +00:00
alc
e5e6fc9910 The variable "breakout" is used like a Boolean, so actually define it as
one.

Reviewed by:	kib
MFC after:	5 days
2017-06-05 18:07:56 +00:00
alc
ce8181e2f8 Halve the memory being internally allocated by the blist allocator. In
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int.  (See r96849 and
r96851.)

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11028
2017-06-05 17:14:16 +00:00
glebius
18c82b0c32 As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs. 2017-06-01 18:36:52 +00:00
glebius
d20e709b83 Simplify boot pages management in UMA.
It is simply a contiguous virtual memory pointer and number of pages.
There is no need to build a linked list here.  Just increment the pointer
and decrement the counter.  The only functional difference to the old
allocator is that before we gave pages from the topmost down to the
lowest, and now we give them in normal ascending order.

While here remove padalign from a mutex that is unused at runtime.

Reviewed by:	alc
2017-06-01 18:26:57 +00:00
alc
a1b59a2a3e After r118390, the variable "dmmax" was neither the correct stripe size
nor the correct maximum block size.  Moreover, after r318995, it serves
no purpose except to provide information to user space through a read-
sysctl.

This change eliminates the variable "dmmax" but retains the sysctl.  It
also corrects the value returned by the sysctl.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 21:46:00 +00:00
alc
33fb89b4ed In r118390, the swap pager's approach to striping swap allocation over
multiple devices was changed.  However, swapoff_one() was not fully and
correctly converted.  In particular, with r118390's introduction of a per-
device blist, the maximum swap block size, "dmmax", became irrelevant to
swapoff_one()'s operation.  Moreover, swapoff_one() was performing out-of-
range operations on the per-device blist that were silently ignored by
blist_fill().

This change corrects both of these problems with swapoff_one(), which will
allow us to potentially increase MAX_PAGEOUT_CLUSTER.  Previously,
swapoff_one() would panic inside of blist_fill() if you increased
MAX_PAGEOUT_CLUSTER.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 16:40:00 +00:00
kib
e75ba1d5c4 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
kib
6b474b6405 Emulate pre-r317061 ABI.
This restores 32bit-sized accesses to the vmcnt sysctls, making old
binaries like top(1), systat(8) and reboot(8) mostly functional on
newer kernels.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
2017-05-02 18:40:41 +00:00
glebius
21ead51d79 - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
glebius
5763443023 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
markj
c4c6e9ae09 Busy the map in vm_map_protect().
We are otherwise susceptible to a race with a concurrent vm_map_wire(),
which may drop the map lock to fault pages into the object chain. In
particular, vm_map_protect() will only copy newly writable wired pages
into the top-level object when MAP_ENTRY_USER_WIRED is set, but
vm_map_wire() only sets this flag after its fault loop. We may thus end
up with a writable wired entry whose top-level object does not contain the
entire range of pages.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 21:01:42 +00:00
markj
6976fe28f2 Consistently use for-loops in vm_map_protect().
No functional change.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 20:57:16 +00:00
markj
278be1664c Add some bounds assertions to the vm_map_entry clip functions.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision: https://reviews.freebsd.org/D10349
2017-04-10 20:55:42 +00:00
kib
2a5a630e5b Extract calculation of ioflags from the vm_pager_putpages flags into a
helper.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:56:04 +00:00
kib
dfb5980629 Some style fixes for vnode_pager_generic_putpages(), in the local
declaration block.

Reviewed by:	markj (as part of the larger patch)
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:45:00 +00:00
kib
6e5de5b2c0 Use int instead of boolean_t for flags argument type in
vnode_pager_generic_putpages() prototype; change the argument name to
reflect that it is flags.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:30:41 +00:00
jhb
bbbd5839ff Assert that the align parameter to uma_zcreate() is valid.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D10100
2017-04-04 16:26:46 +00:00
dchagin
57bf038283 Add kern_mincore() helper for the mincore() syscall.
Suggested by:	kib@
Reviewed by:	kib@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D10143
2017-03-30 19:42:49 +00:00
alc
8bad3c5e87 Two changes to vm_fault_populate():
Simplify the logic for clipping the range returned by the pager to fit
within the map entry.

Use atop() rather than OFF_TO_IDX() on addresses.

Reviewed by:	kib
MFC after:	1 week
2017-03-19 19:52:47 +00:00
kib
340b707be8 Fix off-by-one in the vm_fault_populate() code.
When re-calculating the last inclusive page index after the pager
call, -1 was erronously ommitted.  If the pager extended the run
(unlikely), the result would be insertion of the valid page mapping
outside the current map entry range.

Found by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-19 14:42:16 +00:00
delphij
d308d4fa6c The adj_free and max_free values of new_entry will be calculated and
assigned by the subsequent vm_map_entry_link(); therefore, remove the
pointless copying.

Submitted by:	alc
MFC after:	3 days
2017-03-16 05:44:16 +00:00
alc
89e63829ad Relax the locking requirements for vm_object_page_noreuse(). While
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.

Reviewed by:	kib, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D10011
2017-03-15 17:43:45 +00:00
kib
8cf2af841c Use atop() instead of OFF_TO_IDX() for conversion of addresses or
address offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
delphij
6c1ff20b8d Implement INHERIT_ZERO for minherit(2).
INHERIT_ZERO is an OpenBSD feature.

When a page is marked as such, it is zeroed
upon fork().

This will be used in new arc4random(3) functions.

PR:	182610
Reviewed by:	kib (earlier version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D427
2017-03-14 17:10:42 +00:00
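Usage in, for example, a fork-safe random-state implementation might look
like this (sketch; rs is an illustrative pointer to the state page):

	#include <sys/mman.h>

	/* Have the state page zeroed in the child instead of copied. */
	if (minherit(rs, sizeof(*rs), INHERIT_ZERO) == -1)
		abort();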
markj
b6e7d5b313 Update a comment to reflect reality.
MFC after:	1 week
2017-03-13 18:45:25 +00:00
kib
dada1184e8 Follow-up to r313690.
Fix two missed places where vm_object offset to index calculation
should use unsigned shift, to allow handling of full range of unsigned
offsets used to create device mappings.

Reported and tested by:	royger (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:53:13 +00:00
avg
b0aaa6df92 uma: fix pages <-> items conversions at several places
Those places were not taking into account uk_ppera.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.

MFC after:	1 month (if ever)
2017-03-11 16:43:38 +00:00
avg
de71736cdc uma: eliminate uk_slabsize field
The field was not used beyond the initial keg setup stage anyway.

MFC after:	1 month (if ever)
2017-03-11 16:35:36 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3.  My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
avg
e4abb6a4a9 call vm_lowmem hook in uma_reclaim_worker
A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook.  No handler makes use of the flags yet.

Reviewed by:	markj, kib
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9764
2017-02-25 16:39:21 +00:00
kib
568d99bbad Properly handle possible underflow in vm_fault_prefault().
In vm_fault_prefault(), if the backward count causes underflow in the
calculation of
	starta = addra - backward * PAGE_SIZE;
then starta must be clipped to entry->start, instead of zero.
Clipping to zero allowed mapping outside of the map entries' address
ranges, in particular, mapping at zero.

Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by:	alc
MFC after:	1 week
2017-02-24 08:09:16 +00:00
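The corrected clamping, in sketch form: an unsigned underflow makes starta
wrap above addra, and both out-of-range cases now clip to the entry start.

	starta = addra - backward * PAGE_SIZE;
	if (starta < entry->start)
		starta = entry->start;
	else if (starta > addra)	/* unsigned subtraction wrapped */
		starta = entry->start;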
avg
25c75ccb77 try to fix RACCT_RSS accounting
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero.  In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.

Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state.  Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.

PR:		210315
Reviewed by:	trasz
Differential Revision: https://reviews.freebsd.org/D9464
2017-02-14 13:54:05 +00:00
bz
21d1178863 Use %s __func__ to print the actual function name (I have been looking
at the wrong one too often lately), and also use %#lx to
get the 0x prefix for the address.

MFC after:	1 week
2017-02-14 01:20:03 +00:00
kib
0dac2c5955 Rework r313352.
Rename kern_vm_* functions to kern_*.  Move the prototypes to
syscallsubr.h.  Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by:	alc, jhb
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9535
2017-02-13 09:04:38 +00:00
kib
0bc2784093 Remove MPSAFE and ARGUSED annotations, ANSI-fy syscall handlers.
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-13 00:40:55 +00:00
kib
9bfd07fd2e Consistently handle negative or wrapping offsets in the mmap(2) syscalls.
For regular files and posix shared memory, POSIX requires that the
[offset, offset + size) range is legitimate.  At mapping time,
check that the offset is not negative.  Allowing negative offsets might
expose data that the filesystem put into the vm_object for internal use,
esp. due to OFF_TO_IDX() signedness treatment.  The fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that the
arithmetic gives no undefined results.

For device mappings, leave the semantics of negative offsets to the
driver.  Correct the object page index calculation to not erroneously
propagate the sign.

In either case, disallow overflow of offset + size.

Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.

Reported and tested by:	royger (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-12 21:05:44 +00:00
kib
95e32f2845 Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int.
This makes the code pass the whole word of the mmap(2) syscall argument
prot to the syscall helper kern_vm_mmap(), which can validate all
bits.  The change provides a temporary fix for sys/vm/mmap_test
mmap__bad_arguments, which was broken after r313352.

PR:	216976
Reported and tested by:	ngie
Sponsored by:	The FreeBSD Foundation
2017-02-11 20:27:39 +00:00
trasz
322d98e692 Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.

Reviewed by:	ed, dchagin, kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9378
2017-02-06 20:57:12 +00:00
kib
cf399915fa Style, use tab after #define.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-02-04 19:16:19 +00:00
alc
62fe63cfce Over the years, the code and comments in vm_page_startup() have diverged in
one respect.  When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory.  This revision
changes the code to match the comments.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9081
2017-02-04 05:23:10 +00:00
trasz
09059e926a Ifdef out the unused vm_rr_selectdomain().
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-02 17:44:55 +00:00
markj
dfb4f0694a Avoid page lookups in the top-level object in vm_object_madvise().
We can iterate over consecutive resident pages in the top-level object
using the object's page list rather than by performing lookups in the
object radix tree. This extends one of the optimizations in r312208 to the
case where a shadow chain is present.

Suggested by:	alc
Reviewed by:	alc, kib (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9282
2017-01-30 18:51:43 +00:00
mjg
37825433d7 hwpmc: partially depessimize munmap handling if the module is not loaded
HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the
common (module not loaded) case.

In particular this avoids permission checks and a lock downgrade when
singlethreaded, and in cases where an executable mapping is found, the
pmc sx lock is no longer bounced.

Note this is a band aid.

MFC after:	1 week
2017-01-24 22:00:16 +00:00
markj
6543297f0e Avoid unnecessary page lookups in vm_object_madvise().
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.

While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9098
2017-01-15 03:50:08 +00:00
glebius
232949ffad Fix the contiguity once more. 2017-01-12 20:26:02 +00:00
markj
a43ae72c82 Remove a redundant use of min().
Reported by:	rpokala
X-MFC With:	r311346
2017-01-05 03:13:45 +00:00
markj
d0b3c7cb46 Add a small allocator for exec_map entries.
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.

The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8921
2017-01-05 01:44:12 +00:00
glebius
53ba35ff94 Fix assertion that checks that pages are consecutive to properly
handle bogus_page insertion(s).
2017-01-04 22:31:09 +00:00
glebius
01e1e94c27 Move bogus_page declaration to vm_page.h and initialization to vm_page.c.
Reviewed by:	kib
2017-01-04 22:27:19 +00:00
markj
bb34e01cc2 Add a page queue for holding dirty anonymous unswappable pages.
On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by:	alc
2017-01-03 00:05:44 +00:00
jhibbits
d63657bd9d Print flags in hex instead of decimal.
Hex is easier to grok for flags, and consistent with other prints.
2017-01-02 16:50:52 +00:00
kib
e345e3955b Style fixes for vm_map_insert().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-01 18:49:46 +00:00
kib
d67c4de4d0 Ansify vm/vm_pager.c. Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-31 19:30:22 +00:00
mjg
2f0eb1a5cb Use vrefact in vnode_pager_alloc. 2016-12-31 10:37:56 +00:00
kib
80cfc0c41c Fix two similar bugs in the populate vm_fault() code.
If the pager's populate method succeeded, but another thread raced with
us and modified the vm_map, we must unbusy all pages busied by the pager
before we retry the whole fault handling.  If the pager instantiated
more pages than fit into the current map entry, we must unbusy the pages
which are clipped.

Also do some refactoring, clarify comments and use more clear local
variable names.

Reported and tested by:	kargl, subbsd@gmail.com (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-30 18:55:33 +00:00
kib
8916f2e3c9 Assert that the pages found on the object queue by vm_page_next() and
vm_page_prev() have correct ownership.

In collaboration with:	alc
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2016-12-30 17:37:06 +00:00
kib
9763564a8d Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-30 13:04:43 +00:00
mjg
f4dcd1882e Remove cpu_spinwait after seq_consistent.
It does not add any benefit as the read routine will do it as necessary.
2016-12-30 06:26:17 +00:00
alc
d619a90d54 Relax the object type restrictions on vm_page_alloc_contig(). Specifically,
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.

Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc().  Also, eliminate the radix trie lookup performed with the
free page queues lock held.

Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8878
2016-12-28 18:32:13 +00:00
kib
02ea7af4a5 Remove redundancy in vmtotal().
There are two instances of inlined unlocks + continue in the vmtotal()
switch statements, which are ordinarily expressed with a break from the
switch case and code after the switch.  Also, the combination of a
continue and a break statement is redundant.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-26 19:29:04 +00:00
kib
da005e5578 Fix argument type and microoptimize swp_pager_meta_free().
The count argument's natural type is vm_pindex_t, but due to the loop
organization, it had to be a signed type to detect the termination
condition.  Replace this logic by using a distinct counter for the
processed pages, and terminate the loop when the counter exceeds the
argument.

Completely process one swblock for all relevant indexes instead of
doing a relookup in the hash when incrementing the page index on each
loop step.

Do not drop the hash mutex around iterations.

Noted and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-24 09:57:31 +00:00
kib
e2725e89e1 Improve vm_object_scan_all_shadowed() to also check swap backing objects.
As noted in the removed comment, it is possible and not prohibitively
costly to look up the swap blocks for the given page index.  Implement
a swap_pager_find_least() function to do that, and use it to iterate
simultaneously over both the backing object page queue and the swap
allocations when looking for shadowed pages.

Testing shows that the number of new successful scans, enabled by this
addition, is small but non-zero.  When it works out, the change both
further reduces the depth of the shadow object chain, and frees unused
but allocated swap and memory.

Suggested and reviewed by:	alc
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-18 20:56:14 +00:00
kib
a3cbc7800b In swp_pager_meta_free_all(), fix type of the index variable. Style.
Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-16 23:33:37 +00:00
kib
565e916002 Provide introductory description of the default pager.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-14 23:36:32 +00:00
kib
3bda92f604 Remove locking around accounting initialization of the default object.
The object is not yet fully constructed and must not be available to
other threads.  This makes default_pager_alloc() almost identical to
swap_pager_alloc_init().

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-14 23:34:25 +00:00
alc
d4c03e3af2 Tidy up. Mostly, remove or replace stale comments. Most of the comments
in this file actually described the operation of the swap pager, not the
default pager.  Given that this is the wrong place to discuss the
implementation of the swap pager, it shouldn't come as a surprise that as
the swap pager evolved these comments became increasingly stale.  In
addition, apply some style fixes, like modernizing a few remaining old-
style function definitions.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D8781
2016-12-14 17:28:55 +00:00
jhb
6b0130d5b3 Use db_lookup_proc() in the DDB 'show procvm' command.
This allows processes to be identified by PID as well as a pointer address.

MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
2016-12-13 19:22:43 +00:00
alc
023ed8da8d Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system.  Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8752
2016-12-12 17:47:09 +00:00
glebius
6042079e9d Allow bogus_page to be passed to pager(s). 2016-12-09 21:21:24 +00:00
markj
b0a8bb2436 Conditionalize PG_CACHE sysctls on COMPAT_FREEBSD11.
Reviewed by:	glebius, imp, jhb
Differential Revision:	https://reviews.freebsd.org/D8736
2016-12-09 18:55:27 +00:00
kib
fd9c8a16f1 Implement the populate() pager method for phys pager.
It allows providing configurable aggressive prefaulting and useful
hints to the page daemon about memory allocations, on faults for pages
managed by the phys pager.  In fact, this implementation is superior to
the MAP_SHARED_PHYS hack from my Postgresql paper, while giving
similar benefits of reducing the number of page faults on SysV shared
memory mappings.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2016-12-08 11:35:53 +00:00
kib
2819db13ca Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it.  It
gives drivers fine control of the pages ownership and allows drivers
to implement arbitrary prefault policies.

The populate method is called on a page fault and is supposed to
populate the vm object with the page at the fault location and some
amount of pages around it, at pager's discretion.  VM provides the
pager with the hints about current range of the object mapping, to
avoid instantiation of immediately unused pages, if pager decides so.
Also, VM passes the fault type and map entry protection to the pager,
allowing it to force the optimal required ownership of the mapped
pages.

Installed pages must contiguously fill the returned region, be fully
valid and exclusively busied.  Of course, the pages must be compatible
with the object's type.

After populate() successfully returned, VM fault handler installs as
many instantiated pages into the process page tables as it sees
reasonable, while still obeying the correct semantic for COW and vm
map locking.

The method is opt-in; the pager sets the OBJ_POPULATE flag to indicate
that the method can be called.  If the pager's vm objects can be
shadowed, the pager must implement the traditional getpages() method in
addition to the populate().  Populate() might fall back to the
getpages() on a per-call basis as well, by returning the VM_PAGER_BAD
error code.

For now, for device pagers, the populate() method is only allowed to be
used by the managed device pagers, but the limitation is only made
because there are no unmanaged fault handlers which could use it right
now.

KPI designed together with, and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2016-12-08 11:26:11 +00:00
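The shape of the hook as a managed device driver would supply it, sketched
with a hypothetical driver that populates just the faulting page (consult
the committed headers for the authoritative signature):

	static int
	mydev_pg_populate(vm_object_t object, vm_pindex_t pidx,
	    int fault_type, vm_prot_t max_prot, vm_pindex_t *first,
	    vm_pindex_t *last)
	{
		vm_page_t m;

		/*
		 * Instantiate exactly the faulting page, fully valid and
		 * exclusively busied, as the contract requires.
		 */
		m = vm_page_grab(object, pidx, VM_ALLOC_NORMAL);
		m->valid = VM_PAGE_BITS_ALL;
		*first = *last = pidx;	/* report the populated run */
		return (VM_PAGER_OK);
	}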
kib
a497cae7b6 Move map_generation snapshot value into struct faultstate.
Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-08 10:29:41 +00:00
kib
5671ac5680 Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-08 10:28:51 +00:00
alc
7571ef95c1 Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index.  However, with this
change, vm_radix_remove() no longer panics.  Instead, it returns NULL
if there is no vm_page_t at the specified index.  Otherwise, it
returns the vm_page_t.  The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations.  Instead of performing a lookup before every remove,
the pmap can simply perform the remove.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D8708
2016-12-08 04:29:29 +00:00
markj
8fa0e5ab2d Use the official spelling for NULL arguments to typed sysctl handlers.
Reported by:	bde
2016-12-07 01:15:10 +00:00
markj
376b886316 Provide dummy sysctls for v_cache_count and v_tcached.
Some utilities (notably top(1)) exit if any of their input sysctls don't
exist, and the removal of the above-mentioned PG_CACHE-related sysctls
makes it difficult to run such utilities on different versions of the
kernel without recompiling.

Requested by:	bde
2016-12-06 22:52:45 +00:00
alc
1584fca18f Eliminate a stale comment; vm_radix_prealloc() was replaced in r254141.
MFC after:	3 days
2016-12-02 16:29:30 +00:00
alc
93bd5b27d2 During vm_page_cache()'s call to vm_radix_insert(), if vm_page_alloc() was
called to allocate a new page of radix trie nodes, there could be a call to
vm_radix_remove() on the same trie (of PG_CACHED pages) as the in-progress
vm_radix_insert().  With the removal of PG_CACHED pages, we can simplify
vm_radix_insert() and vm_radix_remove() by removing the flags on the root of
the trie that were used to detect this case and the code for restarting
vm_radix_insert() when it happened.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8664
2016-12-01 17:26:37 +00:00
alc
d95f7c7503 Recursion on the free page queue mutex occurred when UMA needed to allocate
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache().  Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc().  This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.

Improve nearby comments.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8628
2016-11-27 01:42:53 +00:00
markj
4159d33f6b Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D8589
2016-11-23 17:53:07 +00:00
alc
4be9876033 Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used.  More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8583
2016-11-22 18:13:46 +00:00
glebius
1cea9b7cc2 - If the caller specifies readbehind and readahead that together with
  count don't fit into a buf, then trim readbehind and readahead evenly.  If
  rbehind was limited by the previous BMAP, then round up its trim to
  block size.
- Add KASSERT to check that b_blkno has proper offset from original
  blkno returned by BMAP. [1]
- Add KASSERT to check that pages in buf are consecutive.

Reviewed by:	kib
Submitted by:	kib [1]
2016-11-17 20:32:32 +00:00