freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	9140bff7ed	Remove a bogus assertion from vm_page_launder(). After r328977, a wired page m may have m->queue != PQ_NONE. Reviewed by: kib X-MFC with: r328977 Differential Revision: https://reviews.freebsd.org/D14485	2018-02-23 23:25:22 +00:00
Jeff Roberson	5f8cd1c0bf	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402	2018-02-23 22:51:51 +00:00
Konstantin Belousov	2c0f13aa59	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread' policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Scetched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384	2018-02-20 10:13:13 +00:00
Mark Johnston	3f060b60b1	Use the conventional name for an array of pages. No functional change intended. Discussed with: kib MFC after: 3 days	2018-02-16 15:38:22 +00:00
Konstantin Belousov	ada27a3bb8	Cleanup unused page argument for vm_reserv_break(). Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14364	2018-02-14 00:34:02 +00:00
Konstantin Belousov	d929ad7f91	Ensure memory consistency on COW. From the submitter description: The process is forked transitioning a map entry to COW Thread A writes to a page on the map entry, faults, updates the pmap to writable at a new phys addr, and starts TLB invalidations... Thread B acquires a lock, writes to a location on the new phys addr, and releases the lock Thread C acquires the lock, reads from the location on the old phys addr... Thread A ...continues the TLB invalidations which are completed Thread C ...reads from the location on the new phys addr, and releases the lock In this example Thread B and C [lock, use and unlock] properly and neither own the lock at the same time. Thread A was writing somewhere else on the page and so never had/needed the lock. Thread C sees a location that is only ever read\|modified under a lock change beneath it while it is the lock owner. To fix this, perform the two-stage update of the copied PTE. First, the PTE is updated with the address of the new physical page with copied content, but in read-only mode. The pmap locking and the page busy state during PTE update and TLB invalidation IPIs ensure that any writer to the page cannot upgrade the PTE to the writable state until all CPUs updated their TLB to not cache old mapping. Then, after the busy state of the page is lifted, the faults for write can proceed and do not violate the consistency of the reads. The change is done in vm_fault because most architectures do need IPIs to invalidate remote TLBs. More, I think that hardware guarantees of atomicity of the remote TLB invalidation are not enough to prevent the inconsistent reads of non-atomic reads, like multi-word accesses protected by a lock. So instead of modifying each pmap invalidation code, I did it there. Discovered and analyzed by: Elliott.Rabe@dell.com Reviewed by: markj PR: 225584 (appeared to have the same cause) Tested by: Elliott.Rabe@dell.com, emaste, Mike Tancsa <mike@sentex.net>, truckman Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14347	2018-02-14 00:31:45 +00:00
Konstantin Belousov	607970bc8e	Do not call pmap_enter() with invalid protection mode. If the map entry elookup was performed due to the mapping changes, we need to ensure that there is still some access permission bit requested which is compatible with the current vm_map_entry mode. If not, restart the handler from scratch instead of trying to save the current progress. Also adjust fault_type to not include cleared permission bits. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14347	2018-02-14 00:25:18 +00:00
Konstantin Belousov	c4be9169c0	Do not leak rv->psind in some specific situations. Suppose that we have an object with a mapped superpage, and that all pages in the superpages are held (by some driver). Additionally, suppose that the object is terminated, e.g. because the only process mapping it is exiting. Then the reservation is broken, but the pages cannot be freed until later, when they are unheld. In this situation, the reservation code cannot clean psind, since no pages are freed, and the page is freed and then reused with invalid psind. Clean psind on vm_reserv_break() to avoid the situation. Reported and tested by: Slava Shwartsman Reviewed by: markj Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14335	2018-02-13 15:36:28 +00:00
Jeff Roberson	e958ad4cf3	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Gleb Smirnoff	5073a08328	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
Mark Johnston	1d3a1bcfac	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943	2018-02-07 16:57:10 +00:00
Gleb Smirnoff	d2be4a1e4f	Use correct arithmetic to calculate how many pages we need for kegs and hashes. There is no functional change with current sizes.	2018-02-06 22:13:40 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Gleb Smirnoff	1616767dfc	Improve DIAGNOSTIC printf. Report using a boot page every time regardless of booted status.	2018-02-06 22:08:43 +00:00
Gleb Smirnoff	ae941b1b4e	Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC. o Call uma_startup1() after initializing kmem, vmem and domains. o Include 8 eight VM startup pages into uma_startup_count() calculation. o Account for vmem_startup() and vm_map_startup() preallocating pages. o Account for extra two allocations done by kmem_init() and vmem_create(). o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT allowed several other SYSINITs to sneak in before it, thus bumping requirement for amount of boot pages.	2018-02-06 22:06:59 +00:00
Mark Johnston	4d2653522d	Delete a declaration for a variable removed in r305362.	2018-02-06 17:26:11 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
Konstantin Belousov	20e4afbfbb	On munlock(), unwire correct page. It is possible, for complex fork()/collapse situations, to have sibling address spaces to partially share shadow chains. If one sibling performs wiring, it can happen that a transient page, invalid and busy, is installed into a shadow object which is visible to other sibling for the duration of vm_fault_hold(). When the backing object contains the valid page, and the wiring is performed on read-only entry, the transient page is eventually removed. But the sibling which observed the transient page might perform the unwire, executing vm_object_unwire(). There, the first page found in the shadow chain is considered as the page that was wired for the mapping. It is really the page below it which is wired. So we unwire the wrong page, either triggering the asserts of breaking the page' wire counter. As the fix, wait for the busy state to finish if we find such page during unwire, and restart the shadow chain walk after the sleep. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14184	2018-02-05 12:49:20 +00:00
Konstantin Belousov	938cdc4264	On pageout, in vnode generic pager, for partially dirty page, only clear dirty bits for completely invalid blocks. Otherwise we might not write out the last chunk that is shorter than 512 bytes, if the file end is not aligned on disk block boundary. This become important after the r324794. PR: 225586 Reported by: tris_vern@hotmail.com Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-02-02 11:56:30 +00:00
Konstantin Belousov	1c5196c3ed	Assign map->header values to avoid boundary checks. In several places, entry start and end field are checked, after excluding the possibility that the entry is map->header. By assigning max and min values to the start and end fields of map->header in vm_map_init, the explicit map->header checks become unnecessary. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: alc, kib, markj (previous version) Tested by: pho (previous version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13735	2018-01-20 12:19:02 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
Jeff Roberson	b6715dab8f	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
Jeff Roberson	6f4acaf4c9	Add support for NUMA domains to bus dma tags. This causes all memory allocated with a tag to come from the specified domain if it meets the other constraints provided by the tag. Automatically create a tag at the root of each bus specifying the domain local to that bus if available. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13545	2018-01-12 23:34:16 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Jeff Roberson	7a469c8ef3	Implement NUMA policy for kmem_*(9). This maintains compatibility with reservations by giving each memory domain its own KVA space in vmem that is naturally aligned on superpage boundaries. Reviewed by: alc, markj, kib (some objections) Sponsored by: Netflix, Dell/EMC Isilon Tested by; pho Differential Revision: https://reviews.freebsd.org/D13289	2018-01-12 23:13:55 +00:00
Jeff Roberson	7b11a48326	Add files for r327895 Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:57:57 +00:00
Jeff Roberson	3f289c3fcf	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
Ed Maste	d03890153d	ANSIfy function definitions in sys/vm/	2018-01-12 03:50:44 +00:00
Konstantin Belousov	33937731e7	Restructure swapout tests after vm map locking was removed. Consolidate the regions covered by the process lock. Combine similar conditions tests into one, e.g. all process flags can be test with one logical operation. Add check for in-exec state, since p_vmspace is dererenced. Remove labels and goto by explicitly tracking state. Update comments. Reviewed by: alc, markj (previous version) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13693	2018-01-04 18:14:58 +00:00
Alan Cox	36ca312db5	Once we have decided to swap out a process, don't delay the laundering of its per-thread kernel stack pages by making them pass through the inactive queue first. Instead, immediately place them in the laundry so that they might be cleaned and made available for reclamation sooner. Reviewed by: kib, markj MFC after: 1 week	2018-01-04 03:16:32 +00:00
Jeff Roberson	ad5b0f5b51	Fix arc after r326347 broke various memory limit queries. Use UMA features rather than kmem arena size to determine available memory. Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before the real limit is set. PR: 224330 (partial), 224080 Reviewed by: markj, avg Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13494	2018-01-02 04:35:56 +00:00
Konstantin Belousov	9997c0481c	Do not let vm_daemon run unbounded. On a load where single anonymous object consumes almost all memory on the large system, swapout code executes the iteration over the corresponding object page queue for long time, owning the map and object locks. This blocks pagedaemon which tries to lock the object, and blocks other threads in the process in vm_fault() waiting for the map lock. Handle the issue by terminating the deactivation loop if we executed too long and by yielding at the top level in vm_daemon. Reported by: peterj, pho Reviewed by: alc Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13671	2018-01-01 19:27:33 +00:00
Alan Cox	7000c58871	The variable "minslptime" is pointless and always has been, ever since its introduction in r83366. (At that time, this code appeared in vm/vm_glue.c, because vm/vm_swapout.c did not exist.) When the FOREACH_THREAD loop completes, we know that the sleep time for every thread is above whichever threshold is being applied. Reviewed by: kib X-MFC with: r327354	2017-12-31 21:36:42 +00:00
Alan Cox	4abca9bb05	Previously, swap_pager_copy() freed swap blocks one at at time, via swp_pager_meta_ctl(), with no opportunity to recognize freeing of consecutive blocks and free fewer block ranges. To open that opportunity, this change removes the SWM_FREE option from swp_pager_meta_ctl(), and compels the caller to do the freeing when a valid block address is returned. In swap_pager_copy(), these frees are aggregated, so that a sequence of them can be done at one time. The only other caller to swp_pager_meta_ctl() that passed SWM_FREE, swp_pager_unswapped(), is also modified to handle its single free explicitly. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib (an earlier version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13290	2017-12-31 04:01:47 +00:00
Konstantin Belousov	b6eabc36ba	Do not lock vm map in swapout_procs(). Neither swapout_procs() nor swapout() access the map. Since the process' vmspace is referenced only to obtain the pointer to the vm_map, the reference is not needed as well. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13681	2017-12-29 20:33:56 +00:00
Konstantin Belousov	e258b4a0dc	Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13678	2017-12-29 19:05:07 +00:00
Alan Cox	5c515efc88	After r327168, the variable "vm_pageout_wanted" can be static. MFC after: 2 weeks	2017-12-29 17:02:22 +00:00
Konstantin Belousov	89c0e67db5	Clean up the comment. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13671	2017-12-28 23:50:21 +00:00
Konstantin Belousov	0080a8fa95	In vm_swapout_map_deactivate_pages(), it is enough to lock the map for read. Reviewed by: alc, markj (as part of the larger patch) Tested by: pho (again, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13671	2017-12-28 22:56:30 +00:00
Alan Cox	fec296887f	Refactor vm_map_find(), creating a separate function, vm_map_alignspace(), for finding aligned free space in the given map. With this change, we always return KERN_NO_SPACE when we fail to find free space. Whereas, previously, we might return KERN_INVALID_ADDRESS. Also, with this change, we explicitly check for address wrap, rather than relying upon the map's min and max addresses to establish sentinel-like regions. This refactoring was inspired by the problem that we addressed in r326098. Reviewed by: kib Tested by: pho Discussed with: markj MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D13346	2017-12-26 17:59:37 +00:00
Mark Johnston	65ef323137	Ensure that pass > 0 when starting a scan with vm_pages_needed == 1. Otherwise the page daemon will not reclaim pages and thus will not wake threads sleeping in VM_WAIT. Reported and tested by: pho Reviewed by: alc, kib X-MFC with: r327168 Differential Revision: https://reviews.freebsd.org/D13640	2017-12-26 16:29:39 +00:00
Alan Cox	115423761e	Make the vm object bypass and collapse counters per CPU. Requested by: mjg Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13611	2017-12-25 19:36:04 +00:00
Mark Johnston	280d15cd0a	Fix two problems with the page daemon control loop. Both issues caused the page daemon to erroneously go to sleep when applications are consuming free pages at a high rate, leaving the application threads blocked in VM_WAIT. 1) After completing an inactive queue scan, concurrent allocations may have prevented the page daemon from meeting the v_free_min threshold. In this case, the page daemon was going to sleep even when the inactive queue contained plenty of clean pages. 2) pagedaemon_wakeup() may be called without the free queues lock held. This can lead to a lost wakeup if a call occurs after the page daemon clears vm_pageout_wanted but before going to sleep. Fix 1) by ensuring that we start a new inactive queue scan immediately if v_free_count < v_free_min after a prior scan. Fix 2) by adding a new subroutine, pagedaemon_wait(), called from vm_wait() and vm_waitpfault(). It wakes up the page daemon if either vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps on v_free_count. Reported by: jeff Reviewed by: alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13424	2017-12-24 19:45:16 +00:00
Konstantin Belousov	200f8117ba	Perform all accesses to uma_reclaim_needed using atomic(9) KPI. Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 10:06:55 +00:00
Mark Johnston	cb35676e66	Use a dedicated counter for inactive queue scans. The laundry thread keeps track of the number of inactive queue scans performed by the page daemon, and was previously using the v_pdwakeups counter to count them. However, in some cases the inactive queue may be scanned multiple times after a single wakeup, so it's more accurate to use a dedicated counter. Reviewed by: alc, kib (previous version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13422	2017-12-11 15:33:24 +00:00
Mark Johnston	82e2d06a27	Fix the act_scan_laundry_weight mechanism. r292392 modified the active queue scan to weigh clean pages differently from dirty pages when attempting to meet the inactive queue target. When r306706 was merged into the PQ_LAUNDRY branch, this mechanism was broken. Fix it by scalaing the correct page shortage variable. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13423	2017-12-09 15:47:26 +00:00
Mark Johnston	952a29c04b	Fix the UMA reclaim worker after r326347. atomic_set_*() sets a bit in the target memory location, so atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like it does. PR: 224080 Reviewed by: jeff, kib Differential Revision: https://reviews.freebsd.org/D13412	2017-12-07 19:38:09 +00:00
Mark Johnston	6eebec8343	Use unique wait messages in the page daemon control loop. Discussed with: alc MFC after: 1 week	2017-12-06 18:36:54 +00:00
Andrew Turner	5be9377857	Print the correct value when freelist is out of range. Security: : Sponsored by: DARPA, AFRL	2017-12-04 11:16:51 +00:00
Michael Zhilin	0db2102aaa	[mips] [vm] restore translation of freelist to flind for page allocation Commit r326346 moved domain iterators from physical layer to vm_page one, but it also removed translation of freelist to flind for vm_page_alloc_freelist() call. Before it expects VM_FREELIST_ parameter, but after it expect freelist index. On small WiFi boxes with few megabytes of RAM, there is only one freelist VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file sys/mips/include/vmparam.h). It results in freelist 1 with flind 0. At first, this commit renames flind to freelist in vm_page_alloc_freelist to avoid misunderstanding about input parameters. Then on physical layer it restores translation for correct handling of freelist parameter. Reported by: landonf Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D13351	2017-12-04 08:08:55 +00:00
Konstantin Belousov	e8502826ce	Add comment for vm_map_find_min(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days X-Differential revision: https://reviews.freebsd.org/D13155	2017-12-01 10:53:08 +00:00
Pedro F. Giffuni	796df753f4	SPDX: Consider code from Carnegie-Mellon University. Interesting cases, most likely from CMU Mach sources.	2017-11-30 15:48:35 +00:00
Pedro F. Giffuni	cf3329887e	SPDX: wrong license.	2017-11-30 15:45:42 +00:00
Mark Johnston	57cd81a357	Verify the object/vnode association after vget() in vm_pageout_clean(). It's theoretically possible for the vnode and object to be disassociated while locks are dropped around the vget() call, in which case we shouldn't proceed with laundering. Noted and reviewed by: kib MFC after: 1 week	2017-11-29 19:47:09 +00:00
Mark Johnston	1084894f80	Remove some comments that became incorrect with r325530.	2017-11-29 14:34:05 +00:00
Jeff Roberson	2e47807c21	Eliminate kmem_arena and kmem_object in preparation for further NUMA commits. The arena argument to kmem_*() is now only used in an assert. A follow-up commit will remove the argument altogether before we freeze the API for the next release. This replaces the hard limit on kmem size with a soft limit imposed by UMA. When the soft limit is exceeded we periodically wakeup the UMA reclaim thread to attempt to shrink KVA. On 32bit architectures this should behave much more gracefully as we exhaust KVA. On 64bit the limits are likely never hit. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13187	2017-11-28 23:40:54 +00:00
Jeff Roberson	ef435ae7de	Move domain iterators into the page layer where domain selection should take place. This makes the majority of the phys layer explicitly domain specific. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix & Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13014	2017-11-28 23:18:35 +00:00
Alan Cox	230869e051	When the swap pager allocates space on disk, it requests contiguous blocks in a single call to blist_alloc(). However, when it frees that space, it previously called blist_free() on each block, one at a time. With this change, the swap pager identifies ranges of contiguous blocks to be freed, and calls blist_free() once per range. In one extreme case, that is described in the review, the time to perform an munmap(2) was reduced by 55%. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12397	2017-11-28 17:46:03 +00:00
Mark Johnston	d2b677cef6	Avoid unnecessary lookups when initializing the vm_page array. This gives a marginal improvement in the vm_page_array initialization time. Also garbage-collect the now-unused vm_phys_paddr_to_segind(). Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13270	2017-11-27 17:46:38 +00:00
Pedro F. Giffuni	fe267a5590	sys: general adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. No functional change intended.	2017-11-27 15:23:17 +00:00
Mark Johnston	b20bf182e6	Move vm_phys_init_page() to vm_page.c. Suggested by: kib Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13250	2017-11-26 19:17:55 +00:00
Mark Johnston	830cb6b2b6	Remove unneeded initializations from vm_phys_init_page(). The page allocator always initializes the aflags and oflags fields. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13242	2017-11-26 19:16:45 +00:00
Konstantin Belousov	9410cd7d9e	Return different error code for the guard page layout violation. On KERN_NO_SPACE error, as it is returned now, vm_map_find() continues the loop searching for the suitable range for the requested mapping with specific alignment. Since the vm_map_findspace() succesfully finds the same place, the loop never ends. The errors returned from vm_map_stack() completely repeat the behavior of vm_map_insert() now, as suggested by Alan. Reported by: Arto Pekkanen <aksyom@gmail.com> PR: 223732 Reviewed by: alc, markj Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D13186	2017-11-22 16:45:27 +00:00
Alan Cox	4d572bb3ed	When vm_map_find(find_space = VMFS_OPTIMAL_SPACE) fails to find space, a second scan of the address space with find_space = VMFS_ANY_SPACE is performed. Previously, vm_map_find() released and reacquired the map lock between the first and second scans. However, there is no compelling reason to do so. This revision modifies vm_map_find() to retain the map lock. Reviewed by: jhb, kib, markj MFC after: 1 week X-Differential Revision: https://reviews.freebsd.org/D13155	2017-11-22 16:39:24 +00:00
Mark Johnston	5070d56d41	Allow for fictitious physical pages in vm_page_scan_contig(). Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter fictitous pages. For now, allow for this possibility; such pages will be skipped later in the scan since they are wired. Reported by: avg Reviewed by: kib MFC after: 1 week	2017-11-21 13:17:40 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Pedro F. Giffuni	df57947f08	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
Konstantin Belousov	1c778d91b5	vmtotal: extend memory counters to accomodate for current and future hardware sizes. 32bit counters already overflow on approachable virtual memory page counts, and soon would overflow on the physical pages counts as well. Bump sizes to 64bit types. Bump __FreeBSD_version. It is impossible to provide perfect backward ABI compat for this change. If a program requests an old structure, it can be detected by size. But if it queries the size first by passing NULL old req pointer, there is almost nothing we can do to detect the desired ABI. As a partial solution, check p_osrel of the quering process when selecting the size to report. Submitted by: Pawel Biernacki <pawel.biernacki@gmail.com> Differential revision: https://reviews.freebsd.org/D13018	2017-11-15 13:41:03 +00:00
Konstantin Belousov	772c8b6749	Fix operator priority. Sponsored by: The FreeBSD Foundation	2017-11-08 23:25:05 +00:00
Mark Johnston	e0b2fc3a51	Allow various page daemon parameters to be set from loader.conf. MFC after: 1 week	2017-11-08 19:55:17 +00:00
Jeff Roberson	8d6fbbb867	Replace manyinstances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Mark Johnston	bd0e1beb98	Correct the type of foff. No functional change intended. Github PR: 124 Submitted by: Wuyang Chung <wuyang.m.chung@outlook.com> MFC after: 1 week	2017-11-08 01:53:03 +00:00
Alan Cox	3a757e5403	Micro-optimize the handling of fictitious pages in vm_page_free_prep(). A fictitious page is always wired, so there is no point in trying to remove one from the page queues. Completely remove one inaccurate comment from vm_page_free_prep() and correct another. Reviewed by: kib, markj MFC after: 1 week	2017-10-24 17:14:53 +00:00
Edward Tomasz Napierala	be7d4ac586	Add OID for the vm.overcommit sysctl. This makes it possible to remove one call to sysctl(2) from jemalloc startup code. (That also requires changes to jemalloc, but I plan to push those to upstream first.) Reviewed by: kib MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D12745	2017-10-22 10:35:29 +00:00
Konstantin Belousov	422fe502b3	Check that the page which is freed as zeroed, indeed has all-zero content. This catches some rare mysterious failures at the source. The check is only performed on architectures which implement direct map, and only enabled with option DIAGNOSTIC, similar to other costly consistency checks. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-10-21 17:28:12 +00:00
Mark Johnston	eadbeae5e7	Free the right address range if kmem_back() fails in memguard_alloc(). MFC after: 1 week Sponsored by: Dell EMC Isilon	2017-10-20 21:13:19 +00:00
Konstantin Belousov	b3d4ab6645	Take the vm object lock in read mode in vnode_generic_putpages(). Only upgrade it to write mode if we need to clear dirty bits of the partially valid page after EOF. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2017-10-20 18:40:29 +00:00
Konstantin Belousov	ac04195ba6	Move swapout code into vm/vm_swapout.c. There is no NO_SWAPPING #ifdef left in the code. Requested by: alc Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D12663	2017-10-20 09:10:49 +00:00
Konstantin Belousov	05877a8595	Do not overwrite clean blocks on pageout. If filesystem block size is less than the page size, it is possible that the page-out run contains partially clean pages. E.g., the chunk of the page might be bdwrite()-ed, or some thread performed bwrite() on a buffer which references a chunk of the paged out page. As result, the assertion added in r319975, which checked that all pages in the run are dirty, does not hold on such filesystems. One solution is to remove the assert, but it is undesirable, because we do overwrite the valid on-disk content. I cannot provide a scenario where such write would corrupt the file data, but I do not like it on principle. Another, in my opinion proper, solution is to only write parts of the pages still marked dirty. The patch implements this, it skips clean blocks and only writes the dirty block runs. Note that due to clustering, write one page might clean other pages in the run, so the next write range must be calculated only after the current range is written out. More, due to a possible invalidation, and the fact that the object lock is dropped and reacquired before the checks, it is possible that the whole page-out pages run appears to consist of only clean pages. For this reason, it is impossible to assert that there is some work for the pageout method to do (i.e. assert that there is at least one dirty page in the run). But such clearing can only occur due to invalidation, and not due to a parallel write, because we own the vnode lock exclusive. Reported by: fsu In collaboration with: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D12668	2017-10-20 08:32:37 +00:00
Konstantin Belousov	4313989360	In vm_page_free_phys_pglist(), do not take vm_page_queue_free_mtx if there is nothing to do. Suggested by: mjg Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-20 08:25:49 +00:00
Alan Cox	4074d642d2	Batch atomic updates to the number of active, inactive, and laundry pages by vm_object_terminate_pages(). For example, for a "buildworld" workload, this batching reduces vm_object_terminate_pages()'s average execution time by 12%. (The total savings were about 11.7 billion processor cycles.) Reviewed by: kib MFC after: 1 week	2017-10-19 04:13:47 +00:00
Konstantin Belousov	1fffcd755d	Do not report reduction of swap zone if it was not. After r324600 we see the actual reservation. Reported by: jkim Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-18 07:27:43 +00:00
Mateusz Guzik	1dbf52e7d9	Reduce traffic on vm_cnt.v_free_count The variable is modified with the highly contended page free queue lock. It unnecessarily shares a cacheline with purely read-only fields and is re-read after the lock is dropped in the page allocation code making the hold time longer. Pad the variable just like the others and store the value as found with the lock held instead of re-reading. Provides a modest 1%-ish speed up in concurrent page faults. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D12665	2017-10-13 21:54:34 +00:00
Konstantin Belousov	53faf5a7d4	Evaluate the real size of the sblk_zone. Submitted by: ota@j.email.ne.jp PR: 221356 Reviewed by: alc, markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D12660	2017-10-13 16:23:05 +00:00
Ed Maste	6e309d75d2	ANSIfy vm_kern.c PR: 222673 Submitted by: ota@j.email.ne.jp MFC after: 1 week	2017-10-13 13:53:19 +00:00
Alan Cox	37244a84fd	Replace an unnecessary call to vm_page_activate() by an assertion that the page is already wired or queued. Prior to the elimination of PG_CACHED pages, vm_page_grab() might have returned a valid, previously PG_CACHED page, in which case enqueueing the page was necessary. Now, that can't happen. Moreover, activating the page is a dubious choice, since the page is not being accessed. Reviewed by: kib MFC after: 1 week	2017-10-08 16:54:42 +00:00
Alan Cox	41e5a22698	When an I/O error occurs on page out, there is no need to dirty the page, because it is already dirty. Instead, assert that the page is dirty. Reviewed by: kib, markj MFC after: 1 week	2017-10-01 17:04:26 +00:00
Alan Cox	cf060942db	Optimize vm_object_page_remove() by eliminating pointless calls to pmap_remove_all(). If the object to which a page belongs has no references, then that page cannot possibly be mapped. Reviewed by: kib MFC after: 1 week	2017-09-28 17:55:41 +00:00
John Baldwin	14c510c0cf	Add UMA_ALIGNOF(). This is a wrapper around _Alignof() that sets the alignment for a zone to the alignment required by a given type. This allows the compiler to determine the proper alignment rather than having the programmer try to guess. Discussed on: arch@ MFC after: 1 week Sponsored by: DARPA / AFRL	2017-09-27 23:15:33 +00:00
Alan Cox	43cc906f40	Change vm_page_try_to_free() to require a managed page. Essentially, vm_page_try_to_free() is testing conditions, like clean versus dirty, that only vary in managed pages. Suggested by: kib Reviewed by: markj X-MFC after: never	2017-09-24 23:35:01 +00:00
Alan Cox	494c6e43d3	Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all() can be avoided when the page's containing object has a reference count of zero. (If the object has a reference count of zero, then none of its pages can possibly be mapped.) Address nearby style issues in vm_page_try_to_free(), and change its return type to "bool". Reviewed by: kib, markj MFC after: 1 week	2017-09-24 16:50:10 +00:00
Konstantin Belousov	5bf949377e	For unlinked files, do not msync(2) or sync on the vnode deactivation. One consequence of the patch is that msyncing unlinked file mappings no longer reduces the amount of the dirty memory in the system, but I do not think that there are users of msync(2) that utilize it for such side-effect. Reported and tested by: tjil PR: 222356 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12411	2017-09-19 16:46:37 +00:00
Konstantin Belousov	bba52ecadd	Batch freeing of the pages in vm_object_page_remove() under the same free queue mutex lock owning session, same as it was done for the object termination in r323561. Reported and tested by: mjg Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-15 16:07:09 +00:00
Mark Johnston	e04223bf94	Include _bitset.h to get BITSET_DEFINE, used to define struct slabbits. MFC after: 1 week	2017-09-15 14:59:35 +00:00
Mark Johnston	2d54d4bb9f	Widen uk_pgoff, the slab header offset field. 16 bits is only wide enough for kegs with an item size of up to 64KB. At that size or larger, slab headers are typically offpage because the item size is a multiple of the page size, but there is no requirement that this be the case. We can widen the field without affecting the layout of struct uma_keg since the removal of uk_slabsize in r315077 left an adjacent hole. PR: 218911 MFC after: 2 weeks	2017-09-13 21:54:37 +00:00
Konstantin Belousov	e82e50e681	Remove inline specifier from vm_page_free_wakeup(), do not micro-manage compiler. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-13 19:30:09 +00:00
Konstantin Belousov	2fcd1ff68f	Do not relock free queue mutex for each page, free whole terminating object' page queue under the single mutex lock. First, all pages on the queue are prepared for free by calls to vm_page_free_prep(), and pages which should not be returned to the physical allocator (e.g. wired or fictitious) are simply removed from the queue. On the second pass, vm_page_free_phys_pglist() inserts all pages from the queue without relocking the mutex. The change improves the object termination, e.g. on the process exit where large anonymous memory objects otherwise cause relocks the free queue mutex for each page. More, if several such processes are exiting or execing in parallel, the mutex was highly contended on the address space demolition. Diagnosed and tested by: mjg (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-13 19:22:07 +00:00
Konstantin Belousov	540ac3b310	Split vm_page_free_toq() into two parts, preparation vm_page_free_prep() and insertion into the phys allocator free queues vm_page_free_phys(). Also provide a wrapper vm_page_free_phys_pglist() for batched free. Reviewed by: alc, markj Tested by: mjg (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-13 19:11:52 +00:00
Konstantin Belousov	b9e8fb647e	Use existing tag name for the vm_object' memq. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-13 19:03:59 +00:00

1 2 3 4 5 ...

3831 Commits