freebsd-nq

Author	SHA1	Message	Date
Mateusz Guzik	2554f86a8d	vm: stop taking proc lock in mmap to satisfy racct if it is disabled Limits can be safely obtained with lim_cur from the thread. racct is compiled in but disabled by default. Note that racct enablement is a boot-only tunable. This eliminates second most common place of taking the lock while pkg building. While here don't take the lock in mlockall either. Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17210	2018-09-18 01:24:30 +00:00
Mark Johnston	7a364d458a	Split some checks in vm_page_activate() to make it easier to read. No functional change intended. Reviewed by: alc, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17028	2018-09-10 18:59:23 +00:00
Mark Johnston	5a7f993702	Relax an assertion in vm_pqbatch_process_page(). While executing vm_pqbatch_process_page(m), m->queue may change to PQ_NONE if the page daemon is concurrently freeing the page. In this case m's queue state flags must be clear, so vm_pqbatch_process_page() will be a no-op, but the race could cause spurious assertion failures. Correct the assertion which assumed that m->queue's value does not change while the page queue lock is held. Reviewed by: alc, kib Reported and tested by: pho Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17027	2018-09-08 21:49:43 +00:00
Mark Johnston	c56c7299c2	Use the correct terminology. Reported by: kib Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D16191	2018-09-06 20:02:19 +00:00
Mark Johnston	23984ce5cd	Avoid resource deadlocks when one domain has exhausted its memory. Attempt other allowed domains if the requested domain is below the minimum paging threshold. Block in fork only if all domains available to the forking thread are below the severe threshold rather than any. Submitted by: jeff Reported by: mjg Reviewed by: alc, kib, markj Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D16191	2018-09-06 19:28:52 +00:00
Mark Johnston	21f01f4584	Remove vm_page_remque(). Testing m->queue != PQ_NONE is not sufficient; see the commit log message for r338276. As of r332974 vm_page_dequeue() handles already-dequeued pages, so just replace vm_page_remque() calls with vm_page_dequeue() calls. Reviewed by: kib Tested by: pho Approved by: re (marius) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17025	2018-09-06 16:17:45 +00:00
Alan Cox	72aebdd742	Recent changes have created, for the first time, physical memory segments that can be coalesced. To be clear, fragmentation of phys_avail[] is not the cause. This fragmentation of vm_phys_segs[] arises from the "special" calls to vm_phys_add_seg(), in other words, not those that derive directly from phys_avail[], but those that we create for the initial kernel page table pages and now for the kernel and modules loaded at boot time. Since we sometimes iterate over the physical memory segments, coalescing these segments at initialization time is a worthwhile change. Reviewed by: kib, markj Approved by: re (rgrimes) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D16976	2018-09-02 18:29:38 +00:00
Konstantin Belousov	f0165b1ca6	Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines. Exposing max_offset and min_offset defines in public headers is causing clashes with variable names, for example when building QEMU. Based on the submission by: royger Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Approved by: re (marius) Differential revision: https://reviews.freebsd.org/D16881	2018-08-29 12:24:19 +00:00
Mark Murray	19fa89e938	Remove the Yarrow PRNG algorithm option in accordance with due notice given in random(4). This includes updating of the relevant man pages, and no-longer-used harvesting parameters. Ensure that the pseudo-unit-test still does something useful, now also with the "other" algorithm instead of Yarrow. PR: 230870 Reviewed by: cem Approved by: so(delphij,gtetlow) Approved by: re(marius) Differential Revision: https://reviews.freebsd.org/D16898	2018-08-26 12:51:46 +00:00
Alan Cox	49bfa624ac	Eliminate the arena parameter to kmem_free(). Implicitly this corrects an error in the function hypercall_memfree(), where the wrong arena was being passed to kmem_free(). Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are mapped in kmem with execute permissions. Use this flag to determine which arena the kmem virtual addresses are returned to. Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it redundant. Update the nearby comment for UMA_SLAB_KERNEL. Reviewed by: kib, markj Discussed with: jeff Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16845	2018-08-25 19:38:08 +00:00
Gleb Smirnoff	306abf0f35	Either "free" or "allocated" is misleading here, since an item in a bucket is free from perspective of UMA consumer, and it is allocated from perspective of keg. Discussed with: markj Approved by: re (kib)	2018-08-24 18:47:50 +00:00
Gleb Smirnoff	a307fb5b0c	Fix comment. The actual meaning of ub_cnt is the opposite.	2018-08-23 23:24:28 +00:00
Mark Johnston	899fe184c7	Add a per-pagequeue pdpages counter. Expose these counters under the vm.domain sysctl node. The existing vm.stats.vm.v_pdpages sysctl is preserved. Reviewed by: alc (previous version) Differential Revision: https://reviews.freebsd.org/D14666	2018-08-23 21:03:45 +00:00
Mark Johnston	99d92d732f	Ensure that queue state is cleared when vm_page_dequeue() returns. Per-page queue state is updated non-atomically, with either the page lock or the page queue lock held. When vm_page_dequeue() is called without the page lock, in rare cases a different thread may be concurrently dequeuing the page with the pagequeue lock held. Because of the non-atomic update, vm_page_dequeue() might return before queue state is completely updated, which can lead to race conditions. Restrict the vm_page_dequeue() interface so that it must be called either with the page lock held or on a free page, and busy wait when a different thread is concurrently updating queue state, which must happen in a critical section. While here, do some related cleanup: inline vm_page_dequeue_locked() into its only caller and delete a prototype for the unimplemented vm_page_requeue_locked(). Replace the volatile qualifier for "queue" added in r333703 with explicit uses of atomic_load_8() where required. Reported and tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D15980	2018-08-23 20:34:22 +00:00
Alan Cox	83a90bffd8	Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter became unused in FreeBSD 12.x as a side-effect of the NUMA-related changes.) Reviewed by: kib, markj Discussed with: jeff, re@ Differential Revision: https://reviews.freebsd.org/D16825	2018-08-21 16:43:46 +00:00
Alan Cox	44d0efb215	Eliminate kmem_alloc_contig()'s unused arena parameter. Reviewed by: hselasky, kib, markj Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D16799	2018-08-20 15:57:27 +00:00
Alan Cox	db7c2a4822	Eliminate the unused arena parameter from kmem_alloc_attr(). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D16793	2018-08-18 22:07:48 +00:00
Alan Cox	067fd85894	Eliminate the arena parameter to kmem_malloc_domain(). It is redundant. The domain and flags parameters suffice. In fact, the related functions kmem_alloc_{attr,contig}_domain() don't have an arena parameter. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D16713	2018-08-18 18:33:50 +00:00
Konstantin Belousov	c1344d2bbe	Prevent some parallel swap-ins, rate-limit swapper swap-ins. If faultin() was called outside swapper (from PHOLD()), do not allow swapper to initiate additional swap-ins. Swapper-initiated swap-ins are serialized because they are synchronous and executed in the context of the thread0. With the added limitation, we only allow parallel swap-ins from PHOLD(), which is up to PHOLD() users to manage, usually they do not need to. Rate-limit swapper's swap-ins to one in the MAXSLP / 2 seconds interval, counting faultin() swapins. Suggested by: alc Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16610	2018-08-13 16:48:46 +00:00
Mark Johnston	b50a4ea646	Account for the lowmem handlers in the inactive queue scan target. Before r329882 the target would be computed after lowmem handlers run and free pages. On some systems a significant amount of page reclamation happens this way. However, with r329882 the target is computed first, which can lead to unnecessary reclamation from the page cache, and this in turn may result in excessive swapping. Instead, adjust the target after running lowmem handlers. Don't invoke the lowmem handlers before the PID controller, though, since that would hide the true rate of page allocation. Reviewed by: alc, kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16606	2018-08-09 18:25:49 +00:00
Alan Cox	2bf8cb3804	Add support for pmap_enter(..., psind=1) to the armv6 pmap. In other words, add support for explicitly requesting that pmap_enter() create a 1 MB page mapping. (Essentially, this feature allows the machine-independent layer to create superpage mappings preemptively, and not wait for automatic promotion to occur.) Export pmap_ps_enabled() to the machine-independent layer. Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail or reclaim a PV entry when one is not available. Refactor pmap_enter_pte1() into two functions, one by the same name, that is a general-purpose function for creating pte1 mappings, and another, pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute- only mappings for execve(2), mmap(2), and shmat(2). In addition, as an optimization to pmap_enter(..., psind=0), eliminate the use of pte2_is_managed() from pmap_enter(). Unlike the x86 pmap implementations, armv6 does not have a managed bit defined within the PTE. So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n) in the number of vm_phys_segs[]. All but one call to PHYS_TO_VM_PAGE() in pmap_enter() can be avoided. Reviewed by: kib, markj, mmel Tested by: mmel MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D16555	2018-08-08 16:55:01 +00:00
Alan Cox	78f1deeffe	Defer and aggregate swap_pager_meta_build frees. Before swp_pager_meta_build replaces an old swapblk with a new one, it frees the old one. To allow such freeing of blocks to be aggregated, have swp_pager_meta_build return the old swap block, and make the caller responsible for freeing it. Define a pair of short static functions, swp_pager_init_freerange and swp_pager_update_freerange, to do the initialization and updating of blk addresses and counters used in aggregating blocks to be freed. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib, markj (an earlier version) Tested by: pho MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13707	2018-08-08 02:30:34 +00:00
Konstantin Belousov	a70e9a1388	Swap in WKILLED processes. Swapped-out process that is WKILLED must be swapped in as soon as possible. The reason is that such process can be killed by OOM and its pages can be only freed if the process exits. To exit, the kernel stack of the process must be mapped. When allocating pages for the stack of the WKILLED process on swap in, use VM_ALLOC_SYSTEM requests to increase the chance of the allocation to succeed. Add counter of the swapped out processes to avoid unneeded iteration over the allprocs list when there is no work to do, reducing the allproc_lock ownership. Reviewed by: alc, markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16489	2018-08-04 20:45:43 +00:00
Mark Johnston	c16bd872dc	Add the required page accounting to kmem_bootstrap_free(). Reviewed by: alc, kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16581	2018-08-03 16:35:37 +00:00
Konstantin Belousov	e45b89d23d	Add pmap_is_valid_memattr(9). Discussed with: alc Sponsored by: The FreeBSD Foundation, Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15583	2018-08-01 18:45:51 +00:00
Konstantin Belousov	6e1d2cf679	For compat32, emulate the same wraparound check as occurs on the real ILP32 system. Reported by and discussed with: asomers PR: 230162 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16525	2018-07-31 18:00:47 +00:00
Alan Cox	005783a0a6	Allow vm object coalescing to occur in the midst of a vm object when the OBJ_ONEMAPPING flag is set. In other words, allow recycling of existing but unused subranges of a vm object when the OBJ_ONEMAPPING flag is set. Such situations are increasingly common with jemalloc >= 5.0. This change has the expected effect of reducing the number of vm map entry and object allocations and increasing the number of superpage promotions. Reviewed by: kib, markj Tested by: pho MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D16501	2018-07-31 17:41:48 +00:00
Alan Cox	737e25f7eb	To date, mlockall(MCL_FUTURE) has had the unfortunate side effect of blocking vm map entry and object coalescing for the calling process. However, there is no reason that mlockall(MCL_FUTURE) should block such coalescing. This change enables it. Reviewed by: kib, markj Tested by: pho MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D16413	2018-07-28 04:06:33 +00:00
Warner Losh	67d33338c0	Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM. There's no difference between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM except for the default boundary (16MB on x86 and 256MB on MIPS, but they are otherwise the same). We don't need both for any system we support (there were some really old ARC systems that did have ISA/EISA bus, but we never ran on them and they are too old to ever grow support for). Differential Review: https://reviews.freebsd.org/D16290	2018-07-27 18:34:20 +00:00
Mark Johnston	6c85795a25	Fix handling of KVA in kmem_bootstrap_free(). Do not use vm_map_remove() to release KVA back to the system. Because kernel map entries do not have an associated VM object, with r336030 the vm_map_remove() call will not update the kernel page tables. Avoid relying on the vm_map layer and instead update the pmap and release KVA to the kernel arena directly in kmem_bootstrap_free(). Because the pmap updates will generally result in superpage demotions, modify pmap_init() to insert PTPs shadowed by superpage mappings into the kernel pmap's radix tree. While here, port r329171 to i386. Reported by: alc Reviewed by: alc, kib X-MFC with: r336505 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16426	2018-07-27 15:46:34 +00:00
Li-Wen Hsu	03154ade2a	Use __riscv to determine building for RISC-V Reviewed by: br Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16398	2018-07-23 19:49:54 +00:00
Mark Johnston	398a929f42	Add support for pmap_enter(psind = 1) to the arm64 pmap. See the commit log messages for r321378 and r336288 for descriptions of this functionality. Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D16303	2018-07-20 16:37:04 +00:00
Mark Johnston	483f692ea6	Have preload_delete_name() free pages backing preloaded data. On i386 and amd64, add a vm_phys segment for physical memory used to store the kernel binary and other preloaded data. This makes it possible to free such memory back to the system once it is no longer needed, e.g., when a preloaded kernel module is unloaded. Previously, it would have remained unused. Reviewed by: kib, royger MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16330	2018-07-19 20:00:28 +00:00
Alan Cox	103cc0f6ea	Revert r329254. The underlying cause for the copy-on-write problem in multithreaded programs that was addressed by r329254 was in the implementation of pmap_enter() on some architectures, notably, amd64. kib@, markj@ and I have audited all of the pmap_enter() implementations, and fixed the broken ones, specifically, amd64 (r335784, r335971), i386 (r336092), mips (r336248), and riscv (r336294). To be clear, the reason to address the problem within pmap_enter() and revert r329254 is not just a matter of principle. An effect of r329254 was that a copy-on-write fault actually entailed two page faults, not one, even for single-threaded programs. Now, in the expected case for either single- or multithreaded programs, we are back to a single page fault to complete a copy-on-write operation. (In extremely rare circumstances, a multithreaded program could suffer two page faults.) Reviewed by: kib, markj Tested by: truckman MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D16301	2018-07-19 17:01:10 +00:00
Alan Cox	d7aeb429a0	Test PGA_REFERENCED after calling pmap_ts_referenced(), rather than before, so that a reference from a concurrently destroyed mapping is observed during the current scan. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D16277	2018-07-15 19:25:15 +00:00
Alan Cox	8c0873714c	Add support for pmap_enter(..., psind=1) to the i386 pmap. In other words, add support for explicitly requesting that pmap_enter() create a 2 or 4 MB page mapping. (Essentially, this feature allows the machine-independent layer to create superpage mappings preemptively, and not wait for automatic promotion to occur.) Export pmap_ps_enabled() to the machine-independent layer. Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or reclaim a PV entry when one is not available. Refactor pmap_enter_pde() into two functions, one by the same name, that is a general-purpose function for creating PDE PG_PS mappings, and another, pmap_enter_4mpage(), that is used to prefault 2 or 4 MB read- and/or execute-only mappings for execve(2), mmap(2), and shmat(2). Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D16246	2018-07-14 17:20:27 +00:00
Mateusz Guzik	efb6d4a479	uma: whack main zone counter update in the slow path, freeing side See r333052.	2018-07-12 22:35:52 +00:00
Mark Johnston	013072f04c	Fix pre-SI_SUB_CPU initialization of per-CPU counters. r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the backend allocator for PCPU UMA zones. Unlike page_alloc(), it does not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that. r336020 also changed counter(9) to initialize each counter using a CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU, smp_rendezvous() will only execute the callback on the current CPU (i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed by virtue of the fact that UMA gratuitously zeroes slabs when importing them into a zone. Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by iterating over the full range of CPU IDs when zeroing counters, ignoring whether the corresponding bits in all_cpus are set. Reported and tested by: pho (previous version) Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D16190	2018-07-10 00:18:12 +00:00
Sean Bruno	a03af34228	Wrap the declaration and assignment of "stripe" with #ifdef NUMA declarations as not all targets are NUMA aware. Found with gcc. Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16113	2018-07-07 13:37:44 +00:00
Jeff Roberson	2ef6727edd	Use the ticks since the last update to reduce hysteresis in the partpopq and contention on the vm_reserv_domain lock. This gives a roughly 8x speedup on will-it-scale fault1 on a 16 core machine. Reviewed by: alc, kib, markj	2018-07-07 01:54:45 +00:00
Konstantin Belousov	32f0fefc39	Save a call to pmap_remove() if entry cannot have any pages mapped. Due to the way rtld creates mappings for the shared objects, each dso causes unmap of at least three guard map entries. For instance, in the buildworld load, this change reduces the amount of pmap_remove() calls by 1/5. Profiled by: alc Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16148	2018-07-06 12:44:48 +00:00
Konstantin Belousov	be7be41275	Style: no need for braces around single-line then clause. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D16148	2018-07-06 12:37:46 +00:00
Matt Macy	ab3059a8e7	Back pcpu zone with domain correct pages - Change pcpu zone consumers to use a stride size of PAGE_SIZE. (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier) - Allocate page from the correct domain for a given cpu. - Don't initialize pc_domain to non-zero value if NUMA is not defined There are some misconceptions surrounding this field. It is the _VM_ NUMA domain and should only ever correspond to valid domain values as understood by the VM. The former slab size of sizeof(struct pcpu) was somewhat arbitrary. The new value is PAGE_SIZE because that's the smallest granularity which the VM can allocate a slab for a given domain. If you have fewer than PAGE_SIZE/8 counters on your system there will be some memory wasted, but this is obviously something where you want the cache line to be coming from the correct domain. Reviewed by: jeff Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15933	2018-07-06 02:06:03 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possibly other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Konstantin Belousov	a66d7a8ddc	Copyout(9) on 4/4 i386 needs correct vm_page_array[]. On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold() on arbitrary userspace mapping. If the mapping is backed by the non-managed cdev pager or by the sg pager, on dense configs we might access arbitrary element of vm_page_array[], in particular, not corresponding to a page from the memory segment. Initialize such pages as fictitious with the corresponding physical address. Reported by: bde Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D16085	2018-07-05 16:43:15 +00:00
Alan Cox	370a338a7d	Allow callers to vm_phys_split_pages() to specify whether insertion should occur at the head or the tail of the page queues.	2018-07-05 02:08:57 +00:00
Matt Macy	f4b3640475	inline atomics and allow tied modules to inline locks - inline atomics in modules on i386 and amd64 (they were always inline on other arches) - allow modules to opt in to inlining locks by specifying MODULE_TIED=1 in the makefile Reviewed by: kib Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16079	2018-07-02 19:48:38 +00:00
Alan Cox	7493904eca	Introduce vm_phys_enq_range(), and call it in vm_phys_alloc_npages() and vm_phys_alloc_seg_contig() instead of vm_phys_free_contig(). In short, vm_phys_enq_range() is simpler and faster than the more general vm_phys_free_contig(), and in the case of vm_phys_alloc_seg_contig(), vm_phys_free_contig() was placing the excess physical pages at the wrong end of the queues. In collaboration with: Doug Moore <dougm@rice.edu>	2018-07-02 17:18:46 +00:00
Alan Cox	9161b4de54	Three changes to vm_phys_alloc_seg_contig(): 1. Optimize the order computation. 2. Update the pool for all of the chunks that are removed from the free page lists, and not just the first chunk. 3. Simplify the code for returning excess pages to the free page lists. Reviewed by: Doug Moore <dougm@rice.edu>	2018-06-29 04:08:14 +00:00
Alan Cox	32d81f21b9	Reflow one of the comments describing vm_phys_alloc_npages().	2018-06-28 17:52:06 +00:00
Ed Maste	e8a1ec3e05	Split kern_break from sys_break and use it in linuxulator Previously the linuxulator's linux_brk invoked the FreeBSD sys_break syscall implementation directly. Instead, move the bulk of the existing implementation to kern_break, and call that from both sys_break and linux_brk. This also addresses a minor bug in linux_brk in that we now return the actual (rounded up) break address, rather than the requested value. Reviewed by: brooks (earlier version) Sponsored by: Turing Robotic Industries Differential Revision: https://reviews.freebsd.org/D16019	2018-06-27 14:45:13 +00:00
Alan Cox	89ea39a727	Update the physical page selection strategy used by vm_page_import() so that it does not cause rapid fragmentation of the free physical memory. Reviewed by: jeff, markj (an earlier version) Differential Revision: https://reviews.freebsd.org/D15976	2018-06-26 18:29:56 +00:00
Mateusz Guzik	a3d799fbb5	vm: stop passing M_ZERO when allocating radix nodes Allocation explicitly initialized the 3 leading fields. The rest is an array which is supposed to be NULL-ed prior to deallocation. Delegate zeroing to the infrequently called object initializer. This gets rid of one of the most common memset consumers. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D15989	2018-06-24 13:08:05 +00:00
Jeff Roberson	63b5557b2f	Sort uma_zone fields according to 64 byte cache line with adjacent line prefetch on 64bit architectures. Prior to this, two lines were needed for the fast path and each line may fetch an unused adjacent neighbor. - Move fields used by the fast path into a single line. - Move constants into the adjacent line which is mostly used for the spare bucket alloc 'medium path'. - Unpad the mtx which is only used by the fast path and place it in a line with rarely used data. This aligns the cachelines better and eliminates 128 bytes of wasted space. This gives a 45% improvement on a will-it-scale test on a 24 core machine. Reviewed by: mmacy	2018-06-23 08:10:09 +00:00
Ian Lepore	c5b7751fa2	Eliminate a spurious panic on non-SMP systems (occurred on shutdown/reboot).	2018-06-22 20:22:26 +00:00
Ruslan Bukin	b47999470d	Fix uma_zalloc_pcpu_arg() operation in case of !SMP build. Reviewed by: mjg Sponsored by: DARPA, AFRL	2018-06-21 11:43:54 +00:00
Brooks Davis	9da5364ed9	Name the implementation of brk and sbrk sys_break(). The break() system call was renamed (several times) starting in v3 AT&T UNIX when C was invented and break was a language keyword. The last vestige of a need for it to be called something else (eg obreak) was removed in r225617 which consistently prefixed all syscall implementations. Reviewed by: emaste, kib (older version) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15638	2018-06-14 21:27:25 +00:00
Konstantin Belousov	b7b8a09658	Handle the race between fork/vm_object_split() and faults. If fault started before vmspace_fork() locked the map, and then during fork, vm_map_copy_entry()->vm_object_split() is executed, it is possible that the fault instantiate the page into the original object when the page was already copied into the new object (see vm_map_split() for the orig/new objects terminology). This can happen if split found a busy page (e.g. from the fault) and slept dropping the objects lock, which allows the swap pager to instantiate read-behind pages for the fault. Then the restart of the scan can see a page in the scanned range, where it was already copied to the upper object. Fix it by instantiating the read-ahead pages before swap_pager_getpages() method drops the lock to allocate pbuf. The object scan would see the whole range prefilled with the busy pages and not proceed the range. Note that vm_fault rechecks the map generation count after the object unlock, so that it restarts the handling if raced with split, and re-lookups the right page from the upper object. In collaboration with: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-06-14 19:41:02 +00:00
Jonathan T. Looney	0766f278d8	Make UMA and malloc(9) return non-executable memory in most cases. Most kernel memory that is allocated after boot does not need to be executable. There are a few exceptions. For example, kernel modules do need executable memory, but they don't use UMA or malloc(9). The BPF JIT compiler also needs executable memory and did use malloc(9) until r317072. (Note that a side effect of r316767 was that the "small allocation" path in UMA on amd64 already returned non-executable memory. This meant that some calls to malloc(9) or the UMA zone(9) allocator could return executable memory, while others could return non-executable memory. This change makes the behavior consistent.) This change makes malloc(9) return non-executable memory unless the new M_EXEC flag is specified. After this change, the UMA zone(9) allocator will always return non-executable memory, and a KASSERT will catch attempts to use the M_EXEC flag to allocate executable memory using uma_zalloc() or its variants. Allocations that do need executable memory have various choices. They may use the M_EXEC flag to malloc(9), or they may use a different VM interface to obtain executable pages. Now that malloc(9) again allows executable allocations, this change also reverts most of r317072. PR: 228927 Reviewed by: alc, kib, markj, jhb (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15691	2018-06-13 17:04:41 +00:00
Mateusz Guzik	4e180881ae	uma: implement provisional api for per-cpu zones Per-cpu zone allocations are very rarely done compared to regular zones. The intent is to avoid pessimizing the latter case with per-cpu specific code. In particular contrary to the claim in r334824, M_ZERO is sometimes being used for such zones. But the zeroing method is completely different and branching on it in the fast path for regular zones is a waste of time.	2018-06-08 21:40:03 +00:00
Mateusz Guzik	b8af2820f6	uma: fix up r334824 Turns out there is code which ends up passing M_ZERO to counters. Since counters zero unconditionally on their own, just drop the flag in that place.	2018-06-08 05:40:36 +00:00
Mateusz Guzik	ea99223ec9	uma: remove M_ZERO support for pcpu zones Nothing in the tree uses it and pcpu zones have a fundamentally different use case than the regular zones - they are not supposed to be allocated and freed all the time. This reduces pollution in the allocation fast path.	2018-06-08 03:16:16 +00:00
Gleb Smirnoff	c5deaf0452	UMA memory debugging enabled with INVARIANTS consists of two things: trashing freed memory and checking that allocated memory is properly trashed, and also of keeping a bitset of freed items. Trashing/checking creates a lot of CPU cache poisoning, while keeping debugging bitsets consistent creates a lot of contention on UMA zone lock(s). The performance difference between INVARIANTS kernel and normal one is mostly attributed to UMA debugging, rather than to all KASSERT checks in the kernel. Add loader tunable vm.debug.divisor that allows either to turn off UMA debugging completely, or turn it on only for a fraction of allocations, while still running all KASSERTs in kernel. That allows running INVARIANTS kernels in production environments without reducing load by orders of magnitude, but still doing useful extra checks. Default value is 1, meaning debug every allocation. Value of 0 would disable UMA debugging completely. Values above 1 enable debugging only for every N-th item. It isn't possible to strictly follow the number, but still the amount of debugging is reduced by a fraction of roughly (N-1)/N. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15199	2018-06-08 00:15:08 +00:00
Jonathan T. Looney	16e05b3275	Fix a typo in vm_domain_set(). When a domain crosses into the severe range, we need to set the domain bit from the vm_severe_domains bitset (instead of clearing it). Reviewed by: jeff, markj Sponsored by: Netflix, Inc.	2018-06-07 13:29:54 +00:00
Mark Johnston	9f9c9b22ec	Reimplement brk() and sbrk() to avoid the use of _end. Previously, libc.so would initialize its notion of the break address using _end, a special symbol emitted by the static linker following the bss section. Compatibility issues between lld and ld.bfd could cause the wrong definition of _end (libc.so's definition rather than that of the executable) to be used, breaking the brk()/sbrk() interface. Avoid this problem and future interoperability issues by simply not relying on _end. Instead, modify the break() system call to return the kernel's view of the current break address, and have libc initialize its state using an extra syscall upon the first use of the interface. As a side effect, this appears to fix brk()/sbrk() usage in executables run with rtld direct exec, since the kernel and libc.so no longer maintain separate views of the process' break address. PR: 228574 Reviewed by: kib (previous version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D15663	2018-06-04 19:35:15 +00:00
Mark Johnston	27e29d103f	Correct the description of vm_pageout_scan_inactive() after r334508. Reported by: alc	2018-06-04 16:46:36 +00:00
Alan Cox	3e7cb27cdd	Use a single, consistent approach to returning success versus failure in vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix- style "return (0);" to indicate success in the common case, but Mach- style return values in the edge cases. Since KERN_SUCCESS equals zero, the only problem with this inconsistency was stylistic. vm_map_madvise() has exactly two callers in the entire source tree, and only one of them cares about the return value. That caller, kern_madvise(), can be simplified if vm_map_madvise() consistently uses Unix-style return values. Since vm_map_madvise() uses the variable modify_map as a Boolean, make it one. Eliminate a redundant error check from kern_madvise(). Add a comment explaining where the check is performed. Explicitly note that exec_release_args_kva() doesn't care about vm_map_madvise()'s return value. Since MADV_FREE is passed as the behavior, the return value will always be zero. Reviewed by: kib, markj MFC after: 7 days	2018-06-04 16:28:06 +00:00
Justin Hibbits	12f691959f	Align UMA data to 128 byte cacheline size Suggested by: mjg	2018-06-04 15:44:17 +00:00
Mark Johnston	49a3710c89	Remove the "pass" variable from the page daemon control loop. It serves little purpose after r308474 and r329882. As a side effect, the removal fixes a bug in r329882 which caused the page daemon to periodically invoke lowmem handlers even in the absence of memory pressure. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D15491	2018-06-02 00:01:07 +00:00
Konstantin Belousov	633d3b1c71	Only check for MAP_32BIT when available. Reported by: mmacy Sponsored by: The FreeBSD Foundation MFC after: 10 days	2018-06-01 23:50:51 +00:00
Alan Cox	60221a5701	Only a small subset of mmap(2)'s flags should be used in combination with the flag MAP_GUARD. Rather than enumerating the flags that are not allowed, enumerate the flags that are allowed. The list of allowed flags is much shorter and less likely to change. (As an aside, one of the previously enumerated flags, MAP_PREFAULT, was not even a legal flag for mmap(2). However, because of an earlier check within kern_mmap(), this misuse of MAP_PREFAULT was harmless.) Reviewed by: kib MFC after: 10 days	2018-06-01 21:37:42 +00:00
Mark Johnston	6939b4d3b4	Typo. PR: 228533 Submitted by: Jakub Piecuch <j.piecuch96@gmail.com> MFC after: 1 week	2018-05-30 16:48:48 +00:00
Alan Cox	6e1e759c56	Addendum to r334233. In vm_fault_populate(), since the page lock is held, we must use vm_page_xunbusy_maybelocked() rather than vm_page_xunbusy() to unbusy the page. Reviewed by: kib X-MFC with: r334233	2018-05-28 16:23:39 +00:00
Alan Cox	fccdefa1a1	Eliminate duplicate assertions. We assert at the start of vm_fault_hold() that the map entry is wired if the caller passes the flag VM_FAULT_WIRE. Eliminate the same assertion, but spelled differently, at the end of vm_fault_hold() and vm_fault_populate(). Repeat the assertion only if the map is unlocked and the map lookup must be repeated. Reviewed by: kib MFC after: 10 days Differential Revision: https://reviews.freebsd.org/D15582	2018-05-28 04:38:10 +00:00
Alan Cox	70183daa80	Use pmap_enter(..., psind=1) in vm_fault_populate() on amd64. While superpage mappings were already being created by automatic promotion in vm_fault_populate(), this change reduces the cost of creating those mappings. Essentially, one pmap_enter(..., psind=1) call takes the place of 512 pmap_enter(..., psind=0) calls, and that one pmap_enter(..., psind=1) call eliminates the allocation of a page table page. Reviewed by: kib MFC after: 10 days Differential Revision: https://reviews.freebsd.org/D15572	2018-05-26 02:59:34 +00:00
Brooks Davis	7351a8bdb5	Make vadvise compat freebsd11. The vadvise syscall (aka ovadvise) is undocumented and has always been implemented as returning EINVAL. Put the syscall under COMPAT11 and provide a userspace implementation. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15557	2018-05-25 20:40:23 +00:00
Alan Cox	d3f8534e99	Eliminate an unused parameter from vm_fault_populate(). Reviewed by: kib MFC after: 10 days	2018-05-24 20:43:41 +00:00
Mark Johnston	7bb4634e18	Update r334154 with review feedback from D15490. An old revision was committed by accident. Differential Revision: https://reviews.freebsd.org/D15490	2018-05-24 20:26:37 +00:00
Brooks Davis	758d46cfb0	Don't implement break(2) at all on aarch64 and riscv. This should have been done when they were removed from libc, but was overlooked in the runup to 11.0. No users should exist. Approved by: andrew Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15539	2018-05-24 17:04:27 +00:00
Mark Johnston	be37ee791f	Split the active and inactive queue scans into separate subroutines. The scans are largely independent, so this helps make the code marginally neater, and makes it easier to incorporate feedback from the active queue scan into the page daemon control loop. Improve some comments while here. No functional change intended. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D15490	2018-05-24 14:16:22 +00:00
Mark Johnston	a99ee60b9a	Ensure that "m" is initialized in vm_page_alloc_freelist_domain(). While here, remove a superfluous comment. Coverity CID: 1383559 MFC after: 3 days	2018-05-22 16:19:48 +00:00
Mark Johnston	23d123c6cf	Use the canonical check for reservation support.	2018-05-19 23:49:13 +00:00
Mark Johnston	01f04471f4	Don't increment addl_page_shortage for wired pages. Such pages are dequeued as they're encountered during the inactive queue scan, so by the time we get to the active queue scan, they should have already been subtracted from the inactive queue length. Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D15479	2018-05-18 16:59:58 +00:00
Mark Johnston	ba2b3349e1	Fix a race in vm_page_pagequeue_lockptr(). The value of m->queue must be cached after comparing it with PQ_NONE, since it may be concurrently changing. Reported by: glebius Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D15462	2018-05-17 04:27:08 +00:00
Matt Macy	73e37d1deb	Fix powerpc64 LINT vm_object_reserve() == true is impossible on power. Make conditional on VM_LEVEL_0_ORDER being defined. Reviewed by: jeff Approved by: sbruno	2018-05-17 03:19:31 +00:00
Mark Johnston	36f8fe9bbb	Get rid of vm_pageout_page_queued(). vm_page_queue(), added in r333256, generalizes vm_pageout_page_queued(), so use it instead. No functional change intended. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D15402	2018-05-13 13:00:59 +00:00
Mateusz Guzik	782e38aa48	uma: increase alignment to 128 bytes on amd64 Current UMA internals are not suited for efficient operation in multi-socket environments. In particular there is very common use of MAXCPU arrays and other fields which are not always properly aligned and are not local for target threads (apart from the first node of course). Turns out the existing UMA_ALIGN macro can be used to mostly work around the problem until the code gets fixed. The current setting of 64 bytes runs into trouble when adjacent cache line prefetcher gets to work. An example 128-way benchmark doing a lot of malloc/frees has the following instruction samples: before: kernel`lf_advlockasync+0x43b 32940 kernel`malloc+0xe5 42380 kernel`bzero+0x19 47798 kernel`spinlock_exit+0x26 60423 kernel`0xffffffff80 78238 0x0 136947 kernel`uma_zfree_arg+0x46 159594 kernel`uma_zalloc_arg+0x672 180556 kernel`uma_zfree_arg+0x2a 459923 kernel`uma_zalloc_arg+0x5ec 489910 after: kernel`bzero+0xd 46115 kernel`lf_advlockasync+0x25f 46134 kernel`lf_advlockasync+0x38a 49078 kernel`fget_unlocked+0xd1 49942 kernel`lf_advlockasync+0x43b 55392 kernel`copyin+0x4a 56963 kernel`bzero+0x19 81983 kernel`spinlock_exit+0x26 91889 kernel`0xffffffff80 136357 0x0 239424 See the review for more details. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D15346	2018-05-11 07:04:57 +00:00
Mark Johnston	1b5c869d64	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Introduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280	2018-05-04 17:17:30 +00:00
Konstantin Belousov	a7163bb962	Eliminate some vm object relocks in vm fault. For the vm_fault_prefault() call from vm_fault_soft_fast(), extend the scope of the object rlock to avoid re-taking it inside vm_fault_prefault(). It causes pmap_enter_quick() sometimes called with shadow object lock as well as the page lock, but this looks innocent. Noted and measured by: mjg Reviewed by: alc, markj (as part of the larger patch) Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15122	2018-04-29 12:43:08 +00:00
Mateusz Guzik	e825ab8d89	uma: whack main zone counter update in the slow path Cached counters are typically zero at this point so it performs avoidable atomics. Everything reading them also reads the cached ones, thus there is really no point. Reviewed by: jeff	2018-04-27 05:37:35 +00:00
Mateusz Guzik	23e17f83f1	vm: move vm_cnt to __read_mostly now that it is not written to While here whack unused locking keys for the struct. Discussed with: jeff	2018-04-27 05:36:02 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Mark Johnston	7e28037a09	Add a UMA zone flag to disable the use of buckets. This allows the creation of zones which don't do any caching in front of the keg. If the zone is a cache zone, this means that UMA will not attempt any memory allocations when allocating an item from the backend. This is intended for use after a panic by netdump, but likely has other applications. Reviewed by: kib MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15184	2018-04-24 20:05:45 +00:00
Mark Johnston	64b3893010	Initialize marker pages in vm_page_domain_init(). They were previously initialized by the corresponding page daemon threads, but for vmd_inacthead this may be too late if vm_page_deactivate_noreuse() is called during boot. Reported and tested by: cperciva Reviewed by: alc, kib MFC after: 1 week	2018-04-19 14:09:44 +00:00
Mark Johnston	9de8fcfddf	Ensure that m and skip_m belong to the same object. Pages allocated from a given reservation may belong to different objects. It is therefore possible for vm_page_ps_test() to be called with the base page's object unlocked. Check for this case before asserting that the object lock is held. Reported by: jhb Reviewed by: kib MFC after: 1 week	2018-04-17 18:49:17 +00:00
Konstantin Belousov	e55d32b7b3	Handle Skylake-X errata SKZ63. SKZ63 Processor May Hang When Executing Code In an HLE Transaction Region Problem: Under certain conditions, if the processor acquires an HLE (Hardware Lock Elision) lock via the XACQUIRE instruction in the Host Physical Address range between 40000000H and 403FFFFFH, it may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS. Move the pages from the range into the blacklist. Add a tunable to not waste 4M if local DoS is not the issue. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15001	2018-04-07 17:06:13 +00:00
Brooks Davis	6469bdcdb6	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
Mark Johnston	c098768e4d	Ensure the background laundering threshold is positive after a scan. The division added in r331732 meant that we wouldn't attempt a background laundering until at least v_free_target - v_free_min clean pages had been freed by the page daemon since the last laundering. If the inactive queue is depleted but not completely empty (e.g., because it contains busy pages), it can thus take a long time to meet this threshold. Restore the pre-r331732 behaviour of using a non-zero background laundering threshold if at least one inactive queue scan has elapsed since the last attempt at background laundering. Submitted by: tijl (original version)	2018-04-02 15:07:41 +00:00
Gleb Smirnoff	b92b26ad08	Use UMA_SLAB_SPACE macro. No functional change here.	2018-04-02 05:15:25 +00:00
Gleb Smirnoff	96a10340ce	In uma_startup_count() handle special case when zone will fit into single slab, but with alignment adjustment it won't. Again, when there is only one item in a slab alignment can be ignored. See previous revision of this file for more info. PR: 227116	2018-04-02 05:14:31 +00:00
Gleb Smirnoff	1ca6ed4589	Handle a special case when a slab can fit only one allocation, and zone has a large alignment. With alignment taken into account uk_rsize will be greater than space in a slab. However, since we have only one item per slab, it is always naturally aligned. Code that will panic before this change with 4k page: z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0); uma_zalloc(z, M_WAITOK); A practical scenario to hit the panic is a machine with 56 CPUs and 2 NUMA domains, which yields in zone size of 3984. PR: 227116 MFC after: 2 weeks	2018-04-02 05:11:59 +00:00
Jeff Roberson	c33e3a642b	Add a uma cache of free pages in the DEFAULT freepool. This gives us per-cpu alloc and free of pages. The cache is filled with as few trips to the phys allocator as possible by the use of a new vm_phys_alloc_npages() function which allocates as many as N pages. This code was originally by markj with the import function rewritten by me. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14905	2018-04-01 04:50:05 +00:00
Jeff Roberson	e8bb2dc7c9	Add the flag ZONE_NOBUCKETCACHE. This flag instructs UMA not to keep a cache of fully populated buckets. This will be used in a follow-on commit. The flag idea was originally from markj. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-04-01 04:47:05 +00:00
Konstantin Belousov	19ea042eb8	Make vm_map_max/min/pmap KBI stable. There are out of tree consumers of vm_map_min() and vm_map_max(), and I believe there are consumers of vm_map_pmap(), although the latter is arguably less in need of a KBI-stable interface. For the consumers' benefit, make modules using this KPI not dependent on the struct vm_map layout. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14902	2018-03-30 10:55:31 +00:00
Mark Johnston	6068486258	Fix the background laundering mechanism after r329882. Rather than using the number of inactive queue scans as a metric for how many clean pages are being freed by the page daemon, have the page daemon keep a running counter of the number of pages it has freed, and have the laundry thread use that when computing the background laundering threshold. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14884	2018-03-29 14:27:40 +00:00
Jeff Roberson	e5818a53db	Implement several enhancements to NUMA policies. Add a new "interleave" allocation policy which stripes pages across domains with a stride or width keeping contiguity within a multi-page region. Move the kernel to the dedicated numbered cpuset #2 making it possible to assign kernel threads and memory policy separately from user. This also eliminates the need for the complicated interrupt binding code. Add a sysctl API for viewing and manipulating domainsets. Refactor some of the cpuset_t manipulation code using the generic bitset type so that it can be used for both. This probably belongs in a dedicated subr file. Attempt to improve the include situation. Reviewed by: kib Discussed with: jhb (cpuset parts) Tested by: pho (before review feedback) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14839	2018-03-29 02:54:50 +00:00
Jeff Roberson	146bf2c66d	Move vm_ndomains to vm.h where it can be used with a single header include rather than requiring a half-dozen. Many non-vm files may want to know the number of valid domains. Sponsored by: Netflix, Dell/EMC Isilon	2018-03-27 03:27:02 +00:00
Konstantin Belousov	8ec533d336	Allow to specify for vm_fault_quick_hold_pages() that nofault mode should be honored. We must not sleep or acquire any MI VM locks if TDP_NOFAULTING is specified. On the other hand, there were some callers in the tree which set TDP_NOFAULTING for larger scope than needed, I fixed the code which I wrote, but I suspect that linuxkpi and out of tree drm drivers might abuse this still. So only enable the mode for vm_fault_quick_hold_pages() where vm_fault_hold() is not called when specifically asked by user. I decided to use vm_prot_t flag to not change KPI. Since number of flags in vm_prot_t is limited, I reused the same flag which was already consumed for vm_map_lookup(). Reported and tested by: pho (as part of the larger patch) Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14825	2018-03-26 16:31:12 +00:00
Konstantin Belousov	ed9e8bc468	Account the size of the vslock-ed memory by the thread. Assert that all such memory is unwired on return to usermode. The count of the wired memory will be used to detect the copyout mode. Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:51:27 +00:00
Konstantin Belousov	63b5d112b6	For vm_zone_stats() sysctl handler, do not drain sbuf calling copyout(9) while owning zone lock. Despite old value sysctl buffer is wired, spurious faults might still occur. Note that we still own the uma_rwlock there, but this lock does not participate in sensitive lock orders. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:48:53 +00:00
Jeff Roberson	2d3f4181de	Fix two compilation problems on non-amd64 architectures.	2018-03-23 18:24:02 +00:00
Mark Johnston	4046851367	Correct a couple of assertion messages in vm_page_reclaim_run(). MFC after: 3 days	2018-03-23 14:38:56 +00:00
Cy Schubert	72346b2232	Fix build on i386 without INVARIANTS following r331369. --- vm_reserv.o --- In file included from /opt/src/svn-current/sys/vm/vm_reserv.c:48: In file included from /opt/src/svn-current/sys/sys/counter.h:37: ./machine/counter.h:174:3: error: implicit declaration of function 'critical_enter' is invalid in C99 [-Werror,-Wimplicit-function-declaration] critical_enter(); Reviewed by: jeff@	2018-03-23 03:22:30 +00:00
Jeff Roberson	5c930c894d	Lock reservations with a dedicated lock in each reservation. Protect the vmd_free_count with atomics. This allows us to allocate and free from reservations without the free lock except where a superpage is allocated from the physical layer, which is roughly 1/512 of the operations on amd64. Use the counter api to eliminate cache contention on counters. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14707	2018-03-22 19:21:11 +00:00
Jeff Roberson	9a4b4cd3bc	Start witness much earlier in boot so that we can shrink the pend list and make it more immune to further change. Reviewed by: markj, imp (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon	2018-03-22 19:11:43 +00:00
Jeff Roberson	cdfeced8ff	Use read_mostly and alignment tags to eliminate or limit false sharing. Reviewed by: markj (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon	2018-03-22 19:06:50 +00:00
Konstantin Belousov	79e9552ebb	Check for wrap-around in vm_phys_alloc_seg_contig(). It is possible to provide insane values for size in contigmalloc(9) request, which usually does not reach the phys allocator due to failing KVA allocation. But with the forthcoming 4/4 i386, where 32bit architecture has almost 4G KVA, contigmalloc(1G) is not unreasonable outright and KVA might be available sometimes. Then, the calculation of pa_end could wrap around, depending on the physical address, and the checks in vm_phys_alloc_seg_contig() would pass while the iteration in the loop after the 'done' label goes out of the vm_page_array bounds. Fix it by detecting the wrap. Reported and tested by: pho Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14767	2018-03-20 16:17:55 +00:00
Mark Johnston	c6a70eaea8	Avoid dequeuing the fault page during a soft fault. Such pages are re-enqueued at the end of the fault handler, preserving LRU. Rather than performing two separate operations per fault, simply requeue the page at the end of the fault (or bump its activation count if it resides in PQ_ACTIVE, avoiding the page queue lock entirely). This elides some page lock and page queue lock operations in common cases, e.g., CoW faults. Note that we must still dequeue the source page for "optimized" CoW faults since the page may not remain enqueued while it is moved to another object. Reviewed by: alc, kib Tested by: pho MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:49:30 +00:00
Mark Johnston	0eb50f9cd2	Have vm_page_{deactivate,launder}() requeue already-queued pages. In many cases the page is not enqueued so the change will have no effect. However, the change is needed to support an optimization in the fault handler and in some cases (sendfile, the buffer cache) it was being emulated by the caller anyway. Reviewed by: alc Tested by: pho MFC after: 2 weeks X-Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:40:56 +00:00
Mark Johnston	434862acb1	Have vm_page_replace() assert that the new page is not enqueued. The new page does not belong to a VM object, but the page daemon does not expect to encounter such pages. Reviewed by: alc, kib Tested by: pho MFC after: 1 week X-Differential Revision: https://reviews.freebsd.org/D14625	2018-03-18 16:35:40 +00:00
Conrad Meyer	5d3b36666b	Fix GCC build: Remove redundant pagedaemon_wakeup declaration Introduced in r331018. Reported by: kevans Sponsored by: Dell EMC Isilon	2018-03-16 07:05:09 +00:00
Jeff Roberson	30fbfdda6c	Eliminate pageout wakeup races. Take another step towards lockless vmd_free_count manipulation. Reduce the scope of the free lock by using a pageout lock to synchronize sleep and wakeup. Only trigger the pageout daemon on transitions between states. Drive all wakeup operations directly as side-effects from freeing memory rather than requiring an additional function call. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14612	2018-03-15 19:23:07 +00:00
Konstantin Belousov	741e1c9196	Revert the chunk from r330410 in vm_page_reclaim_run(). There, the pages freed might be managed but the page's lock is not owned. For KPI correctness, the page lock is required around the call to vm_page_free_prep(), which is asserted. Reclaim loop already did the work which could be done by vm_page_free_prep(), so the lock is not needed and the only consequence of not owning it is the assert trigger. Instead of adding the locking to satisfy the assert, revert to the code that calls vm_page_free_phys() directly. Reported by: pho Discussed with: jeff Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-13 18:27:23 +00:00
Jeff Roberson	f4af595964	Don't assert that the domain free lock is held until we're certain that there is a valid reservation. This can trip erroneously when memory falls within a domain but doesn't have the reservation initialized because it does not meet size or alignment requirements. Reported by: pho, mjg Sponsored by: Netflix, Dell/EMC Isilon	2018-03-07 22:04:27 +00:00
Konstantin Belousov	2a8e8f7892	Remove redundant test from r330410. If the input slist is non-empty, counter cannot be zero after freeing. Noted by: mjg MFC after: 2 weeks	2018-03-04 21:15:31 +00:00
Konstantin Belousov	8c8ee2ee1c	Unify bulk free operations in several pmaps. Submitted by: Yoshihiro Ota Reviewed by: markj MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13485	2018-03-04 20:53:20 +00:00
Mark Johnston	3b8cf4acf0	Give the 0th domain's page daemon thread a consistent name. Page daemon threads for other domains show up in ps(1) output as "pagedaemon/domN", so let that be the case for domain 0 as well. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14518	2018-02-27 16:51:09 +00:00
Mark Johnston	59d3150b58	Restore the pre-r329882 inactive page shortage computation. With r329882, in the absence of a free page shortage we would only take len(PQ_INACTIVE)+len(PQ_LAUNDRY) into account when deciding whether to aggressively scan PQ_ACTIVE. Previously we would also include the number of free pages in this computation, ensuring that we wouldn't scan PQ_ACTIVE with plenty of free memory available. The change in behaviour was most noticeable immediately after booting, when PQ_INACTIVE and PQ_LAUNDRY are nearly empty. Reviewed by: jeff	2018-02-24 20:47:22 +00:00
Konstantin Belousov	cd84455f91	Hide all vm/vm_pageout.h content under #ifdef _KERNEL. There are no parts useful for usermode applications in vm/vm_pageout.h. Even for the specific applications like fstat and lsof. In my opinion, this protection is redundant and instead userspace should not include the header at all. Since there are apparently broken third party codebases, give them a bit of slack by providing transitional period. Reported by: julian Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-24 10:26:26 +00:00
Mark Johnston	5f70fb1425	Correct some comments after r328954. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14486	2018-02-23 23:27:53 +00:00
Mark Johnston	9140bff7ed	Remove a bogus assertion from vm_page_launder(). After r328977, a wired page m may have m->queue != PQ_NONE. Reviewed by: kib X-MFC with: r328977 Differential Revision: https://reviews.freebsd.org/D14485	2018-02-23 23:25:22 +00:00
Jeff Roberson	5f8cd1c0bf	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402	2018-02-23 22:51:51 +00:00
Konstantin Belousov	2c0f13aa59	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread's policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Sketched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384	2018-02-20 10:13:13 +00:00
Mark Johnston	3f060b60b1	Use the conventional name for an array of pages. No functional change intended. Discussed with: kib MFC after: 3 days	2018-02-16 15:38:22 +00:00
Konstantin Belousov	ada27a3bb8	Cleanup unused page argument for vm_reserv_break(). Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14364	2018-02-14 00:34:02 +00:00
Konstantin Belousov	d929ad7f91	Ensure memory consistency on COW. From the submitter description: The process is forked transitioning a map entry to COW Thread A writes to a page on the map entry, faults, updates the pmap to writable at a new phys addr, and starts TLB invalidations... Thread B acquires a lock, writes to a location on the new phys addr, and releases the lock Thread C acquires the lock, reads from the location on the old phys addr... Thread A ...continues the TLB invalidations which are completed Thread C ...reads from the location on the new phys addr, and releases the lock In this example Thread B and C [lock, use and unlock] properly and neither own the lock at the same time. Thread A was writing somewhere else on the page and so never had/needed the lock. Thread C sees a location that is only ever read|modified under a lock change beneath it while it is the lock owner. To fix this, perform the two-stage update of the copied PTE. First, the PTE is updated with the address of the new physical page with copied content, but in read-only mode. The pmap locking and the page busy state during PTE update and TLB invalidation IPIs ensure that any writer to the page cannot upgrade the PTE to the writable state until all CPUs updated their TLB to not cache old mapping. Then, after the busy state of the page is lifted, the faults for write can proceed and do not violate the consistency of the reads. The change is done in vm_fault because most architectures do need IPIs to invalidate remote TLBs. More, I think that hardware guarantees of atomicity of the remote TLB invalidation are not enough to prevent the inconsistent reads of non-atomic reads, like multi-word accesses protected by a lock. So instead of modifying each pmap invalidation code, I did it there. Discovered and analyzed by: Elliott.Rabe@dell.com Reviewed by: markj PR: 225584 (appeared to have the same cause) Tested by: Elliott.Rabe@dell.com, emaste, Mike Tancsa <mike@sentex.net>, truckman Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14347	2018-02-14 00:31:45 +00:00
Konstantin Belousov	607970bc8e	Do not call pmap_enter() with invalid protection mode. If the map entry elookup was performed due to the mapping changes, we need to ensure that there is still some access permission bit requested which is compatible with the current vm_map_entry mode. If not, restart the handler from scratch instead of trying to save the current progress. Also adjust fault_type to not include cleared permission bits. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14347	2018-02-14 00:25:18 +00:00
Konstantin Belousov	c4be9169c0	Do not leak rv->psind in some specific situations. Suppose that we have an object with a mapped superpage, and that all pages in the superpages are held (by some driver). Additionally, suppose that the object is terminated, e.g. because the only process mapping it is exiting. Then the reservation is broken, but the pages cannot be freed until later, when they are unheld. In this situation, the reservation code cannot clean psind, since no pages are freed, and the page is freed and then reused with invalid psind. Clean psind on vm_reserv_break() to avoid the situation. Reported and tested by: Slava Shwartsman Reviewed by: markj Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14335	2018-02-13 15:36:28 +00:00
Jeff Roberson	e958ad4cf3	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Gleb Smirnoff	5073a08328	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
Mark Johnston	1d3a1bcfac	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943	2018-02-07 16:57:10 +00:00
Gleb Smirnoff	d2be4a1e4f	Use correct arithmetic to calculate how many pages we need for kegs and hashes. There is no functional change with current sizes.	2018-02-06 22:13:40 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Gleb Smirnoff	1616767dfc	Improve DIAGNOSTIC printf. Report using a boot page every time regardless of booted status.	2018-02-06 22:08:43 +00:00
Gleb Smirnoff	ae941b1b4e	Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC. o Call uma_startup1() after initializing kmem, vmem and domains. o Include eight VM startup pages into uma_startup_count() calculation. o Account for vmem_startup() and vm_map_startup() preallocating pages. o Account for extra two allocations done by kmem_init() and vmem_create(). o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT allowed several other SYSINITs to sneak in before it, thus bumping requirement for amount of boot pages.	2018-02-06 22:06:59 +00:00
Mark Johnston	4d2653522d	Delete a declaration for a variable removed in r305362.	2018-02-06 17:26:11 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important, the number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The function calculates sizes of the zones zone and the kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only zone structures, but also kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
Konstantin Belousov	20e4afbfbb	On munlock(), unwire correct page. It is possible, for complex fork()/collapse situations, to have sibling address spaces to partially share shadow chains. If one sibling performs wiring, it can happen that a transient page, invalid and busy, is installed into a shadow object which is visible to other sibling for the duration of vm_fault_hold(). When the backing object contains the valid page, and the wiring is performed on read-only entry, the transient page is eventually removed. But the sibling which observed the transient page might perform the unwire, executing vm_object_unwire(). There, the first page found in the shadow chain is considered as the page that was wired for the mapping. It is really the page below it which is wired. So we unwire the wrong page, either triggering the asserts or breaking the page's wire counter. As the fix, wait for the busy state to finish if we find such page during unwire, and restart the shadow chain walk after the sleep. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14184	2018-02-05 12:49:20 +00:00
Konstantin Belousov	938cdc4264	On pageout, in vnode generic pager, for partially dirty page, only clear dirty bits for completely invalid blocks. Otherwise we might not write out the last chunk that is shorter than 512 bytes, if the file end is not aligned on disk block boundary. This became important after r324794. PR: 225586 Reported by: tris_vern@hotmail.com Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-02-02 11:56:30 +00:00
Konstantin Belousov	1c5196c3ed	Assign map->header values to avoid boundary checks. In several places, entry start and end field are checked, after excluding the possibility that the entry is map->header. By assigning max and min values to the start and end fields of map->header in vm_map_init, the explicit map->header checks become unnecessary. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: alc, kib, markj (previous version) Tested by: pho (previous version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13735	2018-01-20 12:19:02 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
Jeff Roberson	b6715dab8f	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
Jeff Roberson	6f4acaf4c9	Add support for NUMA domains to bus dma tags. This causes all memory allocated with a tag to come from the specified domain if it meets the other constraints provided by the tag. Automatically create a tag at the root of each bus specifying the domain local to that bus if available. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13545	2018-01-12 23:34:16 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Jeff Roberson	7a469c8ef3	Implement NUMA policy for kmem_*(9). This maintains compatibility with reservations by giving each memory domain its own KVA space in vmem that is naturally aligned on superpage boundaries. Reviewed by: alc, markj, kib (some objections) Sponsored by: Netflix, Dell/EMC Isilon Tested by: pho Differential Revision: https://reviews.freebsd.org/D13289	2018-01-12 23:13:55 +00:00
Jeff Roberson	7b11a48326	Add files for r327895 Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:57:57 +00:00
Jeff Roberson	3f289c3fcf	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
Ed Maste	d03890153d	ANSIfy function definitions in sys/vm/	2018-01-12 03:50:44 +00:00
Konstantin Belousov	33937731e7	Restructure swapout tests after vm map locking was removed. Consolidate the regions covered by the process lock. Combine similar condition tests into one, e.g. all process flags can be tested with one logical operation. Add check for in-exec state, since p_vmspace is dereferenced. Remove labels and goto by explicitly tracking state. Update comments. Reviewed by: alc, markj (previous version) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13693	2018-01-04 18:14:58 +00:00
Alan Cox	36ca312db5	Once we have decided to swap out a process, don't delay the laundering of its per-thread kernel stack pages by making them pass through the inactive queue first. Instead, immediately place them in the laundry so that they might be cleaned and made available for reclamation sooner. Reviewed by: kib, markj MFC after: 1 week	2018-01-04 03:16:32 +00:00
Jeff Roberson	ad5b0f5b51	Fix arc after r326347 broke various memory limit queries. Use UMA features rather than kmem arena size to determine available memory. Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before the real limit is set. PR: 224330 (partial), 224080 Reviewed by: markj, avg Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13494	2018-01-02 04:35:56 +00:00
Konstantin Belousov	9997c0481c	Do not let vm_daemon run unbounded. On a load where single anonymous object consumes almost all memory on the large system, swapout code executes the iteration over the corresponding object page queue for long time, owning the map and object locks. This blocks pagedaemon which tries to lock the object, and blocks other threads in the process in vm_fault() waiting for the map lock. Handle the issue by terminating the deactivation loop if we executed too long and by yielding at the top level in vm_daemon. Reported by: peterj, pho Reviewed by: alc Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13671	2018-01-01 19:27:33 +00:00
Alan Cox	7000c58871	The variable "minslptime" is pointless and always has been, ever since its introduction in r83366. (At that time, this code appeared in vm/vm_glue.c, because vm/vm_swapout.c did not exist.) When the FOREACH_THREAD loop completes, we know that the sleep time for every thread is above whichever threshold is being applied. Reviewed by: kib X-MFC with: r327354	2017-12-31 21:36:42 +00:00
Alan Cox	4abca9bb05	Previously, swap_pager_copy() freed swap blocks one at a time, via swp_pager_meta_ctl(), with no opportunity to recognize freeing of consecutive blocks and free fewer block ranges. To open that opportunity, this change removes the SWM_FREE option from swp_pager_meta_ctl(), and compels the caller to do the freeing when a valid block address is returned. In swap_pager_copy(), these frees are aggregated, so that a sequence of them can be done at one time. The only other caller to swp_pager_meta_ctl() that passed SWM_FREE, swp_pager_unswapped(), is also modified to handle its single free explicitly. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib (an earlier version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13290	2017-12-31 04:01:47 +00:00
Konstantin Belousov	b6eabc36ba	Do not lock vm map in swapout_procs(). Neither swapout_procs() nor swapout() access the map. Since the process' vmspace is referenced only to obtain the pointer to the vm_map, the reference is not needed as well. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13681	2017-12-29 20:33:56 +00:00
Konstantin Belousov	e258b4a0dc	Style. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13678	2017-12-29 19:05:07 +00:00
Alan Cox	5c515efc88	After r327168, the variable "vm_pageout_wanted" can be static. MFC after: 2 weeks	2017-12-29 17:02:22 +00:00
Konstantin Belousov	89c0e67db5	Clean up the comment. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13671	2017-12-28 23:50:21 +00:00
Konstantin Belousov	0080a8fa95	In vm_swapout_map_deactivate_pages(), it is enough to lock the map for read. Reviewed by: alc, markj (as part of the larger patch) Tested by: pho (again, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13671	2017-12-28 22:56:30 +00:00
Alan Cox	fec296887f	Refactor vm_map_find(), creating a separate function, vm_map_alignspace(), for finding aligned free space in the given map. With this change, we always return KERN_NO_SPACE when we fail to find free space. Whereas, previously, we might return KERN_INVALID_ADDRESS. Also, with this change, we explicitly check for address wrap, rather than relying upon the map's min and max addresses to establish sentinel-like regions. This refactoring was inspired by the problem that we addressed in r326098. Reviewed by: kib Tested by: pho Discussed with: markj MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D13346	2017-12-26 17:59:37 +00:00
Mark Johnston	65ef323137	Ensure that pass > 0 when starting a scan with vm_pages_needed == 1. Otherwise the page daemon will not reclaim pages and thus will not wake threads sleeping in VM_WAIT. Reported and tested by: pho Reviewed by: alc, kib X-MFC with: r327168 Differential Revision: https://reviews.freebsd.org/D13640	2017-12-26 16:29:39 +00:00
Alan Cox	115423761e	Make the vm object bypass and collapse counters per CPU. Requested by: mjg Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13611	2017-12-25 19:36:04 +00:00
Mark Johnston	280d15cd0a	Fix two problems with the page daemon control loop. Both issues caused the page daemon to erroneously go to sleep when applications are consuming free pages at a high rate, leaving the application threads blocked in VM_WAIT. 1) After completing an inactive queue scan, concurrent allocations may have prevented the page daemon from meeting the v_free_min threshold. In this case, the page daemon was going to sleep even when the inactive queue contained plenty of clean pages. 2) pagedaemon_wakeup() may be called without the free queues lock held. This can lead to a lost wakeup if a call occurs after the page daemon clears vm_pageout_wanted but before going to sleep. Fix 1) by ensuring that we start a new inactive queue scan immediately if v_free_count < v_free_min after a prior scan. Fix 2) by adding a new subroutine, pagedaemon_wait(), called from vm_wait() and vm_waitpfault(). It wakes up the page daemon if either vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps on v_free_count. Reported by: jeff Reviewed by: alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13424	2017-12-24 19:45:16 +00:00
Konstantin Belousov	200f8117ba	Perform all accesses to uma_reclaim_needed using atomic(9) KPI. Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 10:06:55 +00:00
Mark Johnston	cb35676e66	Use a dedicated counter for inactive queue scans. The laundry thread keeps track of the number of inactive queue scans performed by the page daemon, and was previously using the v_pdwakeups counter to count them. However, in some cases the inactive queue may be scanned multiple times after a single wakeup, so it's more accurate to use a dedicated counter. Reviewed by: alc, kib (previous version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13422	2017-12-11 15:33:24 +00:00
Mark Johnston	82e2d06a27	Fix the act_scan_laundry_weight mechanism. r292392 modified the active queue scan to weigh clean pages differently from dirty pages when attempting to meet the inactive queue target. When r306706 was merged into the PQ_LAUNDRY branch, this mechanism was broken. Fix it by scaling the correct page shortage variable. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13423	2017-12-09 15:47:26 +00:00
Mark Johnston	952a29c04b	Fix the UMA reclaim worker after r326347. atomic_set_*() sets a bit in the target memory location, so atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like it does. PR: 224080 Reviewed by: jeff, kib Differential Revision: https://reviews.freebsd.org/D13412	2017-12-07 19:38:09 +00:00
Mark Johnston	6eebec8343	Use unique wait messages in the page daemon control loop. Discussed with: alc MFC after: 1 week	2017-12-06 18:36:54 +00:00
Andrew Turner	5be9377857	Print the correct value when freelist is out of range. Security: : Sponsored by: DARPA, AFRL	2017-12-04 11:16:51 +00:00
Michael Zhilin	0db2102aaa	[mips] [vm] restore translation of freelist to flind for page allocation Commit r326346 moved domain iterators from physical layer to vm_page one, but it also removed translation of freelist to flind for vm_page_alloc_freelist() call. Before that commit the call expected a VM_FREELIST_ parameter, but after it the call expects a freelist index. On small WiFi boxes with few megabytes of RAM, there is only one freelist VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file sys/mips/include/vmparam.h). It results in freelist 1 with flind 0. At first, this commit renames flind to freelist in vm_page_alloc_freelist to avoid misunderstanding about input parameters. Then on physical layer it restores translation for correct handling of freelist parameter. Reported by: landonf Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D13351	2017-12-04 08:08:55 +00:00
Konstantin Belousov	e8502826ce	Add comment for vm_map_find_min(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 days X-Differential revision: https://reviews.freebsd.org/D13155	2017-12-01 10:53:08 +00:00
Pedro F. Giffuni	796df753f4	SPDX: Consider code from Carnegie-Mellon University. Interesting cases, most likely from CMU Mach sources.	2017-11-30 15:48:35 +00:00
Pedro F. Giffuni	cf3329887e	SPDX: wrong license.	2017-11-30 15:45:42 +00:00
Mark Johnston	57cd81a357	Verify the object/vnode association after vget() in vm_pageout_clean(). It's theoretically possible for the vnode and object to be disassociated while locks are dropped around the vget() call, in which case we shouldn't proceed with laundering. Noted and reviewed by: kib MFC after: 1 week	2017-11-29 19:47:09 +00:00
Mark Johnston	1084894f80	Remove some comments that became incorrect with r325530.	2017-11-29 14:34:05 +00:00
Jeff Roberson	2e47807c21	Eliminate kmem_arena and kmem_object in preparation for further NUMA commits. The arena argument to kmem_*() is now only used in an assert. A follow-up commit will remove the argument altogether before we freeze the API for the next release. This replaces the hard limit on kmem size with a soft limit imposed by UMA. When the soft limit is exceeded we periodically wakeup the UMA reclaim thread to attempt to shrink KVA. On 32bit architectures this should behave much more gracefully as we exhaust KVA. On 64bit the limits are likely never hit. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13187	2017-11-28 23:40:54 +00:00
Jeff Roberson	ef435ae7de	Move domain iterators into the page layer where domain selection should take place. This makes the majority of the phys layer explicitly domain specific. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix & Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13014	2017-11-28 23:18:35 +00:00
Alan Cox	230869e051	When the swap pager allocates space on disk, it requests contiguous blocks in a single call to blist_alloc(). However, when it frees that space, it previously called blist_free() on each block, one at a time. With this change, the swap pager identifies ranges of contiguous blocks to be freed, and calls blist_free() once per range. In one extreme case, that is described in the review, the time to perform an munmap(2) was reduced by 55%. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12397	2017-11-28 17:46:03 +00:00
Mark Johnston	d2b677cef6	Avoid unnecessary lookups when initializing the vm_page array. This gives a marginal improvement in the vm_page_array initialization time. Also garbage-collect the now-unused vm_phys_paddr_to_segind(). Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13270	2017-11-27 17:46:38 +00:00
Pedro F. Giffuni	fe267a5590	sys: general adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. No functional change intended.	2017-11-27 15:23:17 +00:00
Mark Johnston	b20bf182e6	Move vm_phys_init_page() to vm_page.c. Suggested by: kib Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13250	2017-11-26 19:17:55 +00:00
Mark Johnston	830cb6b2b6	Remove unneeded initializations from vm_phys_init_page(). The page allocator always initializes the aflags and oflags fields. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13242	2017-11-26 19:16:45 +00:00
Konstantin Belousov	9410cd7d9e	Return different error code for the guard page layout violation. On KERN_NO_SPACE error, as it is returned now, vm_map_find() continues the loop searching for the suitable range for the requested mapping with specific alignment. Since the vm_map_findspace() successfully finds the same place, the loop never ends. The errors returned from vm_map_stack() completely repeat the behavior of vm_map_insert() now, as suggested by Alan. Reported by: Arto Pekkanen <aksyom@gmail.com> PR: 223732 Reviewed by: alc, markj Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D13186	2017-11-22 16:45:27 +00:00
Alan Cox	4d572bb3ed	When vm_map_find(find_space = VMFS_OPTIMAL_SPACE) fails to find space, a second scan of the address space with find_space = VMFS_ANY_SPACE is performed. Previously, vm_map_find() released and reacquired the map lock between the first and second scans. However, there is no compelling reason to do so. This revision modifies vm_map_find() to retain the map lock. Reviewed by: jhb, kib, markj MFC after: 1 week X-Differential Revision: https://reviews.freebsd.org/D13155	2017-11-22 16:39:24 +00:00
Mark Johnston	5070d56d41	Allow for fictitious physical pages in vm_page_scan_contig(). Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter fictitious pages. For now, allow for this possibility; such pages will be skipped later in the scan since they are wired. Reported by: avg Reviewed by: kib MFC after: 1 week	2017-11-21 13:17:40 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Pedro F. Giffuni	df57947f08	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
Konstantin Belousov	1c778d91b5	vmtotal: extend memory counters to accommodate current and future hardware sizes. 32bit counters already overflow on approachable virtual memory page counts, and soon would overflow on the physical pages counts as well. Bump sizes to 64bit types. Bump __FreeBSD_version. It is impossible to provide perfect backward ABI compat for this change. If a program requests an old structure, it can be detected by size. But if it queries the size first by passing NULL old req pointer, there is almost nothing we can do to detect the desired ABI. As a partial solution, check p_osrel of the querying process when selecting the size to report. Submitted by: Pawel Biernacki <pawel.biernacki@gmail.com> Differential revision: https://reviews.freebsd.org/D13018	2017-11-15 13:41:03 +00:00
Konstantin Belousov	772c8b6749	Fix operator priority. Sponsored by: The FreeBSD Foundation	2017-11-08 23:25:05 +00:00
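
Commit 952a29c04b in the list above hinges on the difference between setting bits and storing a value: its message notes that atomic_set_*() sets a bit in the target, so passing 0 changes nothing. The following is a minimal userspace sketch of that pitfall, assuming C11 `<stdatomic.h>` as a stand-in for the kernel's atomic(9) KPI. The helpers `set_bits`, `store_value`, `reset_with_set`, and `reset_with_store` are hypothetical names introduced only for illustration; this is not the code from the commit.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of "set" semantics: OR the mask into *p. */
static void set_bits(atomic_uint *p, unsigned mask)
{
    atomic_fetch_or(p, mask);
}

/* Hypothetical model of a plain atomic store: overwrite *p with v. */
static void store_value(atomic_uint *p, unsigned v)
{
    atomic_store(p, v);
}

/* Try to reset a flag by "setting" 0; returns the flag's final value. */
unsigned reset_with_set(unsigned initial)
{
    atomic_uint flag;

    atomic_init(&flag, initial);
    set_bits(&flag, 0);         /* ORing in 0 leaves the flag unchanged */
    return atomic_load(&flag);
}

/* Reset a flag by storing 0; returns the flag's final value. */
unsigned reset_with_store(unsigned initial)
{
    atomic_uint flag;

    atomic_init(&flag, initial);
    store_value(&flag, 0);      /* overwrites whatever was there */
    return atomic_load(&flag);
}

/* Small self-check of the two behaviours. */
void check_semantics(void)
{
    assert(reset_with_set(1) == 1);
    assert(reset_with_store(1) == 0);
}
```

Under these assumptions, a flag that was 1 is still 1 after "setting" 0, while a store clears it, which matches the failure mode described in the commit message.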

4061 Commits