freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	17fbf3cf34	Add a !NUMA definition for vm_domainset_iter_policy_ref_init(). Pointy hat: markj X-MFC with: r339661 Sponsored by: The FreeBSD Foundation	2018-10-24 17:09:20 +00:00
Mark Johnston	7571e24901	Add an #include required after r339686. X-MFC with: r339686 Sponsored by: The FreeBSD Foundation	2018-10-24 16:49:16 +00:00
Mark Johnston	194a979ee9	Use a vm_domainset iterator in keg_fetch_slab(). Previously, it used a hand-rolled round-robin iterator. This meant that the minskip logic in r338507 didn't apply to UMA allocations, and also meant that we would call vm_wait() for individual domains rather than permitting an allocation from any domain with sufficient free pages. Discussed with: jeff Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17420	2018-10-24 16:41:47 +00:00
Mark Johnston	87ab1a10b1	Initialize static domainsets regardless of whether an SRAT is present. Reported by: yuripv X-MFC with: r339452 Sponsored by: The FreeBSD Foundation	2018-10-23 18:07:16 +00:00
Mark Johnston	4c29d2de67	Refactor domainset iterators for use by malloc(9) and UMA. Before this change we had two flavours of vm_domainset iterators: "page" and "malloc". The latter was only used for kmem_*() and hard-coded its behaviour based on kernel_object's policy. Moreover, its use contained a race similar to that fixed by r338755 since the kernel_object's iterator was being run without the object lock. In some cases it is useful to be able to explicitly specify a policy (domainset) or policy+iterator (domainset_ref) when performing memory allocations. To that end, refactor the vm_domainset_* KPI to permit this, and get rid of the "malloc" domainset_iter KPI in the process. Reviewed by: jeff (previous version) Tested by: pho (part of a larger patch) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17417	2018-10-23 16:35:58 +00:00
Mark Johnston	b61f314290	Make it possible to disable NUMA support with a tunable. This provides a chicken switch for anyone negatively impacted by enabling NUMA in the amd64 GENERIC kernel configuration. With NUMA disabled at boot-time, information about the NUMA topology is not exposed to the rest of the kernel, and all of physical memory is viewed as coming from a single domain. This method still has some performance overhead relative to disabling NUMA support at compile time. PR: 231460 Reviewed by: alc, gallatin, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17439	2018-10-22 20:13:51 +00:00
Mark Johnston	2801dd08d7	Fix the build after r339601. I committed some patches out of order and didn't build-test one of them. Reported by: Jenkins, O. Hartmann <ohartmann@walstatt.org> X-MFC with: r339601	2018-10-22 17:19:48 +00:00
Mark Johnston	2a843ae7d9	Avoid a redundancy in a comment updated by r339601. Reported by: alc X-MFC with: r339601	2018-10-22 17:17:30 +00:00
Mark Johnston	b00581965d	Swap in processes unless there's a global memory shortage. On NUMA systems, we would not swap in processes unless all domains had some free pages. This is too conservative in general. Instead, permit swapins so long as at least one domain has free pages, and add a kernel stack NUMA policy which ensures that we will try to allocate kernel stack pages from any domain. Reported and tested by: pho, Jan Bramkamp <crest@bultmann.eu> Reviewed by: alc, kib Discussed with: jeff MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17304	2018-10-22 17:04:04 +00:00
Gleb Smirnoff	81c0d72c60	If we lost a race or were migrated during bucket allocation for the per-CPU cache, then we put the new bucket on the generic bucket cache. However, the code didn't honor the UMA_ZONE_NOBUCKETCACHE flag, so potentially we could start a cache on a zone that clearly forbids that. Fix this. Reviewed by: markj	2018-10-22 15:48:07 +00:00
Konstantin Belousov	17afd2beec	Unindent vm_map_simplify_entry() after r339506. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17632	2018-10-21 00:11:56 +00:00
Konstantin Belousov	074244628b	Reduce code duplication in merging vm_entry neighbors. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17610	2018-10-20 23:08:04 +00:00
Mark Johnston	662e7fa8d9	Create some global domainsets and refactor NUMA registration. Pre-defined policies are useful when integrating the domainset(9) policy machinery into various kernel memory allocators. The refactoring will make it easier to add NUMA support for other architectures. No functional change intended. Reviewed by: alc, gallatin, jeff, kib Tested by: pho (part of a larger patch) MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17416	2018-10-20 17:36:00 +00:00
Matt Macy	e8bb589d56	Eliminate locking surrounding ui_vmsize and swap reserve by using atomics. Change swap_reserve and swap_total to be in units of pages so that swap reservations can be done using only atomics, instead of using a single global mutex for swap_reserve and a single mutex for all processes running under the same uid for uid accounting. This results in an mmap speedup and a 70% increase in brk calls per second. Reviewed by: alc@, markj@, kib@ Approved by: re (delphij@) Differential Revision: https://reviews.freebsd.org/D16273	2018-10-05 05:50:56 +00:00
Mark Johnston	93db904d19	Use an unsigned iterator for domain sets. Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range [0, MAXMEMDOM). Reported by: pho Reviewed by: kib Approved by: re (rgrimes) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17374	2018-10-01 18:51:39 +00:00
Andrew Gallatin	30c5525b3c	Allow empty NUMA memory domains to support Threadripper2. The AMD Threadripper 2990WX is basically a slightly crippled Epyc. Rather than having 4 memory controllers, one per NUMA domain, it has only 2 memory controllers enabled. This means that only 2 of the 4 NUMA domains can be populated with physical memory, and the others are empty. Add support to FreeBSD for empty NUMA domains by: - creating empty memory domains when parsing the SRAT table, rather than failing to parse the table - not running the pageout daemon threads in empty domains - adding defensive code to UMA to avoid allocating from empty domains - adding defensive code to cpuset to avoid binding to an empty domain Thanks to Jeff for suggesting this strategy. Reviewed by: alc, markj Approved by: re (gjb@) Differential Revision: https://reviews.freebsd.org/D1683	2018-10-01 14:14:21 +00:00
Konstantin Belousov	c62637d679	Correct vm_fault_copy_entry() handling of backing file truncation after the file mapping was wired. If a wired map entry is backed by a vnode and the file is truncated, the corresponding pages are invalidated. vm_fault_copy_entry() should be aware of this and allow for invalid pages past the end of the file. Also, such pages should not be mapped into userspace. If userspace later accesses the truncated part of the mapping, it gets a signal; there is no way the kernel can prevent the page fault. Reported by: andrew using syzkaller Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17323	2018-09-28 14:11:38 +00:00
Konstantin Belousov	9f25ab83f9	In vm_fault_copy_entry(), we should not assert that the entry is charged if the dst_object is not of swap type. This can only happen when the entry does not require copy; otherwise vm_map_protect() already adds the charge. So the assert was right for the case where the swap object was allocated in vm_fault_copy_entry(), but not when it was just copied from src_entry and its type is not swap. Reported by: andrew using syzkaller Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17323	2018-09-28 14:11:01 +00:00
Konstantin Belousov	a60d3db15e	In vm_fault_copy_entry(), collect the code to initialize a newly allocated dst_object in a single place. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17323	2018-09-28 14:10:12 +00:00
Mark Johnston	463406ac4a	Add more NUMA-specific low memory predicates. Use these predicates instead of inline references to vm_min_domains. Also add a global all_domains set, akin to all_cpus. Reviewed by: alc, jeff, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17278	2018-09-24 19:24:17 +00:00
Alan Cox	f5fbe90de4	Passing UMA_ZONE_NOFREE to uma_zcreate() for swpctrie_zone and swblk_zone is redundant, because uma_zone_reserve_kva() is performed on both zones and it sets this same flag on the zone. (Moreover, the implementation of the swap pager does not itself require these zones to be UMA_ZONE_NOFREE.) Reviewed by: kib, markj Approved by: re (gjb) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17296	2018-09-24 16:49:02 +00:00
Mark Johnston	3d14a7bb43	Ensure that "domain" is initialized when vm_ndomains == 1. Reported by: alc Approved by: re (gjb)	2018-09-24 15:32:46 +00:00
Mark Johnston	969e147aff	Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned. The old code appears to assume that vmem_alloc() would import size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't provide this guarantee. Also remove the unused global RWX arena and add comments explaining why we have per-domain arenas. Reported by: alc Reviewed by: alc, kib (previous version) Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17249	2018-09-20 18:29:55 +00:00
Mark Johnston	25ed23cfbb	Change the domain selection policy in kmem_back(). Ensure that pages backing the same virtual large page come from the same physical domain, as kmem_malloc_domain() does. PR: 231038 Reviewed by: alc, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17248	2018-09-20 15:45:12 +00:00
Mark Johnston	1aed6d48a8	Move kernel vmem arena initialization to vm_kern.c. This keeps the initialization coupled together with the kmem_* KPI implementation, which is the main user of these arenas. No functional change intended. Reviewed by: alc Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17247	2018-09-19 19:13:43 +00:00
Mateusz Guzik	c035292545	vm: check for empty kstack cache before locking. The current cache logic checks the total number of stacks in the kernel, which even on small boxes significantly exceeds the 128 limit (e.g. an 8-way box with zfs has almost 800 stacks allocated). Stacks are cached earlier for each main thread. As a result the code is rarely executed, but when it is then (on boxes like the above) it always fails. Since there are no provisions made for NUMA and release time is approaching, just do a quick check to avoid acquiring the lock. Approved by: re (kib)	2018-09-19 16:02:33 +00:00
Mark Johnston	26fe2217bf	Only update the domain cursor once in keg_fetch_slab(). We drop the keg lock when we go to actually allocate the slab, allowing other threads to advance the cursor. This can cause us to exit the round-robin loop before having attempted allocations from all domains, resulting in a hang during a subsequent blocking allocation attempt from a depleted domain. Reported and tested by: Jan Bramkamp <crest@bultmann.eu> Reviewed by: alc, cem Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17209	2018-09-18 17:51:45 +00:00
Mateusz Guzik	2554f86a8d	vm: stop taking proc lock in mmap to satisfy racct if it is disabled. Limits can be safely obtained with lim_cur from the thread. racct is compiled in but disabled by default. Note that racct enablement is a boot-only tunable. This eliminates the second most common place where the lock is taken during pkg building. While here, don't take the lock in mlockall either. Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17210	2018-09-18 01:24:30 +00:00
Mark Johnston	7a364d458a	Split some checks in vm_page_activate() to make it easier to read. No functional change intended. Reviewed by: alc, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17028	2018-09-10 18:59:23 +00:00
Mark Johnston	5a7f993702	Relax an assertion in vm_pqbatch_process_page(). While executing vm_pqbatch_process_page(m), m->queue may change to PQ_NONE if the page daemon is concurrently freeing the page. In this case m's queue state flags must be clear, so vm_pqbatch_process_page() will be a no-op, but the race could cause spurious assertion failures. Correct the assertion which assumed that m->queue's value does not change while the page queue lock is held. Reviewed by: alc, kib Reported and tested by: pho Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17027	2018-09-08 21:49:43 +00:00
Mark Johnston	c56c7299c2	Use the correct terminology. Reported by: kib Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D16191	2018-09-06 20:02:19 +00:00
Mark Johnston	23984ce5cd	Avoid resource deadlocks when one domain has exhausted its memory. Attempt other allowed domains if the requested domain is below the minimum paging threshold. Block in fork only if all domains available to the forking thread are below the severe threshold rather than any. Submitted by: jeff Reported by: mjg Reviewed by: alc, kib, markj Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D16191	2018-09-06 19:28:52 +00:00
Mark Johnston	21f01f4584	Remove vm_page_remque(). Testing m->queue != PQ_NONE is not sufficient; see the commit log message for r338276. As of r332974 vm_page_dequeue() handles already-dequeued pages, so just replace vm_page_remque() calls with vm_page_dequeue() calls. Reviewed by: kib Tested by: pho Approved by: re (marius) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17025	2018-09-06 16:17:45 +00:00
Alan Cox	72aebdd742	Recent changes have created, for the first time, physical memory segments that can be coalesced. To be clear, fragmentation of phys_avail[] is not the cause. This fragmentation of vm_phys_segs[] arises from the "special" calls to vm_phys_add_seg(), in other words, not those that derive directly from phys_avail[], but those that we create for the initial kernel page table pages and now for the kernel and modules loaded at boot time. Since we sometimes iterate over the physical memory segments, coalescing these segments at initialization time is a worthwhile change. Reviewed by: kib, markj Approved by: re (rgrimes) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D16976	2018-09-02 18:29:38 +00:00
Konstantin Belousov	f0165b1ca6	Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines. Exposing max_offset and min_offset defines in public headers is causing clashes with variable names, for example when building QEMU. Based on the submission by: royger Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Approved by: re (marius) Differential revision: https://reviews.freebsd.org/D16881	2018-08-29 12:24:19 +00:00
Mark Murray	19fa89e938	Remove the Yarrow PRNG algorithm option in accordance with due notice given in random(4). This includes updating of the relevant man pages, and no-longer-used harvesting parameters. Ensure that the pseudo-unit-test still does something useful, now also with the "other" algorithm instead of Yarrow. PR: 230870 Reviewed by: cem Approved by: so(delphij,gtetlow) Approved by: re(marius) Differential Revision: https://reviews.freebsd.org/D16898	2018-08-26 12:51:46 +00:00
Alan Cox	49bfa624ac	Eliminate the arena parameter to kmem_free(). Implicitly this corrects an error in the function hypercall_memfree(), where the wrong arena was being passed to kmem_free(). Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are mapped in kmem with execute permissions. Use this flag to determine which arena the kmem virtual addresses are returned to. Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it redundant. Update the nearby comment for UMA_SLAB_KERNEL. Reviewed by: kib, markj Discussed with: jeff Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16845	2018-08-25 19:38:08 +00:00
Gleb Smirnoff	306abf0f35	Either "free" or "allocated" is misleading here, since an item in a bucket is free from the perspective of the UMA consumer, and it is allocated from the perspective of the keg. Discussed with: markj Approved by: re (kib)	2018-08-24 18:47:50 +00:00
Gleb Smirnoff	a307fb5b0c	Fix comment. The actual meaning of ub_cnt is the opposite.	2018-08-23 23:24:28 +00:00
Mark Johnston	899fe184c7	Add a per-pagequeue pdpages counter. Expose these counters under the vm.domain sysctl node. The existing vm.stats.vm.v_pdpages sysctl is preserved. Reviewed by: alc (previous version) Differential Revision: https://reviews.freebsd.org/D14666	2018-08-23 21:03:45 +00:00
Mark Johnston	99d92d732f	Ensure that queue state is cleared when vm_page_dequeue() returns. Per-page queue state is updated non-atomically, with either the page lock or the page queue lock held. When vm_page_dequeue() is called without the page lock, in rare cases a different thread may be concurrently dequeuing the page with the pagequeue lock held. Because of the non-atomic update, vm_page_dequeue() might return before queue state is completely updated, which can lead to race conditions. Restrict the vm_page_dequeue() interface so that it must be called either with the page lock held or on a free page, and busy wait when a different thread is concurrently updating queue state, which must happen in a critical section. While here, do some related cleanup: inline vm_page_dequeue_locked() into its only caller and delete a prototype for the unimplemented vm_page_requeue_locked(). Replace the volatile qualifier for "queue" added in r333703 with explicit uses of atomic_load_8() where required. Reported and tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D15980	2018-08-23 20:34:22 +00:00
Alan Cox	83a90bffd8	Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter became unused in FreeBSD 12.x as a side-effect of the NUMA-related changes.) Reviewed by: kib, markj Discussed with: jeff, re@ Differential Revision: https://reviews.freebsd.org/D16825	2018-08-21 16:43:46 +00:00
Alan Cox	44d0efb215	Eliminate kmem_alloc_contig()'s unused arena parameter. Reviewed by: hselasky, kib, markj Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D16799	2018-08-20 15:57:27 +00:00
Alan Cox	db7c2a4822	Eliminate the unused arena parameter from kmem_alloc_attr(). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D16793	2018-08-18 22:07:48 +00:00
Alan Cox	067fd85894	Eliminate the arena parameter to kmem_malloc_domain(). It is redundant. The domain and flags parameters suffice. In fact, the related functions kmem_alloc_{attr,contig}_domain() don't have an arena parameter. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D16713	2018-08-18 18:33:50 +00:00
Konstantin Belousov	c1344d2bbe	Prevent some parallel swap-ins, rate-limit swapper swap-ins. If faultin() was called outside the swapper (from PHOLD()), do not allow the swapper to initiate additional swap-ins. Swapper-initiated swap-ins are serialized because they are synchronous and executed in the context of thread0. With the added limitation, we only allow parallel swap-ins from PHOLD(), which is up to PHOLD() users to manage; usually they do not need to. Rate-limit the swapper's swap-ins to one per MAXSLP / 2 second interval, counting faultin() swap-ins. Suggested by: alc Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16610	2018-08-13 16:48:46 +00:00
Mark Johnston	b50a4ea646	Account for the lowmem handlers in the inactive queue scan target. Before r329882 the target would be computed after lowmem handlers run and free pages. On some systems a significant amount of page reclamation happens this way. However, with r329882 the target is computed first, which can lead to unnecessary reclamation from the page cache, and this in turn may result in excessive swapping. Instead, adjust the target after running lowmem handlers. Don't invoke the lowmem handlers before the PID controller, though, since that would hide the true rate of page allocation. Reviewed by: alc, kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16606	2018-08-09 18:25:49 +00:00
Alan Cox	2bf8cb3804	Add support for pmap_enter(..., psind=1) to the armv6 pmap. In other words, add support for explicitly requesting that pmap_enter() create a 1 MB page mapping. (Essentially, this feature allows the machine-independent layer to create superpage mappings preemptively, and not wait for automatic promotion to occur.) Export pmap_ps_enabled() to the machine-independent layer. Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail or reclaim a PV entry when one is not available. Refactor pmap_enter_pte1() into two functions, one by the same name, that is a general-purpose function for creating pte1 mappings, and another, pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute- only mappings for execve(2), mmap(2), and shmat(2). In addition, as an optimization to pmap_enter(..., psind=0), eliminate the use of pte2_is_managed() from pmap_enter(). Unlike the x86 pmap implementations, armv6 does not have a managed bit defined within the PTE. So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n) in the number of vm_phys_segs[]. All but one call to PHYS_TO_VM_PAGE() in pmap_enter() can be avoided. Reviewed by: kib, markj, mmel Tested by: mmel MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D16555	2018-08-08 16:55:01 +00:00
Alan Cox	78f1deeffe	Defer and aggregate swp_pager_meta_build frees. Before swp_pager_meta_build replaces an old swapblk with a new one, it frees the old one. To allow such freeing of blocks to be aggregated, have swp_pager_meta_build return the old swap block, and make the caller responsible for freeing it. Define a pair of short static functions, swp_pager_init_freerange and swp_pager_update_freerange, to do the initialization and updating of blk addresses and counters used in aggregating blocks to be freed. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib, markj (an earlier version) Tested by: pho MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13707	2018-08-08 02:30:34 +00:00
Konstantin Belousov	a70e9a1388	Swap in WKILLED processes. A swapped-out process that is WKILLED must be swapped in as soon as possible. The reason is that such a process can be killed by OOM, and its pages can only be freed once the process exits. To exit, the kernel stack of the process must be mapped. When allocating pages for the stack of a WKILLED process on swap-in, use VM_ALLOC_SYSTEM requests to increase the chance that the allocation succeeds. Add a counter of swapped-out processes to avoid unneeded iteration over the allprocs list when there is no work to do, reducing allproc_lock ownership. Reviewed by: alc, markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16489	2018-08-04 20:45:43 +00:00

3938 Commits