freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	013072f04c	Fix pre-SI_SUB_CPU initialization of per-CPU counters. r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the backend allocator for PCPU UMA zones. Unlike page_alloc(), it does not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that. r336020 also changed counter(9) to initialize each counter using a CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU, smp_rendezvous() will only execute the callback on the current CPU (i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed by virtue of the fact that UMA gratuitously zeroes slabs when importing them into a zone. Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by iterating over the full range of CPU IDs when zeroing counters, ignoring whether the corresponding bits in all_cpus are set. Reported and tested by: pho (previous version) Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D16190	2018-07-10 00:18:12 +00:00
Sean Bruno	a03af34228	Wrap the declaration and assignment of "stripe" with #ifdef NUMA declarations as not all targets are NUMA aware. Found with gcc. Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16113	2018-07-07 13:37:44 +00:00
Matt Macy	ab3059a8e7	Back pcpu zone with domain correct pages - Change pcpu zone consumers to use a stride size of PAGE_SIZE. (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier) - Allocate page from the correct domain for a given cpu. - Don't initialize pc_domain to non-zero value if NUMA is not defined There are some misconceptions surrounding this field. It is the _VM_ NUMA domain and should only ever correspond to valid domain values as understood by the VM. The former slab size of sizeof(struct pcpu) was somewhat arbitrary. The new value is PAGE_SIZE because that's the smallest granularity which the VM can allocate a slab for a given domain. If you have fewer than PAGE_SIZE/8 counters on your system there will be some memory wasted, but this is obviously something where you want the cache line to be coming from the correct domain. Reviewed by: jeff Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15933	2018-07-06 02:06:03 +00:00
Ian Lepore	c5b7751fa2	Eliminate a spurious panic on non-SMP systems (occurred on shutdown/reboot).	2018-06-22 20:22:26 +00:00
Ruslan Bukin	b47999470d	Fix uma_zalloc_pcpu_arg() operation in case of !SMP build. Reviewed by: mjg Sponsored by: DARPA, AFRL	2018-06-21 11:43:54 +00:00
Jonathan T. Looney	0766f278d8	Make UMA and malloc(9) return non-executable memory in most cases. Most kernel memory that is allocated after boot does not need to be executable. There are a few exceptions. For example, kernel modules do need executable memory, but they don't use UMA or malloc(9). The BPF JIT compiler also needs executable memory and did use malloc(9) until r317072. (Note that a side effect of r316767 was that the "small allocation" path in UMA on amd64 already returned non-executable memory. This meant that some calls to malloc(9) or the UMA zone(9) allocator could return executable memory, while others could return non-executable memory. This change makes the behavior consistent.) This change makes malloc(9) return non-executable memory unless the new M_EXEC flag is specified. After this change, the UMA zone(9) allocator will always return non-executable memory, and a KASSERT will catch attempts to use the M_EXEC flag to allocate executable memory using uma_zalloc() or its variants. Allocations that do need executable memory have various choices. They may use the M_EXEC flag to malloc(9), or they may use a different VM interface to obtain executable pages. Now that malloc(9) again allows executable allocations, this change also reverts most of r317072. PR: 228927 Reviewed by: alc, kib, markj, jhb (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15691	2018-06-13 17:04:41 +00:00
Mateusz Guzik	4e180881ae	uma: implement provisional api for per-cpu zones Per-cpu zone allocations are very rarely done compared to regular zones. The intent is to avoid pessimizing the latter case with per-cpu specific code. In particular contrary to the claim in r334824, M_ZERO is sometimes being used for such zones. But the zeroing method is completely different and branching on it in the fast path for regular zones is a waste of time.	2018-06-08 21:40:03 +00:00
Mateusz Guzik	b8af2820f6	uma: fix up r334824 Turns out there is code which ends up passing M_ZERO to counters. Since counters zero unconditionally on their own, just drop the flag in that place.	2018-06-08 05:40:36 +00:00
Mateusz Guzik	ea99223ec9	uma: remove M_ZERO support for pcpu zones Nothing in the tree uses it and pcpu zones have a fundamentally different use case than the regular zones - they are not supposed to be allocated and freed all the time. This reduces pollution in the allocation fast path.	2018-06-08 03:16:16 +00:00
Gleb Smirnoff	c5deaf0452	UMA memory debugging enabled with INVARIANTS consists of two things: trashing freed memory and checking that allocated memory is properly trashed, and also of keeping a bitset of freed items. Trashing/checking creates a lot of CPU cache poisoning, while keeping debugging bitsets consistent creates a lot of contention on UMA zone lock(s). The performance difference between INVARIANTS kernel and normal one is mostly attributed to UMA debugging, rather than to all KASSERT checks in the kernel. Add loader tunable vm.debug.divisor that allows either to turn off UMA debugging completely, or turn it on only for a fraction of allocations, while still running all KASSERTs in kernel. That allows to run INVARIANTS kernels in production environments without reducing load by orders of magnitude, but still doing useful extra checks. Default value is 1, meaning debug every allocation. Value of 0 would disable UMA debugging completely. Values above 1 enable debugging only for every N-th item. It isn't possible to strictly follow the number, but still amount of debugging is reduced roughly by (N-1)/N percent. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15199	2018-06-08 00:15:08 +00:00
Mateusz Guzik	e825ab8d89	uma: whack main zone counter update in the slow path Cached counters are typically zero at this point so it performs avoidable atomics. Everything reading them also reads the cached ones, thus there is really no point. Reviewed by: jeff	2018-04-27 05:37:35 +00:00
Mark Johnston	7e28037a09	Add a UMA zone flag to disable the use of buckets. This allows the creation of zones which don't do any caching in front of the keg. If the zone is a cache zone, this means that UMA will not attempt any memory allocations when allocating an item from the backend. This is intended for use after a panic by netdump, but likely has other applications. Reviewed by: kib MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15184	2018-04-24 20:05:45 +00:00
Gleb Smirnoff	b92b26ad08	Use UMA_SLAB_SPACE macro. No functional change here.	2018-04-02 05:15:25 +00:00
Gleb Smirnoff	96a10340ce	In uma_startup_count() handle special case when zone will fit into single slab, but with alignment adjustment it won't. Again, when there is only one item in a slab alignment can be ignored. See previous revision of this file for more info. PR: 227116	2018-04-02 05:14:31 +00:00
Gleb Smirnoff	1ca6ed4589	Handle a special case when a slab can fit only one allocation, and zone has a large alignment. With alignment taken into account uk_rsize will be greater than space in a slab. However, since we have only one item per slab, it is always naturally aligned. Code that will panic before this change with 4k page: z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0); uma_zalloc(z, M_WAITOK); A practical scenario to hit the panic is a machine with 56 CPUs and 2 NUMA domains, which yields in zone size of 3984. PR: 227116 MFC after: 2 weeks	2018-04-02 05:11:59 +00:00
Jeff Roberson	e8bb2dc7c9	Add the flag ZONE_NOBUCKETCACHE. This flag instructs UMA not to keep a cache of fully populated buckets. This will be used in a follow-on commit. The flag idea was originally from markj. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-04-01 04:47:05 +00:00
Konstantin Belousov	63b5d112b6	For vm_zone_stats() sysctl handler, do not drain sbuf calling copyout(9) while owning zone lock. Despite old value sysctl buffer is wired, spurious faults might still occur. Note that we still own the uma_rwlock there, but this lock does not participate in sensitive lock orders. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:48:53 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Gleb Smirnoff	5073a08328	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
Gleb Smirnoff	d2be4a1e4f	Use correct arithmetic to calculate how many pages we need for kegs and hashes. There is no functional change with current sizes.	2018-02-06 22:13:40 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Gleb Smirnoff	1616767dfc	Improve DIAGNOSTIC printf. Report using a boot page every time regardless of booted status.	2018-02-06 22:08:43 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The function calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only zone structures, but also kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
Jeff Roberson	b6715dab8f	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Jeff Roberson	ad5b0f5b51	Fix arc after r326347 broke various memory limit queries. Use UMA features rather than kmem arena size to determine available memory. Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before the real limit is set. PR: 224330 (partial), 224080 Reviewed by: markj, avg Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13494	2018-01-02 04:35:56 +00:00
Konstantin Belousov	200f8117ba	Perform all accesses to uma_reclaim_needed using atomic(9) KPI. Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 10:06:55 +00:00
Mark Johnston	952a29c04b	Fix the UMA reclaim worker after r326347. atomic_set_*() sets a bit in the target memory location, so atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like it does. PR: 224080 Reviewed by: jeff, kib Differential Revision: https://reviews.freebsd.org/D13412	2017-12-07 19:38:09 +00:00
Jeff Roberson	2e47807c21	Eliminate kmem_arena and kmem_object in preparation for further NUMA commits. The arena argument to kmem_*() is now only used in an assert. A follow-up commit will remove the argument altogether before we freeze the API for the next release. This replaces the hard limit on kmem size with a soft limit imposed by UMA. When the soft limit is exceeded we periodically wakeup the UMA reclaim thread to attempt to shrink KVA. On 32bit architectures this should behave much more gracefully as we exhaust KVA. On 64bit the limits are likely never hit. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13187	2017-11-28 23:40:54 +00:00
Pedro F. Giffuni	fe267a5590	sys: general adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. No functional change intended.	2017-11-27 15:23:17 +00:00
Konstantin Belousov	772c8b6749	Fix operator priority. Sponsored by: The FreeBSD Foundation	2017-11-08 23:25:05 +00:00
Jeff Roberson	8d6fbbb867	Replace many instances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Mark Johnston	2934eb8a22	Fix a logic error in the item size calculation for internal UMA zones. Kegs for internal zones always keep the slab header in the slab itself. Therefore, when determining the allocation size, we need to take the slab header size into account. Reported and tested by: ae, rakuco Reviewed by: avg MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D12342	2017-09-13 15:44:54 +00:00
Mateusz Guzik	fe933c1d88	Start annotating global _padalign locks with __exclusive_cache_line While these locks are guaranteed to not share their respective cache lines, their current placement leaves unnecessary holes in lines which preceded them. For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour cachelines (previously separate by the lock) to be collapsed into 1. The annotation is only effective on architectures which have it implemented in their linker script (currently only amd64). Thus locks are not converted to their not-padaligned variants as to not affect the rest. MFC after: 1 week	2017-09-06 20:28:18 +00:00
Gleb Smirnoff	77e1943785	When we are in UMA_STARTUP use startup_alloc() for any zone, not for internal zones only. This allows to create new zones at early stages of boot, without need to mark them as internal to UMA, which isn't always true. Reviewed by: alc	2017-06-08 21:33:19 +00:00
Gleb Smirnoff	1431a74845	As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs.	2017-06-01 18:36:52 +00:00
Gleb Smirnoff	ac0a6fd015	Simplify boot pages management in UMA. It is simply a contiguous virtual memory pointer and number of pages. There is no need to build a linked list here. Just increment pointer and decrement counter. The only functional difference to old allocator is that before we gave pages from topmost and down to lowest, and now we give them in normal ascending order. While here remove padalign from a mutex that is unused at runtime. Reviewed by: alc	2017-06-01 18:26:57 +00:00
John Baldwin	a5a355788e	Assert that the align parameter to uma_zcreate() is valid. Reviewed by: kib MFC after: 1 week Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D10100	2017-04-04 16:26:46 +00:00
Andriy Gapon	57223e9994	uma: fix pages <-> items conversions at several places Those places were not taking into account uk_ppera. At present one allocation is always used by one slab, so uk_ppera must be used to convert between pages and slabs. uk_ipers is used to convert between slabs and items. MFC after: 1 month (if ever)	2017-03-11 16:43:38 +00:00
Andriy Gapon	a55ebb7cd5	uma: eliminate uk_slabsize field The field was not used beyond the initial keg setup stage anyway. MFC after: 1 month (if ever)	2017-03-11 16:35:36 +00:00
Andriy Gapon	9b43bc27c4	call vm_lowmem hook in uma_reclaim_worker A comment near kmem_reclaim() implies that we already did that. Calling the hook is useful, because some handlers, e.g. ARC, might be able to release significant amounts of KVA. Now that we have more than one place where vm_lowmem hook is called, use this change as an opportunity to introduce flags that describe a reason for calling the hook. No handler makes use of the flags yet. Reviewed by: markj, kib MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9764	2017-02-25 16:39:21 +00:00
Justin Hibbits	b5345ef10e	Print flags in hex instead of decimal. Hex is easier to grok for flags, and consistent with other prints.	2017-01-02 16:50:52 +00:00
Mark Johnston	829be5168d	Simplify keg_drain() a bit by using LIST_FOREACH_SAFE. MFC after: 1 week	2016-10-20 23:10:27 +00:00
Mark Johnston	afa5d70339	Release the second critical section in uma_zfree_arg() slightly earlier. It is only needed when removing a full bucket from the per-CPU cache. The bucket cache (uz_buckets) is protected by the zone mutex and thus the critical section can be released before inserting into that list. MFC after: 1 week	2016-07-20 01:01:50 +00:00
Nathan Whitehorn	96c85efb4b	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
Mark Johnston	bc9d08e1cf	Fix memguard(9) in kernels with INVARIANTS enabled. With r284861, UMA zones use the trash ctor and dtor by default. This is incompatible with memguard, which frees the backing page when the item is freed. Modify the UMA debug functions to be no-ops if the item was allocated from memguard. This also fixes constructors such as mb_ctor_pack(), which invokes the trash ctor in addition to performing some initialization. Reviewed by: glebius MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D6562	2016-06-01 22:31:35 +00:00
Pedro F. Giffuni	763df3ec55	sys/vm: minor spelling fixes in comments. No functional change.	2016-05-02 20:16:29 +00:00
Gleb Smirnoff	cfcae3f86f	Remove UMA_ZONE_REFCNT feature, now unused. Blessed by: jeff	2016-03-01 00:33:32 +00:00
Gleb Smirnoff	e60b2fcbeb	Redo r292484. Embed task(9) into zone, so that uz_maxaction is called in a context that can sleep, allowing consumers of the KPI to run their drain routines without any extra measures. Discussed with: jtl	2016-02-03 23:30:17 +00:00
Gleb Smirnoff	9542ea7b80	Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows to make uma_dbg.h not depend on uma_int.h, which allows to uninclude uma_int.h from the mbuf(9) allocator.	2016-02-03 22:02:36 +00:00
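
Commit 952a29c04b in the log above points out that atomic_set_*() sets bits in the target location, so atomic_set_int(&uma_reclaim_needed, 0) cannot clear the flag. A minimal user-space sketch of that pitfall follows. It uses C11 atomics as stand-ins for the kernel primitives; the helper names set_bits(), store_value(), clear_with_set_bits() and clear_with_store() are hypothetical and are not taken from the FreeBSD tree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the bit-setting semantics of atomic_set_int(9): *p |= mask. */
static void
set_bits(atomic_uint *p, unsigned int mask)
{
	assert(p != NULL);
	atomic_fetch_or(p, mask);
}

/* Stand-in for a plain atomic store: *p = v. */
static void
store_value(atomic_uint *p, unsigned int v)
{
	assert(p != NULL);
	atomic_store(p, v);
}

/* Tries to "clear" a raised flag by OR-ing in 0; returns the final value. */
static unsigned int
clear_with_set_bits(void)
{
	atomic_uint flag = 1;

	set_bits(&flag, 0);		/* OR with 0 changes nothing */
	return (atomic_load(&flag));
}

/* Clears a raised flag with a store; returns the final value. */
static unsigned int
clear_with_store(void)
{
	atomic_uint flag = 1;

	store_value(&flag, 0);		/* overwrites the old value */
	return (atomic_load(&flag));
}
```

In this sketch clear_with_set_bits() leaves the flag raised, while clear_with_store() resets it, which is the distinction the commit message describes.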

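Commit 96c85efb4b in the log above describes code that assumed CPU IDs are dense in [0, mp_ncpus) and therefore touched CPUs that do not exist on systems with sparse IDs. The sketch below illustrates the difference in user space. The struct cpu_topology type, the MAXCPU_SKETCH limit and all function names are hypothetical illustrations; cpu_count() and cpu_maxid() only play the roles of mp_ncpus and mp_maxid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MAXCPU_SKETCH	8	/* hypothetical limit for this sketch */

/* Hypothetical topology snapshot: bit i set means CPU i exists. */
struct cpu_topology {
	uint32_t present;
};

static int
cpu_present(const struct cpu_topology *t, int id)
{
	return ((t->present & (UINT32_C(1) << id)) != 0);
}

/* Number of existing CPUs (the role played by mp_ncpus). */
static int
cpu_count(const struct cpu_topology *t)
{
	int n = 0;

	for (int i = 0; i < MAXCPU_SKETCH; i++)
		if (cpu_present(t, i))
			n++;
	return (n);
}

/* Highest existing CPU ID (the role played by mp_maxid), or -1 if none. */
static int
cpu_maxid(const struct cpu_topology *t)
{
	int maxid = -1;

	for (int i = 0; i < MAXCPU_SKETCH; i++)
		if (cpu_present(t, i))
			maxid = i;
	return (maxid);
}

/* Buggy pattern: treats IDs 0..count-1 as if they all existed. */
static int
visit_assuming_dense(const struct cpu_topology *t, int *out)
{
	int n = cpu_count(t);

	assert(out != NULL);
	for (int i = 0; i < n; i++)
		out[i] = i;
	return (n);
}

/* Safe pattern: walks 0..maxid and skips holes, like a CPU_FOREACH() loop. */
static int
visit_present(const struct cpu_topology *t, int *out)
{
	int n = 0;

	assert(out != NULL);
	for (int i = 0; i <= cpu_maxid(t); i++)
		if (cpu_present(t, i))
			out[n++] = i;
	return (n);
}
```

With a mask of 0x55 (CPUs 0, 2, 4 and 6 present, as might happen with SMT disabled), the dense loop visits the nonexistent CPUs 1 and 3, and a per-CPU buffer indexed by CPU ID needs cpu_maxid() + 1 slots rather than cpu_count().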

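Commit ac0a6fd015 in the log above reduces boot page management to a contiguous pointer plus a page count: increment the pointer, decrement the counter, and hand pages out in ascending order. A minimal user-space sketch of that idea follows. The struct boot_pages type, the BOOT_PAGE_SIZE constant and both functions are hypothetical, and returning NULL on exhaustion is an assumption made for this sketch rather than the behaviour of the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>

#define	BOOT_PAGE_SIZE	4096	/* assumed page size for this sketch */

/* Hypothetical bump-allocator state: next free page and pages remaining. */
struct boot_pages {
	char	*next;
	size_t	 left;
};

static void
boot_pages_init(struct boot_pages *bp, void *base, size_t npages)
{
	assert(bp != NULL && base != NULL);
	bp->next = base;
	bp->left = npages;
}

/* Hands out npages contiguous pages in ascending order, or NULL if short. */
static void *
boot_pages_alloc(struct boot_pages *bp, size_t npages)
{
	void *p;

	assert(bp != NULL);
	if (npages == 0 || npages > bp->left)
		return (NULL);
	p = bp->next;
	bp->next += npages * BOOT_PAGE_SIZE;	/* increment pointer */
	bp->left -= npages;			/* decrement counter */
	return (p);
}
```

No free list is needed because pages are only ever taken from the front of the remaining region and are never returned to it.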
290 Commits