freebsd-skq

Author	SHA1	Message	Date
Mateusz Guzik	4e180881ae	uma: implement provisional api for per-cpu zones Per-cpu zone allocations are very rarely done compared to regular zones. The intent is to avoid pessimizing the latter case with per-cpu specific code. In particular contrary to the claim in r334824, M_ZERO is sometimes being used for such zones. But the zeroing method is completely different and braching on it in the fast path for regular zones is a waste of time.	2018-06-08 21:40:03 +00:00
Mateusz Guzik	b8af2820f6	uma: fix up r334824 Turns out there is code which ends up passing M_ZERO to counters. Since counters zero unconditionally on their own, just ignore drop the flag in that place.	2018-06-08 05:40:36 +00:00
Mateusz Guzik	ea99223ec9	uma: remove M_ZERO support for pcpu zones Nothing in the tree uses it and pcpu zones have a fundamentally different use case than the regular zones - they are not supposed to be allocated and freed all the time. This reduces pollution in the allocation fast path.	2018-06-08 03:16:16 +00:00
Gleb Smirnoff	c5deaf0452	UMA memory debugging enabled with INVARIANTS consists of two things: trashing freed memory and checking that allocated memory is properly trashed, and also of keeping a bitset of freed items. Trashing/checking creates a lot of CPU cache poisoning, while keeping debugging bitsets consistent creates a lot of contention on UMA zone lock(s). The performance difference between INVARIANTS kernel and normal one is mostly attributed to UMA debugging, rather than to all KASSERT checks in the kernel. Add loader tunable vm.debug.divisor that allows either to turn off UMA debugging completely, or turn it on only for a fraction of allocations, while still running all KASSERTs in kernel. That allows to run INVARIANTS kernels in production environments without reducing load by orders of magnitude, but still doing useful extra checks. Default value is 1, meaning debug every allocation. Value of 0 would disable UMA debugging completely. Values above 1 enable debugging only for every N-th item. It isn't possible to strictly follow the number, but still amount of debugging is reduced roughly by (N-1)/N percent. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15199	2018-06-08 00:15:08 +00:00
Mateusz Guzik	e825ab8d89	uma: whack main zone counter update in the slow path Cached counters are typically zero at this point so it performs avoidable atomics. Everything reading them also reads the cached ones, thus there is really no point. Reviewed by: jeff	2018-04-27 05:37:35 +00:00
Mark Johnston	7e28037a09	Add a UMA zone flag to disable the use of buckets. This allows the creation of zones which don't do any caching in front of the keg. If the zone is a cache zone, this means that UMA will not attempt any memory allocations when allocating an item from the backend. This is intended for use after a panic by netdump, but likely has other applications. Reviewed by: kib MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15184	2018-04-24 20:05:45 +00:00
Gleb Smirnoff	b92b26ad08	Use UMA_SLAB_SPACE macro. No functional change here.	2018-04-02 05:15:25 +00:00
Gleb Smirnoff	96a10340ce	In uma_startup_count() handle special case when zone will fit into single slab, but with alignment adjustment it won't. Again, when there is only one item in a slab alignment can be ignored. See previous revision of this file for more info. PR: 227116	2018-04-02 05:14:31 +00:00
Gleb Smirnoff	1ca6ed4589	Handle a special case when a slab can fit only one allocation, and zone has a large alignment. With alignment taken into account uk_rsize will be greater than space in a slab. However, since we have only one item per slab, it is always naturally aligned. Code that will panic before this change with 4k page: z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0); uma_zalloc(z, M_WAITOK); A practical scenario to hit the panic is a machine with 56 CPUs and 2 NUMA domains, which yields in zone size of 3984. PR: 227116 MFC after: 2 weeks	2018-04-02 05:11:59 +00:00
Jeff Roberson	e8bb2dc7c9	Add the flag ZONE_NOBUCKETCACHE. This flag instructions UMA not to keep a cache of fully populated buckets. This will be used in a follow-on commit. The flag idea was originally from markj. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-04-01 04:47:05 +00:00
Konstantin Belousov	63b5d112b6	For vm_zone_stats() sysctl handler, do not drain sbuf calling copyout(9) while owning zone lock. Despite old value sysctl buffer is wired, spurious faults might still occur. Note that we still own the uma_rwlock there, but this lock does not participate in sensitive lock orders. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:48:53 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Gleb Smirnoff	5073a08328	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
Gleb Smirnoff	d2be4a1e4f	Use correct arithmetic to calculate how many pages we need for kegs and hashes. There is no functional change with current sizes.	2018-02-06 22:13:40 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Gleb Smirnoff	1616767dfc	Improve DIAGNOSTIC printf. Report using a boot page every time regardless of booted status.	2018-02-06 22:08:43 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
Jeff Roberson	b6715dab8f	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Jeff Roberson	ad5b0f5b51	Fix arc after r326347 broke various memory limit queries. Use UMA features rather than kmem arena size to determine available memory. Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before the real limit is set. PR: 224330 (partial), 224080 Reviewed by: markj, avg Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13494	2018-01-02 04:35:56 +00:00
Konstantin Belousov	200f8117ba	Perform all accesses to uma_reclaim_needed using atomic(9) KPI. Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 10:06:55 +00:00
Mark Johnston	952a29c04b	Fix the UMA reclaim worker after r326347. atomic_set_*() sets a bit in the target memory location, so atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like it does. PR: 224080 Reviewed by: jeff, kib Differential Revision: https://reviews.freebsd.org/D13412	2017-12-07 19:38:09 +00:00
Jeff Roberson	2e47807c21	Eliminate kmem_arena and kmem_object in preparation for further NUMA commits. The arena argument to kmem_*() is now only used in an assert. A follow-up commit will remove the argument altogether before we freeze the API for the next release. This replaces the hard limit on kmem size with a soft limit imposed by UMA. When the soft limit is exceeded we periodically wakeup the UMA reclaim thread to attempt to shrink KVA. On 32bit architectures this should behave much more gracefully as we exhaust KVA. On 64bit the limits are likely never hit. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13187	2017-11-28 23:40:54 +00:00
Pedro F. Giffuni	fe267a5590	sys: general adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. No functional change intended.	2017-11-27 15:23:17 +00:00
Konstantin Belousov	772c8b6749	Fix operator priority. Sponsored by: The FreeBSD Foundation	2017-11-08 23:25:05 +00:00
Jeff Roberson	8d6fbbb867	Replace manyinstances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Mark Johnston	2934eb8a22	Fix a logic error in the item size calculation for internal UMA zones. Kegs for internal zones always keep the slab header in the slab itself. Therefore, when determining the allocation size, we need to take the slab header size into account. Reported and tested by: ae, rakuco Reviewed by: avg MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D12342	2017-09-13 15:44:54 +00:00
Mateusz Guzik	fe933c1d88	Start annotating global _padalign locks with __exclusive_cache_line While these locks are guarnteed to not share their respective cache lines, their current placement leaves unnecessary holes in lines which preceeded them. For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour cachelines (previously separate by the lock) to be collapsed into 1. The annotation is only effective on architectures which have it implemented in their linker script (currently only amd64). Thus locks are not converted to their not-padaligned variants as to not affect the rest. MFC after: 1 week	2017-09-06 20:28:18 +00:00
Gleb Smirnoff	77e1943785	When we are in UMA_STARTUP use startup_alloc() for any zone, not for internal zones only. This allows to create new zones at early stages of boot, without need to mark them as internal to UMA, which isn't always true. Reviewed by: alc	2017-06-08 21:33:19 +00:00
Gleb Smirnoff	1431a74845	As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs.	2017-06-01 18:36:52 +00:00
Gleb Smirnoff	ac0a6fd015	Simplify boot pages management in UMA. It is simply a contigous virtual memory pointer and number of pages. There is no need to build a linked list here. Just increment pointer and decrement counter. The only functional difference to old allocator is that before we gave pages from topmost and down to lowest, and now we give them in normal ascending order. While here remove padalign from a mutex that is unused at runtime. Reviewed by: alc	2017-06-01 18:26:57 +00:00
John Baldwin	a5a355788e	Assert that the align parameter to uma_zcreate() is valid. Reviewed by: kib MFC after: 1 week Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D10100	2017-04-04 16:26:46 +00:00
Andriy Gapon	57223e9994	uma: fix pages <-> items conversions at several places Those places were not taking into account uk_ppera. At present one allocation is always used by one slab, so uk_ppera must be used to convert between pages and slabs. uk_ipers is used to convert between slabs and items. MFC after: 1 month (if ever)	2017-03-11 16:43:38 +00:00
Andriy Gapon	a55ebb7cd5	uma: eliminate uk_slabsize field The field was not used beyond the initial keg setup stage anyway. MFC after: 1 month (if ever)	2017-03-11 16:35:36 +00:00
Andriy Gapon	9b43bc27c4	call vm_lowmem hook in uma_reclaim_worker A comment near kmem_reclaim() implies that we already did that. Calling the hook is useful, because some handlers, e.g. ARC, might be able to release significant amounts of KVA. Now that we have more than one place where vm_lowmem hook is called, use this change as an opportunity to introduce flags that describe a reason for calling the hook. No handler makes use of the flags yet. Reviewed by: markj, kib MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9764	2017-02-25 16:39:21 +00:00
Justin Hibbits	b5345ef10e	Print flags in hex instead of decimal. Hex is easier to grok for flags, and consistent with other prints.	2017-01-02 16:50:52 +00:00
Mark Johnston	829be5168d	Simplify keg_drain() a bit by using LIST_FOREACH_SAFE. MFC after: 1 week	2016-10-20 23:10:27 +00:00
Mark Johnston	afa5d70339	Release the second critical section in uma_zfree_arg() slightly earlier. It is only needed when removing a full bucket from the per-CPU cache. The bucket cache (uz_buckets) is protected by the zone mutex and thus the critical section can be released before inserting into that list. MFC after: 1 week	2016-07-20 01:01:50 +00:00
Nathan Whitehorn	96c85efb4b	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
Mark Johnston	bc9d08e1cf	Fix memguard(9) in kernels with INVARIANTS enabled. With r284861, UMA zones use the trash ctor and dtor by default. This is incompatible with memguard, which frees the backing page when the item is freed. Modify the UMA debug functions to be no-ops if the item was allocated from memguard. This also fixes constructors such as mb_ctor_pack(), which invokes the trash ctor in addition to performing some initialization. Reviewed by: glebius MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D6562	2016-06-01 22:31:35 +00:00
Pedro F. Giffuni	763df3ec55	sys/vm: minor spelling fixes in comments. No functional change.	2016-05-02 20:16:29 +00:00
Gleb Smirnoff	cfcae3f86f	Remove UMA_ZONE_REFCNT feature, now unused. Blessed by: jeff	2016-03-01 00:33:32 +00:00
Gleb Smirnoff	e60b2fcbeb	Redo r292484. Embed task(9) into zone, so that uz_maxaction is called in a context that can sleep, allowing consumers of the KPI to run their drain routines without any extra measures. Discussed with: jtl	2016-02-03 23:30:17 +00:00
Gleb Smirnoff	9542ea7b80	Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows to make uma_dbg.h not depend on uma_int.h, which allows to uninclude uma_int.h from the mbuf(9) allocator.	2016-02-03 22:02:36 +00:00
Jonathan T. Looney	54503a13d8	Add a safety net to reclaim mbufs when one of the mbuf zones become exhausted. It is possible for a bug in the code (or, theoretically, even unusual network conditions) to exhaust all possible mbufs or mbuf clusters. When this occurs, things can grind to a halt fairly quickly. However, we currently do not call mb_reclaim() unless the entire system is experiencing a low-memory condition. While it is best to try to prevent exhaustion of one of the mbuf zones, it would also be useful to have a mechanism to attempt to recover from these situations by freeing "expendable" mbufs. This patch makes two changes: a) The patch adds a generic API to the UMA zone allocator to set a function that should be called when an allocation fails because the zone limit has been reached. Because of the way this function can be called, it really should do minimal work. b) The patch uses this API to try to free mbufs when an allocation fails from one of the mbuf zones because the zone limit has been reached. The function schedules a callout to run mb_reclaim(). Differential Revision: https://reviews.freebsd.org/D3864 Reviewed by: gnn Comments by: rrs, glebius MFC after: 2 weeks Sponsored by: Juniper Networks	2015-12-20 02:05:33 +00:00
Mark Johnston	d9e2e68d38	Don't make assertions about td_critnest when the scheduler is stopped. A panicking thread always executes with a critical section held, so any attempt to allocate or free memory while dumping will otherwise cause a second panic. This can occur, for example, if xpt_polled_action() completes non-dump I/O that was pending at the time of the panic. The fact that this can occur is itself a bug, but asserting in this case does little but reduce the reliability of kernel dumps. Suggested by: kib Reported by: pho	2015-12-11 20:05:07 +00:00
Jonathan T. Looney	1067a2ba68	Consistently enforce the restriction against calling malloc/free when in a critical section. uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the zone. The malloc() family of functions may call uma_zalloc_arg() or uma_zalloc_free(). The malloc(9) man page currently claims that free() will never sleep. It also implies that the malloc() family of functions will not sleep when called with M_NOWAIT. However, it is more correct to say that these functions will not sleep indefinitely. Indeed, they may acquire a sleepable lock. However, a developer may overlook this restriction because the WITNESS check that catches attempts to call the malloc() family of functions within a critical section is inconsistenly applied. This change clarifies the language of the malloc(9) man page to clarify the restriction against calling the malloc() family of functions while in a critical section or holding a spin lock. It also adds KASSERTs at appropriate points to make the enforcement of this restriction more consistent. PR: 204633 Differential Revision: https://reviews.freebsd.org/D4197 Reviewed by: markj Approved by: gnn (mentor) Sponsored by: Juniper Networks	2015-11-19 14:04:53 +00:00
Alan Cox	087a613247	Exploit r288122 to address a cosmetic issue. Since the pages allocated by noobj_alloc() don't belong to a vm object, they can't be paged out. Since they can't be paged out, they are never enqueued in a paging queue. Nonetheless, passing PQ_INACTIVE to vm_page_unwire() creates the appearance that these pages are being enqueued in the inactive queue. As of r288122, we can avoid giving this false impression by passing PQ_NONE. Submitted by: kmacy Differential Revision: https://reviews.freebsd.org/D1674	2015-09-26 17:45:10 +00:00
Mateusz Guzik	19c591bfe2	Don't trash memory from UMA_ZONE_NOFREE zones. Objects obtained from such zones are supposed to retain type stability, which was violated by aforementioned trashing. This is a follow-up to r284861. Discussed with: kib	2015-09-02 23:09:01 +00:00
Mark Murray	e866d8f05b	Make the UMA harvesting go away completely if not wanted. Default to "not wanted". Provide and document the RANDOM_ENABLE_UMA option. Change RANDOM_FAST to RANDOM_UMA to clarify the harvesting. Remove RANDOM_DEBUG option, replace with SDT probes. These will be of use to folks measuring the harvesting effect when deciding whether to use RANDOM_ENABLE_UMA. Requested by: scottl and others. Approved by: so (/dev/random blanket) Differential Revision: https://reviews.freebsd.org/D3197	2015-08-22 12:59:05 +00:00

1 2 3 4 5 ...

284 Commits