freebsd-skq

Author	SHA1	Message	Date
Mark Johnston	325c4ced0d	Restore the reservation of boot pages for bucket zones after r355707. uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(), but before that assignment, startup_alloc() will use pages from the reserved pool, so the bucket zones themselves are still allocated using startup pages. Reviewed by: rlibby Reported by: Jenkins via lwhsu Differential Revision: https://reviews.freebsd.org/D22797	2019-12-13 18:28:01 +00:00
Ryan Libby	d82c8ffb16	Revert r355706 & r355710 The quick fix didn't work. I'll sort it out tomorrow. Revert r355710: "libmemstat: unbreak build" Revert r355706: "uma dbg: flexible size for slab debug bitset too"	2019-12-13 11:21:28 +00:00
Ryan Libby	f7af501519	uma: report slab efficiency Reviewed by: jeff Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22766	2019-12-13 09:32:09 +00:00
Ryan Libby	3182660a85	uma: delay bucket_init() until we might actually enable buckets This helps with a bootstrapping problem in upcoming work. We don't first enable buckets until uma_startup2(), so we can delay bucket creation until then. The other two paths to bucket_enable() are both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM) and one in uma_timeout() (first activated in uma_startup3()). Note that although some bucket functions are accessible before uma_startup2() (e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone. Discussed with: jeff Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22765	2019-12-13 09:32:03 +00:00
Ryan Libby	7508f15ff1	uma dbg: flexible size for slab debug bitset too Recently (r355315) the size of the struct uma_slab bitset field us_free became dynamic instead of conservative. Now, make the debug bitset size dynamic too. The debug bitset is INVARIANTS-only, so in fact we don't care too much about the space savings that results from this, but enabling minimally-sized slabs on INVARIANTS builds is still important in order to be able to test new slab layouts effectively. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22759	2019-12-13 09:31:59 +00:00
Ryan Libby	6d204a6a0e	uma: pretty print zone flags sysctl Requested by: jeff Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22748	2019-12-11 06:50:55 +00:00
Jeff Roberson	3b490537f4	Fix two problems with r355149. The sysctl name collision code assumed that zones would never be freed. In the case of tmpfs this was not true. While here test for the right bit to disable the keg related sysctls for zones that don't have kegs. Reported by: pho Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22655	2019-12-08 01:55:23 +00:00
Jeff Roberson	1e0701e1e5	Use a variant slab structure for offpage zones. This saves space in embedded slabs but also is an opportunity to tidy up code and add accessor inlines. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22609	2019-12-08 01:15:06 +00:00
Andrew Turner	b75c4efcd2	Fix the signature for zone_import and zone_release These are cast to uma_import and uma_release functions. Use the signature for these in the zone functions. This was found with an experimental Kernel CFI. It will complain if the signature is different than what a function pointer expects. The simplest way to fix these is to correct the signature. Reviewed by: rlibby Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D22671	2019-12-04 18:40:05 +00:00
Jeff Roberson	9b78b1f433	Use a precise bit count for the slab free items in UMA. This significantly shrinks embedded slab structures. Reviewed by: markj, rlibby (prior version) Differential Revision: https://reviews.freebsd.org/D22584	2019-12-02 22:44:34 +00:00
Jeff Roberson	6d6a03d7a8	Handle large mallocs by going directly to kmem. Taking a detour through UMA does not provide any additional value. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22563	2019-11-29 03:14:10 +00:00
Jeff Roberson	584061b480	Garbage collect the mostly unused us_keg field. Use appropriately named union members in vm_page.h to store the zone and slab. Remove some nearby dead code. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22564	2019-11-28 07:49:25 +00:00
Ryan Libby	35ec24f362	uma: move sysctl vm.uma defn out from under INVARIANTS Fix non-INVARIANTS builds after r355149. Reported by: Michael Butler <imb@protected-networks.net> Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22588	2019-11-28 04:15:16 +00:00
Jeff Roberson	20a4e15451	Implement a sysctl tree for uma zones to assist in debugging and provide more statistics than are exported via the ABI stable vmstat interface. Rename uz_count to uz_bucket_size because even I was confused by the name after returning to the source years later. Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22554	2019-11-28 00:19:09 +00:00
Jeff Roberson	0a81b4395e	Refactor uma_zfree_arg into several functions to make control flow more clear and icache usage cleaner. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22491	2019-11-27 23:19:06 +00:00
Ryan Libby	ca293436d1	uma: trash memory when ctor/dtor supplied too On INVARIANTS kernels, UMA has a use-after-free detection mechanism. This mechanism previously required that all of the ctor/dtor/uminit/fini arguments to uma_zcreate() be NULL in order to function. Now, it only requires that uminit and fini be NULL; now, the trash ctor and dtor will be called in addition to any supplied ctor or dtor. Also do a little refactoring for readability of the resulting logic. This enables use-after-free detection for more zones, and will allow for simplification of some callers that worked around the previous restriction (see kern_mbuf.c). Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20722	2019-11-27 19:49:55 +00:00
Jeff Roberson	beb8beef81	Refactor uma_zalloc_arg(). It is a mess of gotos and code which doesn't make sense after many partial refactors. Attempt to make a smaller cache footprint for the fast path. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22470	2019-11-26 22:17:02 +00:00
Mark Johnston	003cf08ba9	Revise the page cache size policy. In r353734 the use of the page caches was limited to systems with a relatively large amount of RAM per CPU. This was to mitigate some issues reported with the system not able to keep up with memory pressure in cases where it had been able to do so prior to the addition of the direct free pool cache. This change re-enables those caches. The change modifies uma_zone_set_maxcache(), which was introduced specifically for the page cache zones. Rather than using it to limit only the full bucket cache, have it also set uz_count_max to provide an upper bound on the per-CPU cache size that is consistent with the number of items requested. Remove its return value since it has no use. Enable the page cache zones unconditionally, and limit them to 0.1% of the domain's pages. The limit can be overridden by the vm.pgcache_zone_max tunable as before. Change the item size parameter passed to uma_zcache_create() to the correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page cache buckets to be adaptively sized, like the rest of UMA's caches. This also causes the initial bucket size to be small, so only systems which benefit from large caches will get them. Reviewed by: gallatin, jeff MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22393	2019-11-22 16:30:47 +00:00
Jeff Roberson	71353f7a2f	When we set OFFPAGE to limit fragmentation we should also set VTOSLAB so that we avoid the hashtables. The hashtable is now only required if a zone is created with OFFPAGE specified initially, not internally. This flag signals to UMA that it can't touch the allocated memory and so can't store a slab pointer in the containing page. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22453	2019-11-20 01:57:33 +00:00
Konstantin Belousov	08034d1006	Include cache zones into zone_foreach() where appropriate. The r354367 is reverted since it is subsumed by this, more complete, approach. Suggested by: markj Reviewed by: alc, glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22242	2019-11-10 09:25:19 +00:00
Konstantin Belousov	432fc36da1	Switch cache zones from early counters to real implementation. Early counter mock can be only used on BSP for amd64, when APs try to update it that causes random memory corruption. N.B. This is a temporary patch to plug the corruption for now, while a proper solution for handling cache zones in zone_foreach() is being developed. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation, Mellanox Technologies	2019-11-05 21:38:48 +00:00
Mark Johnston	1de9724e55	Avoid reloading bucket pointers in uma_vm_zone_stats(). The correctness of per-CPU cache accounting in that function is dependent on reading per-CPU pointers exactly once. Ensure that the compiler does not emit multiple loads of those pointers. Reported and tested by: pho Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22081	2019-10-22 14:20:06 +00:00
Conrad Meyer	0223790f8d	Fix braino in r353429 cy@ points out that I got parameter order backwards between definition and invocation of the helper function. He is totally correct. The earlier version of this patch predated the XFree column so this is one I introduced, rather than the original author. Submitted by: cy Reported by: cy X-MFC-With: r353429	2019-10-11 06:02:03 +00:00
Conrad Meyer	46d70077be	ddb: Add CSV option, sorting to 'show (malloc|uma)' Add /i option for machine-parseable CSV output. This allows ready copy/pasting into more sophisticated tooling outside of DDB. Add total zone size ("Memory Use") as a new column for UMA. For both, sort the displayed list on size (print the largest zones/types first). This is handy for quickly diagnosing "where has my memory gone?" at a high level. Submitted by: Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version) Sponsored by: Dell EMC Isilon	2019-10-11 01:31:31 +00:00
Mark Johnston	08cfa56ea3	Extend uma_reclaim() to permit different reclamation targets. The page daemon periodically invokes uma_reclaim() to reclaim cached items from each zone when the system is under memory pressure. This is important since the size of these caches is unbounded by default. However it also results in bursts of high latency when allocating from heavily used zones as threads miss in the per-CPU caches and must access the keg in order to allocate new items. With r340405 we maintain an estimate of each zone's usage of its (per-NUMA domain) cache of full buckets. Start making use of this estimate to avoid reclaiming the entire cache when under memory pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only items in excess of the estimate are reclaimed. Draining a zone reclaims all of the cached full buckets (the previous behaviour of uma_reclaim()), and may further drain the per-CPU caches in extreme cases. Now, when under memory pressure, the page daemon will trim zones rather than draining them. As a result, heavily used zones do not incur bursts of bucket cache misses following reclamation, but large, unused caches will be reclaimed as before. Reviewed by: jeff Tested by: pho (an earlier version) MFC after: 2 months Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16667	2019-09-01 22:22:43 +00:00
Mark Johnston	b48d4efe75	Handle UMA_ANYDOMAIN in kstack_import(). The kernel thread stack zone performs first-touch allocations by default, and must handle the case where the local memory domain is empty. For most UMA zones this is handled in the keg layer, but cache zones currently must implement a policy for this case. Simply use a round-robin policy if UMA_ANYDOMAIN is passed. Reported and tested by: bcran Reviewed by: kib Sponsored by: The FreeBSD Foundation	2019-08-25 21:14:46 +00:00
Jeff Roberson	eda1b01647	Implement a MINBUCKET zone flag so we can use minimal caching on zones that may be expensive to cache. Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20930	2019-08-06 23:04:59 +00:00
Jeff Roberson	c168508655	Add two new kernel options to control memory locality on NUMA hardware. - UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that was freed on a different domain from where it was allocated. This is only used for UMA_ZONE_NUMA (first-touch) zones. - UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all zones. This tries to maintain locality for kernel memory. Reviewed by: gallatin, alc, kib Tested by: pho, gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20929	2019-08-06 21:50:34 +00:00
Mark Johnston	88ea538a98	Replace uses of vm_page_unwire(m, PQ_NONE) with vm_page_unwire_noq(m). These calls are not the same in general: the former will dequeue the page if it is enqueued, while the latter will just leave it alone. But, all existing uses of the former apply to unmanaged pages, which are never enqueued in the first place. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20470	2019-06-07 18:23:29 +00:00
Alexander Motin	3b2f2cb8e9	Allow UMA hash tables to expand faster than 2x in 20 seconds. ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables. With initial hash table size of only 32 elements it takes ~20 expansions or ~400 seconds to adapt to handling 220GB ZFS ARC. During that time not only the hash table is highly inefficient, but also each of those expansions takes significant time with the lock held, blocking operation. On my test system with 256GB of RAM and ZFS pool of 28 HDDs this change reduces time needed to first time read 240GB from ~300-400s, during which system is quite busy and unresponsive, to only ~150s with light CPU load and just 5 sub-second CPU spikes to expand the hash table. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-06-06 23:57:28 +00:00
Mark Johnston	fbd9585915	Add sysctls for uma_kmem_{limit,total}. Reviewed by: alc, dougm, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20514	2019-06-06 16:26:58 +00:00
Mark Johnston	058f0f7464	Remove the volatile qualifier from uma_kmem_total. No functional change intended. Reviewed by: alc, dougm, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20514	2019-06-06 16:23:44 +00:00
Gleb Smirnoff	4a9f6ba75b	In r343857 the referred comment moved to uma_vm_zone_stats().	2019-05-29 22:33:37 +00:00
Tycho Nightingale	323ad38632	for a cache-only zone the destructor tries to destroy a non-existent keg Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D19835	2019-04-12 12:46:25 +00:00
Pedro F. Giffuni	6929b7d1ab	UMA: unsign some variables related to allocation in hash_alloc(). As a followup to r343673, unsign some variables related to allocation since the hashsize cannot be negative. This gives a bit more space to handle bigger allocations and avoid some implicit casting. While here also unsign uh_hashmask, it makes little sense to keep that signed. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19148	2019-02-12 04:33:05 +00:00
Gleb Smirnoff	ad66f95865	Now that there is only one way to allocate a slab, remove uz_slab method. Discussed with: jeff	2019-02-07 03:55:05 +00:00
Gleb Smirnoff	b47acb0a4d	Report cache zones in UMA stats sysctl, that 'vmstat -z' uses. This should have been part of r251826.	2019-02-07 03:32:45 +00:00
Alexander Motin	59568a0e52	Fix integer math overflow in UMA hash_alloc(). 512GB of ZFS ABD ARC means abd_chunk zone of 128M 4KB items. To manage them UMA tries to allocate 2GB hash table, which size does not fit into the int variable, causing later allocation failure, which makes ARC shrink back below the 512GB, not letting it to use more RAM. With this change I easily reached >700GB ARC size on 768GB RAM machine. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-02-02 04:11:59 +00:00
Gleb Smirnoff	37125720b9	In zone_alloc_bucket() max argument was calculated based on uz_count. Then bucket_alloc() also selects bucket size based on uz_count. However, since zone lock is dropped, uz_count may reduce. In this case max may be greater than ub_entries and that would yield into writing beyond end of the allocation. Reported by: pho	2019-01-31 17:52:48 +00:00
Mark Johnston	862203935e	Correct uma_prealloc()'s use of domainset iterators after r339925. The iterator should be reinitialized after every successful slab allocation. A request to advance the iterator is interpreted as an allocation failure, so a sufficiently large preallocation would cause the iterator to believe that all domains were exhausted, resulting in a sleep with the keg lock held. [1] Also, keg_alloc_slab() should pass the unmodified wait flag to the item initialization routine, which may use it to perform allocations from other zones. Reported and tested by: slavah Diagnosed by: kib [1] Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-01-23 18:58:15 +00:00
Gleb Smirnoff	e7e4bcd856	style(9): break long line.	2019-01-15 18:50:11 +00:00
Gleb Smirnoff	f8c86a5fde	Remove harmless leftover from code that cycles over zone's kegs. Just use + instead of +=. There is no functional change.	2019-01-15 18:49:31 +00:00
Gleb Smirnoff	bb45b411e2	Only do uz_items accounting for zones that have a limit set in uz_max_items. This reduces amount of locking required for these zones. Also, for cache only zones (UMA_ZFLAG_CACHE) accounting uz_items wasn't correct at all, since they may allocate items directly from their backing store and then free them via UMA underflowing uz_items. Tested by: pho	2019-01-15 18:32:26 +00:00
Gleb Smirnoff	2efcc8cbca	Make uz_allocs, uz_frees and uz_fails counter(9). This removes some atomic updates and reduces amount of data protected by zone lock. During startup point these fields to EARLY_COUNTER. After startup allocate them for all early zones. Tested by: pho	2019-01-15 18:24:34 +00:00
Gleb Smirnoff	5a8eee2bb4	Fix compilation on 32-bit.	2019-01-15 03:43:46 +00:00
Gleb Smirnoff	bb15d1c778	o Move zone limit from keg level up to zone level. This means that now two zones sharing a keg may have different limits. Now this is going to work: zone = uma_zcreate(); uma_zone_set_max(zone, limit); zone2 = uma_zsecond_create(zone); uma_zone_set_max(zone2, limit2); Kegs no longer have uk_maxpages field, but zones have uz_items. When set, it may be rounded up to minimum possible CPU bucket cache size. For small limits bucket cache can also be reconfigured to be smaller. Counter uz_items is updated whenever items transition from keg to a bucket cache or directly to a consumer. If zone has uz_maxitems set and it is reached, then we are going to sleep. o Since new limits don't play well with multi-keg zones, remove them. The idea of multi-keg zones was introduced exactly 10 years ago, and never have had a practical usage. In discussion with Jeff we came to a wild agreement that if we ever want to reintroduce the idea of a smart allocator that would be able to choose between two (or more) totally different backing stores, that choice should be made one level higher than UMA, e.g. in malloc(9) or in mget(), or whatever and choice should be controlled by the caller. o Sleeping code is improved to account number of sleepers and wake them one by one, to avoid thundering herd problem. o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache() KPI added. Having no bucket cache basically means setting maxcache to 0. o Now with many fields added and many removed (no multi-keg zones!) make sure that struct uma_zone is perfectly aligned. Reviewed by: markj, jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D17773	2019-01-15 00:02:06 +00:00
Gleb Smirnoff	0b2e3aead3	Fix yet another edge case in uma_startup_count(). If zone size fits into several pages, but leaves no space for struct uma_slab at the end we miscalculate number of pages by one. Totally mimic keg_large_init() math here to cover that problem. Reported by: gallatin	2018-11-28 19:54:02 +00:00
Gleb Smirnoff	3d5e3df73f	For not offpage zones the slab is placed at the end of page. Keg's uk_pgoff is calculated to guarantee that struct uma_slab is placed at pointer size alignment. Calculation of real struct uma_slab size is done in keg_ctor() and yet again in keg_large_init(), to check if we need an extra page. This calculation can actually be performed at compile time. - Add SIZEOF_UMA_SLAB macro to calculate size of struct uma_slab placed at an end of a page with alignment requirement. - Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is not a functional change. - Use SIZEOF_UMA_SLAB in UMA_SLAB_SPACE definition and in keg_small_init(). This is a potential bugfix, but in reality I don't think there are any systems affected, since compiler aligns struct uma_slab anyway.	2018-11-28 19:17:27 +00:00
Mark Johnston	0f9b7bf37a	Add accounting to per-domain UMA full bucket caches. In particular, track the current size of the cache and maintain an estimate of its working set size. This will be used to decide how much to shrink various caches when the kernel attempts to reclaim pages. As a secondary effect, it makes statistics aggregation (done by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer needs to iterate over lists of cached buckets. Discussed with: alc, glebius, jeff Tested by: pho (previous version) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D16666	2018-11-13 19:44:40 +00:00
Mark Johnston	9978bd996b	Add malloc_domainset(9) and _domainset variants to other allocator KPIs. Remove malloc_domain(9) and most other _domain KPIs added in r327900. The new functions allow the caller to specify a general NUMA domain selection policy, rather than specifically requesting an allocation from a specific domain. The latter policy tends to interact poorly with M_WAITOK, resulting in situations where a caller is blocked indefinitely because the specified domain is depleted. Most existing consumers of the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy, in which we fall back to other domains to satisfy the allocation request. This change also defines a set of DOMAINSET_FIXED() policies, which only permit allocations from the specified domain. Discussed with: gallatin, jeff Reported and tested by: pho (previous version) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17418	2018-10-30 18:26:34 +00:00
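
The hash table figures quoted in commits 3b2f2cb8e9 and 59568a0e52 can be checked with a little arithmetic. The following is only a rough sketch in Python, not kernel code: the helper names are hypothetical, a 32-bit signed int is assumed, and the expansion count assumes one hash entry per 4KB item, which is a simplification of how UMA actually sizes its tables.

```python
# Rough arithmetic check of numbers quoted in the commit messages above.
# All function names here are illustrative and do not exist in FreeBSD.

KIB = 1024
GIB = 1024 ** 3
INT_MAX = 2 ** 31 - 1  # assumption: int is a 32-bit signed type


def doublings_needed(initial_entries, target_entries):
    """Count how many 2x expansions grow a table to at least target_entries."""
    size = initial_entries
    count = 0
    while size < target_entries:
        size *= 2
        count += 1
    return count


def fits_in_int(nbytes):
    """Return True if a byte count is representable in a 32-bit signed int."""
    return nbytes <= INT_MAX


# 3b2f2cb8e9: a 220GB ARC built from 4KB chunks, starting at 32 entries.
items_220g = 220 * GIB // (4 * KIB)
expansions = doublings_needed(32, items_220g)
# With at most one 2x expansion every 20 seconds (per the commit title),
# this is the minimum time the table needs to catch up.
seconds = expansions * 20

# 59568a0e52: a 512GB ARC is 128M 4KB items, and a 2GB table size
# is one byte past what a 32-bit signed int can hold.
items_512g = 512 * GIB // (4 * KIB)
table_size = 2 * GIB

if __name__ == "__main__":
    print("items for 220GB:", items_220g)
    print("2x expansions from 32 entries:", expansions, "(about", seconds, "s)")
    print("items for 512GB:", items_512g)
    print("2GB fits in int:", fits_in_int(table_size))
```

Under these assumptions the doubling count lands close to the "~20 expansions or ~400 seconds" figure in the commit message, and the 2GB table size exceeds the int range, which matches the overflow described in the earlier fix.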

350 Commits