freebsd-dev

Author	SHA1	Message	Date
Matt Macy	5072a5f465	malloc: avoid possibly returning stack garbage if MALLOC_DEBUG is defined	2018-05-19 04:43:49 +00:00
Matt Macy	06bf2a6aef	Add simple preempt safe epoch API Read locking is over used in the kernel to guarantee liveness. This API makes it easy to provide livenes guarantees without atomics. Includes epoch_test kernel module to stress test the API. Documentation will follow initial use case. Test case and improvements to preemption handling in response to discussion with mjg@ Reviewed by: imp@, shurd@ Approved by: sbruno@	2018-05-10 17:55:24 +00:00
Mateusz Guzik	7cd794214a	dtrace: depessimize dtmalloc when dtrace is active Each malloc/free was testing dtrace_malloc_enabled and forcing extra reads from the malloc type struct to see if perhaps a dtmalloc probe was on. Treat it like lockstat and sdt: have a global bolean.	2018-04-24 01:06:20 +00:00
Mateusz Guzik	c9e05ccd62	malloc: stop reading the subzone if MALLOC_DEBUG_MAXZONES == 1 (the default) malloc was showing at the top of profile during while running microbenchmarks. #define DTMALLOC_PROBE_MAX 2 struct malloc_type_internal { uint32_t mti_probes[DTMALLOC_PROBE_MAX]; u_char mti_zone; struct malloc_type_stats mti_stats[MAXCPU]; }; Reading mti_zone it wastes a cacheline to hold mti_probes + mti_zone (which we know is 0) + part of malloc stats of the first cpu which on top induces false-sharing. In particular will-it-scale lock1_processes -t 128 -s 10: before: average:45879692 after: average:51655596 Note the counters can be padded but the right fix is to move them to counter(9), leaving the struct read-only after creation (modulo dtrace probes).	2018-04-23 22:28:49 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
Li-Wen Hsu	5a70796a71	Fix build for architectures where size_t is not unsigned long Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D14045	2018-01-25 06:37:14 +00:00
Conrad Meyer	bd555da94b	malloc(9): Change nominal size to size_t to match standard C No functional change -- size_t matches unsigned long on all platforms. Reported by: bde Discussed with: jhb Sponsored by: Dell EMC Isilon	2018-01-24 19:37:18 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Conrad Meyer	c02fc9607a	mallocarray(9): panic if the requested allocation would overflow Additionally, move the overflow check logic out to WOULD_OVERFLOW() for consumers to have a common means of testing for overflowing allocations. WOULD_OVERFLOW() should be a secondary check -- on 64-bit platforms, just because an allocation won't overflow size_t does not mean it is a sane size to request. Callers should be imposing reasonable allocation limits far, far, below overflow. Discussed with: emaste, jhb, kp Sponsored by: Dell EMC Isilon	2018-01-10 21:49:45 +00:00
Kristof Provost	fd91e076c1	Introduce mallocarray() in the kernel Similar to calloc() the mallocarray() function checks for integer overflows before allocating memory. It does not zero memory, unless the M_ZERO flag is set. Reviewed by: pfg, vangyzen (previous version), imp (previous version) Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D13766	2018-01-07 13:21:01 +00:00
Jeff Roberson	2e47807c21	Eliminate kmem_arena and kmem_object in preparation for further NUMA commits. The arena argument to kmem_*() is now only used in an assert. A follow-up commit will remove the argument altogether before we freeze the API for the next release. This replaces the hard limit on kmem size with a soft limit imposed by UMA. When the soft limit is exceeded we periodically wakeup the UMA reclaim thread to attempt to shrink KVA. On 32bit architectures this should behave much more gracefully as we exhaust KVA. On 64bit the limits are likely never hit. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13187	2017-11-28 23:40:54 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Ed Maste	69a2875821	Renumber license clauses in sys/kern to avoid skipping #3	2016-09-15 13:16:20 +00:00
Mark Johnston	5e0a6f31e5	Move IPv6 malloc tag definitions into the IPv6 code.	2016-05-20 04:45:08 +00:00
Gleb Smirnoff	b28cc462ad	Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a requirement for uma_int.h. Suggested by: jhb	2016-02-09 20:22:35 +00:00
Gleb Smirnoff	e60b2fcbeb	Redo r292484. Embed task(9) into zone, so that uz_maxaction is called in a context that can sleep, allowing consumers of the KPI to run their drain routines without any extra measures. Discussed with: jtl	2016-02-03 23:30:17 +00:00
Mark Johnston	d9e2e68d38	Don't make assertions about td_critnest when the scheduler is stopped. A panicking thread always executes with a critical section held, so any attempt to allocate or free memory while dumping will otherwise cause a second panic. This can occur, for example, if xpt_polled_action() completes non-dump I/O that was pending at the time of the panic. The fact that this can occur is itself a bug, but asserting in this case does little but reduce the reliability of kernel dumps. Suggested by: kib Reported by: pho	2015-12-11 20:05:07 +00:00
Jonathan T. Looney	1067a2ba68	Consistently enforce the restriction against calling malloc/free when in a critical section. uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the zone. The malloc() family of functions may call uma_zalloc_arg() or uma_zalloc_free(). The malloc(9) man page currently claims that free() will never sleep. It also implies that the malloc() family of functions will not sleep when called with M_NOWAIT. However, it is more correct to say that these functions will not sleep indefinitely. Indeed, they may acquire a sleepable lock. However, a developer may overlook this restriction because the WITNESS check that catches attempts to call the malloc() family of functions within a critical section is inconsistenly applied. This change clarifies the language of the malloc(9) man page to clarify the restriction against calling the malloc() family of functions while in a critical section or holding a spin lock. It also adds KASSERTs at appropriate points to make the enforcement of this restriction more consistent. PR: 204633 Differential Revision: https://reviews.freebsd.org/D4197 Reviewed by: markj Approved by: gnn (mentor) Sponsored by: Juniper Networks	2015-11-19 14:04:53 +00:00
Konstantin Belousov	44ec2b63c5	The vmem callback to reclaim kmem arena address space on low or fragmented conditions currently just wakes up the pagedaemon. The kmem arena is significantly smaller then the total available physical memory, which means that there are loads where kmem arena space could be exhausted, while there is a lot of pages available still. The woken up pagedaemon sees vm_pages_needed != 0, verifies the condition vm_paging_needed() which is false, clears the pass and returns back to sleep, not calling neither uma_reclaim() nor lowmem handler. To handle low kmem arena conditions, create additional pagedaemon thread which calls uma_reclaim() directly. The thread sleeps on the dedicated channel and kmem_reclaim() wakes the thread in addition to the pagedaemon. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-09 20:08:36 +00:00
Ian Lepore	1eafc07856	Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl strings returned to userland include the nulterm byte. Some uses of sbuf_new_for_sysctl() write binary data rather than strings; clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in those cases. (Note that the sbuf code still automatically adds a nulterm byte in sbuf_finish(), but since it's not included in the length it won't get copied to userland along with the binary data.) Remove explicit adding of a nulterm byte in a couple places now that it gets done automatically by the sbuf drain code. PR: 195668	2015-03-14 17:08:28 +00:00
Sean Bruno	7c51714e0a	svn revisions r269964 and r269963 seemed to have impaired small memory footprint systems(32M/64M) and didn't leave enough free memory to load modules when it was setting up page tables that for sizes that are never used on these smallish boards. Set kmem_zmax to PAGE_SIZE on these smaller systems (< 128M) to keep this from happening. Verified on mips32 h/w. PR: 193465 Submitted by: delphij Reviewed by: adrian	2014-09-22 05:07:22 +00:00
Xin LI	7001d850bb	Add a new loader tunable, vm.kmem_zmax which allows a system administrator to limit the maximum allocation size that malloc(9) would consider using the UMA cache allocator as backend. Suggested by: alfred MFC after: 2 weeks	2014-08-14 05:31:39 +00:00
Xin LI	bda06553fd	Re-instate UMA cached backend for 4K - 64K allocations. New consumers like geli(4) uses malloc(9) to allocate temporary buffers that gets free'ed shortly, causing frequent TLB shootdown as observed in hwpmc supported flame graph. Discussed with: jeff, alfred MFC after: 1 week	2014-08-14 05:13:24 +00:00
Hans Petter Selasky	4813ad54f8	Compile fixes: Remove duplicate "debug_ktr.mask" sysctl definition. Remove now unused variable from "kern_ktr.c". This fixes build of "ktr" which was broken by r267961. Let the default value for "vm_kmem_size_scale" be zero. It is setup after that the sysctl has been initialized from "getenv()" in the "kmeminit()" function to equal the "VM_KMEM_SIZE_MAX" value, if zero. On Sparc64 the "VM_KMEM_SIZE_MAX" macro is not a constant. This fixes build of Sparc64 which was broken by r267961. Add a special macro to dynamically create SYSCTL root nodes, because root nodes have a special parent. This fixes build of existing OFED module and CANBUS module for pc98 which was broken by r267961. Add missing "sysctl.h" includes to get the needed sysctl header file declarations. This is needed after r267961. MFC after: 2 weeks	2014-06-28 17:36:18 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Dimitry Andric	f9d498ad60	On sparc64, VM_KMEM_SIZE_SCALE is not a constant expression, so it cannot be tested in a CTASSERT().	2014-02-23 17:37:24 +00:00
Attilio Rao	54366c0bd7	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
Alan Cox	c70af4875e	As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other words, every architecture is now auto-sizing the kmem arena. This revision changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes mandatory and the definition of VM_KMEM_SIZE becomes optional. Replace or eliminate all existing definitions of VM_KMEM_SIZE. With auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for clarity. Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to that of setting the tunable vm.kmem_size. Whereas the macros VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set the size of the kmem arena at compile-time without that value being overridden by auto-sizing. Update the nearby comments to reflect the kmem submap being replaced by the kmem arena. Stop duplicating the auto-sizing formula in every machine- dependent vmparam.h and place it in kmeminit() where auto-sizing takes place. Reviewed by: kib (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-11-08 16:25:00 +00:00
Alan Cox	61083fcc61	Tidy up kmeminit(): Since r245575, 'nmbclusters' is calculated after kmeminit() runs, so it contributes nothing to 'vm_kmem_size'; update a comment to reflect that r254025 replaced the kmem submap with the kmem arena. Reviewed by: kib Approved by: re (gjb) Sponsored by: EMC / Isilon Storage Division	2013-10-05 18:53:03 +00:00
Jeff Roberson	99de9af2a6	- Disable quantum caches on the kmem_arena. This can make fragmentation worse on small KVA systems. I had intended to only enable it for debugging. Sponsored by: EMC / Isilon Storage Division	2013-08-13 22:41:24 +00:00
Olivier Houchard	e137643ef3	Instead of just trying to do it for arm, make sure vm_kmem_size is properly aligned in kmeminit(), where it'll work for any arch. Suggested by: alc	2013-08-09 22:30:54 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Konstantin Belousov	1f3ad93be7	Remove unused malloc type. Requested by: alc MFC after: 1 week	2013-08-01 12:55:41 +00:00
Marius Strobl	94bfd5b1a0	Try to improve r242655 take III: move these SYSCTLs describing the kernel map, which is defined and initialized in vm/vm_kern.c, to the latter. Submitted by: alc	2013-02-04 09:35:48 +00:00
Marius Strobl	e8cbe54bc4	Further improve r242655 and supply VM_{MIN,MAX}_KERNEL_ADDRESS as constant values to SYSCTL_ULONG(9) where possible. Submitted by: bde	2013-02-03 21:43:55 +00:00
Marius Strobl	c882264c95	Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address vm_offset_t as they should be.	2012-11-08 08:10:32 +00:00
Alfred Perlstein	fc6874bcbb	export VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS via sysctl. On several platforms the are determined by too many nested #defines to be easily discernible. This will aid in development of auto-tuning.	2012-11-06 04:10:32 +00:00
Matthew D Fleming	f806cdcf99	Fix a bug with memguard(9) on 32-bit architectures without a VM_KMEM_MAX_SIZE. The code was not taking into account the size of the kernel_map, which the kmem_map is allocated from, so it could produce a sub-map size too large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely and base the memguard map's size off the kernel_map's size, since this is always relevant and always smaller. Found by: Justin Hibbits	2012-07-15 20:29:48 +00:00
John Baldwin	687c94aac9	Honor db_pager_quit in 'show uma' and 'show malloc'. MFC after: 1 month	2012-07-02 16:14:52 +00:00
John Baldwin	831ce4cb3d	- Change contigmalloc() to use the vm_paddr_t type instead of an unsigned long for specifying a boundary constraint. - Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary constraints. These allow boundary constraints to be fully expressed for cases where sizeof(bus_addr_t) != sizeof(bus_size_t). Specifically, it allows a driver to properly specify a 4GB boundary in a PAE kernel. Note that this cannot be safely MFC'd without a lot of compat shims due to KBI changes, so I do not intend to merge it. Reviewed by: scottl	2012-03-01 19:58:34 +00:00
Alan Cox	ea3f07d3a0	Eliminate stale numbers from a comment.	2011-12-07 16:27:23 +00:00
Alan Cox	c749c003b8	Eliminate the possibility of 32-bit arithmetic overflow in the calculation of vm_kmem_size that may occur if the system administrator has specified a vm.vm_kmem_size tunable value that exceeds the hard cap. PR: 162741 Submitted by: Adam McDougall Reviewed by: bde@ MFC after: 3 weeks	2011-12-07 07:03:14 +00:00
Ed Schouten	6472ac3d8a	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.	2011-11-07 15:43:11 +00:00
Alan Cox	f346986b76	contigmalloc(9) and contigfree(9) are now implemented in terms of other more general VM system interfaces. So, their implementation can now reside in kern_malloc.c alongside the other functions that are declared in malloc.h.	2011-10-27 02:52:24 +00:00
Gleb Smirnoff	8d689e042f	Make memguard(9) capable to guard uma(9) allocations.	2011-10-12 18:08:28 +00:00
Alan Cox	1549ed03ff	Fix the handling of an empty kmem map by sysctl_kmem_map_free(). In the unlikely event that sysctl_kmem_map_free() was performed on an empty kmem map, it would incorrectly report the free space as zero. Discussed with: avg MFC after: 1 week	2011-10-08 18:29:30 +00:00
Alan Cox	e9a3f7852d	Modestly increase the maximum allowed size of the kmem map on i386. Also, express this new maximum as a fraction of the kernel's address space size rather than a constant so that increasing KVA_PAGES will automatically increase this maximum. As a side-effect of this change, kern.maxvnodes will automatically increase by a proportional amount. While I'm here ensure that this change doesn't result in an unintended increase in maxpipekva on i386. Calculate maxpipekva based upon the size of the kernel address space and the amount of physical memory instead of the size of the kmem map. The memory backing pipes is not allocated from the kmem map. It is allocated from its own submap of the kernel map. In short, it has no real connection to the kmem map. (In fact, the commit messages for the maxpipekva auto-sizing talk about using the kernel map size, cf. r117325 and r117391, even though the implementation actually used the kmem map size.) Although the calculation is now done differently, the resulting value for maxpipekva should remain almost the same on i386. However, on amd64, the value will be reduced by 2/3. This is intentional. The recent change to VM_KMEM_SIZE_SCALE on amd64 for the benefit of ZFS also had the unnecessary side-effect of increasing maxpipekva. This change is effectively restoring maxpipekva on amd64 to its prior value. Eliminate init_param3() since it is no longer used.	2011-03-23 16:38:29 +00:00
Matthew D Fleming	00f0e671ff	Explicitly wire the user buffer rather than doing it implicitly in sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT drain for extremely large amounts of data where the caller knows that appropriate references are held, and sleeping is not an issue. Inspired by: rwatson	2011-01-27 00:34:12 +00:00
Andriy Gapon	95bb9d38b8	add kmem_map_free sysctl: query largest contiguous free range in kmem_map Suggested by: alc Reviewed by: alc MFC after: 1 week	2010-10-09 09:03:17 +00:00
Andriy Gapon	7814c80a5b	vm.kmem_map_size: a sysctl to query current kmem_map->size Based on a patch from Sandvine Incorporated via emaste. Reviewed by: emaste MFC after: 1 week	2010-10-07 18:11:33 +00:00
Andriy Gapon	d801e824f6	kmem_size* sysctls: hint that these are also tunables MFC after: 1 week	2010-09-30 16:45:27 +00:00
Matthew D Fleming	4e6571599b	Re-add r212370 now that the LOR in powerpc64 has been resolved: Add a drain function for struct sysctl_req, and use it for a variety of handlers, some of which had to do awkward things to get a large enough SBUF_FIXEDLEN buffer. Note that some sysctl handlers were explicitly outputting a trailing NUL byte. This behaviour was preserved, though it should not be necessary. Reviewed by: phk (original patch)	2010-09-16 16:13:12 +00:00
Matthew D Fleming	404a593e28	Revert r212370, as it causes a LOR on powerpc. powerpc does a few unexpected things in copyout(9) and so wiring the user buffer is not sufficient to perform a copyout(9) while holding a random mutex. Requested by: nwhitehorn	2010-09-13 18:48:23 +00:00
Matthew D Fleming	dd67e2103c	Add a drain function for struct sysctl_req, and use it for a variety of handlers, some of which had to do awkward things to get a large enough FIXEDLEN buffer. Note that some sysctl handlers were explicitly outputting a trailing NUL byte. This behaviour was preserved, though it should not be necessary. Reviewed by: phk	2010-09-09 18:33:46 +00:00
Matthew D Fleming	6d3ed393d6	The realloc case for memguard(9) will copy too many bytes when reallocating to a smaller-sized allocation. Fix this issue. Noticed by: alc Reviewed by: alc Approved by: zml (mentor) MFC after: 3 weeks	2010-08-31 16:57:58 +00:00
Matthew D Fleming	e3813573bd	Rework memguard(9) to reserve significantly more KVA to detect use-after-free over a longer time. Also release the backing pages of a guarded allocation at free(9) time to reduce the overhead of using memguard(9). Allow setting and varying the malloc type at run-time. Add knobs to allow: - randomly guarding memory - adding un-backed KVA guard pages to detect underflow and overflow - a lower limit on the size of allocations that are guarded Reviewed by: alc Reviewed by: brueffer, Ulrich Spörlein <uqs spoerlein net> (man page) Silence from: -arch Approved by: zml (mentor) MFC after: 1 month	2010-08-11 22:10:37 +00:00
Matthew D Fleming	d7854da193	Add MALLOC_DEBUG_MAXZONES debug malloc(9) option to use multiple uma zones for each malloc bucket size. The purpose is to isolate different malloc types into hash classes, so that any buffer overruns or use-after-free will usually only affect memory from malloc types in that hash class. This is purely a debugging tool; by varying the hash function and tracking which hash class was corrupted, the intersection of the hash classes from each instance will point to a single malloc type that is being misused. At this point inspection or memguard(9) can be used to catch the offending code. Add MALLOC_DEBUG_MAXZONES=8 to -current GENERIC configuration files. The suggestion to have this on by default came from Kostik Belousov on -arch. This code is based on work by Ron Steinke at Isilon Systems. Reviewed by: -arch (mostly silence) Reviewed by: zml Approved by: zml (mentor)	2010-07-28 15:36:12 +00:00
Ed Schouten	60ae52f785	Use ISO C99 integer types in sys/kern where possible. There are only about 100 occurences of the BSD-specific u_int*_t datatypes in sys/kern. The ISO C99 integer types are used here more often.	2010-06-21 09:55:56 +00:00
Brian Somers	f121baaa45	If we're passed garbage in malloc_init(), panic() rather than expecting a KASSERT to handle it. People are likely to turn off INVARIANTS RSN and loading an old module can cause garbage-in here. I saw the issue with an older nvidia driver (x11/nvidia-driver) loading into a new kernel - a crash wasn't seen 'till sysctl_kern_malloc_stats(). I was lucky that mtp->ks_shortdesc was NULL and not something horrible. While I'm here, KASSERT that malloc_uninit() isn't passed something that's not in kmemstatistics. MFC after: 3 weeks	2009-06-05 09:16:52 +00:00
Warner Losh	e678f09a15	Retire kern.vm.kmem.size. It was marked as obsolete prior to 5.2, so it can go.	2009-05-09 19:00:47 +00:00
Robert Watson	bb1c7df80f	struct malloc_type has had a 'magic' field statically initialized to M_MAGIC by MALLOC_DEFINE() for a long time; add assertions that malloc_type's passed to malloc(), free(), etc have that magic set. MFC after: 2 weeks	2009-04-19 12:41:37 +00:00
Ed Schouten	c90c9021e9	Remove even more unneeded variable assignments. kern_time.c: - Unused variable `p'. kern_thr.c: - Variable `error' is always caught immediately, so no reason to initialize it. There is no way that error != 0 at the end of create_thread(). kern_sig.c: - Unused variable `code'. kern_synch.c: - `rval' is always assigned in all different cases. kern_rwlock.c: - `v' is always overwritten with RW_UNLOCKED further on. kern_malloc.c: - `size' is always initialized with the proper value before being used. kern_exit.c: - `error' is always caught and returned immediately. abort2() never returns a non-zero value. kern_exec.c: - `len' is always assigned inside the if-statement right below it. tty_info.c: - `td' is always overwritten by FOREACH_THREAD_IN_PROC(). Found by: LLVM's scan-build	2009-02-26 15:51:54 +00:00
Jeff Roberson	e20a199fd5	- Make the keg abstraction more complete. Permit a zone to have multiple backend kegs so it may source compatible memory from multiple backends. This is useful for cases such as NUMA or different layouts for the same memory type. - Provide a new api for adding new backend kegs to secondary zones. - Provide a new flag for adjusting the layout of zones to stagger allocations better across cache lines. Sponsored by: Nokia	2009-01-25 09:11:24 +00:00
Alan Cox	b89eaf4e9f	Enable the creation of a kmem map larger than 4GB. Submitted by: Tz-Huan Huang Make several variables related to kmem map auto-sizing static. Found by: CScout	2008-07-05 19:34:33 +00:00
Alan Cox	6819e13eeb	Correct an error in the comments for init_param3(). Discussed with: silby	2008-07-04 19:36:58 +00:00
John Birrell	91dd776cd2	Add support for the DTrace malloc provider which can enable probes on a per-malloc type basis.	2008-05-23 00:43:36 +00:00
Alan Cox	3202ed7523	Introduce a new parameter "superpage_align" to kmem_suballoc() that is used to request superpage alignment for the submap. Request superpage alignment for the kmem_map. Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently equivalent but VMFS_ANY_SPACE is the new preferred spelling.) Remove a stale comment from kmem_malloc().	2008-05-10 21:46:20 +00:00
Robert Watson	237fdd787b	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink	2008-03-16 10:58:09 +00:00
Robert Watson	dc2e1e3fae	Use vm_offset_t for kmembase and kmemlimit rather than char *, avoiding unnecessary casts, and making it possible to compile kern_malloc.c with strict aliasing. Submitted by: rdivacky Approved by: re (kensmith)	2007-06-27 13:39:38 +00:00
Robert Watson	3805385e3d	Spell statistics more correctly in comments.	2007-06-14 03:02:33 +00:00
Attilio Rao	2feb50bf7d	Revert VMCNT_* operations introduction. Probabilly, a general approach is not the better solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)	2007-05-31 22:52:15 +00:00
Robert Watson	e4e80aa713	Remove #if 0'd check for 0-size allocations, which if enabled, called kdb_enter().	2007-05-27 13:13:46 +00:00
Jeff Roberson	222d01951f	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>	2007-05-18 07:10:50 +00:00
Stephane E. Potvin	0e5179e441	Add support for specifying a minimal size for vm.kmem_size in the loader via vm.kmem_size_min. Useful when using ZFS to make sure that vm.kmem size will be at least 256mb (for example) without forcing a particular value via vm.kmem_size. Approved by: njl (mentor) Reviewed by: alc	2007-04-21 01:14:48 +00:00
Robert Watson	24076d138e	Increase usefulness of "show malloc" by moving from displaying the basic counters of allocs/frees/use for each malloc type to calculating InUse, MemUse, and Requests as displayed by the userspace vmstat -m. This is more useful when debugging malloc(9)-related memory leaks, where the count of allocs/frees may not usefully reflect that current memory allocation (i.e., when highly variable size allocations occur with the same malloc type, such as with contigmalloc). MFC after: 3 days Limitations observed by: scottl	2006-10-26 10:17:13 +00:00
Robert Watson	4b19d603c4	Remove old kern.malloc sysctl, which generated a text representation of the kernel malloc(9) state for vmstat -m. libmemstat is now used to generate a machine-readable version which is converged by vmstat -m into a human-readable version. Not for MFC.	2006-07-23 19:55:41 +00:00
Robert Watson	0ce3f16dbb	Expand comments for malloc(9) to better describe the design and statistics / memory types model.	2006-07-23 19:51:39 +00:00
Paul Saab	45d48bdad5	Fix bug in malloc_uninit(): Releasing items from the mt_zone can not be done by a simple uma_zfree() call since mt_zone is allocated with the UMA_ZONE_MALLOC flag. Use uma_zfree_arg instead and supply the slab. This bug caused panics in low memory situations on unloading kernel modules containing MALLOC_DEFINE(..) statements. Submitted by: ups	2006-03-03 22:36:52 +00:00
Pawel Jakub Dawidek	847a2a1716	Add buffer corruption protection (RedZone) for kernel's malloc(9). It detects both: buffer underflows and buffer overflows bugs at runtime (on free(9) and realloc(9)) and prints backtraces from where memory was allocated and from where it was freed. Tested by: kris	2006-01-31 11:09:21 +00:00
Pawel Jakub Dawidek	d362c40d3a	Improve memguard a bit: - Provide tunable vm.memguard.desc, so one can specify memory type without changing the code and recompiling the kernel. - Allow to use memguard for kernel modules by providing sysctl vm.memguard.desc, which can be changed to short description of memory type before module is loaded. - Move as much memguard code as possible to memguard.c. - Add sysctl node vm.memguard. and move memguard-specific sysctl there. - Add malloc_desc2type() function for finding memory type based on its short description (ks_shortdesc field). - Memory type can be changed (via vm.memguard.desc sysctl) only if it doesn't exist (will be loaded later) or when no memory is allocated yet. If there is allocated memory for the given memory type, return EBUSY. - Implement two ways of memory types comparsion and make safer/slower the default.	2005-12-30 11:45:07 +00:00
Pawel Jakub Dawidek	619f284195	In realloc(9), determine size of the original block based on UMA_SLAB_MALLOC flag. In some circumstances (I observed it when I was doing a lot of reallocs) UMA_SLAB_MALLOC can be set even if us_keg != NULL. If this is the case we have wonderful, silent data corruption, because less data is copied to the newly allocated region than should be. I'm not sure when this bug was introduced, it could be there undetected for years now, as we don't have a lot of realloc(9) consumers and it was hard to reproduce it... ...but what I know for sure, is that I don't want to know who introduce the bug:) It took me two/three days to track it down (of course most of the time I was looking for the bug in my own code).	2005-12-28 01:53:13 +00:00
Pawel Jakub Dawidek	2a143d5bf5	Detect memory leaks when memory type is being destroyed. This is very helpful for detecting kernel modules memory leaks on unload. Discussed and reviewed by: rwatson	2005-11-03 13:48:59 +00:00
Robert Watson	64a266f9e8	Change format string for u_int64_t to %ju from %llu, in order to use the correct format string on 64-bit systems. Pointed out by: pjd	2005-10-20 21:28:31 +00:00
Robert Watson	909ed16c2b	Add a "show malloc" command to DDB, which prints out the current stats for available kernel malloc types. Quite useful for post-mortem debugging of memory leaks without a dump device configured on a panicked box. MFC after: 2 weeks	2005-10-20 17:41:47 +00:00
Ruslan Ermilov	2319835713	Long overdue, keep up with mbuf.h,v 1.148.	2005-08-02 20:03:23 +00:00
Pawel Jakub Dawidek	73864adbd4	Fix the way how "InUse" column in 'vmstat -m' output works: - increase number of allocations count only on successfull malloc(9), so it doesn't confuse people; - because we need to check if 'size > 0', hide 'mtsp->mts_memalloced += size;' under the check as well, as for size=0 it is of course a no-op; - avoid critical_enter()/critical_exit() in case of failure in malloc_type_allocated() as there will be nothing to do. OK'ed by: rwatson MFC after: 2 days	2005-07-27 23:17:31 +00:00
Robert Watson	4f8721d2a9	Correct build on 64-bit: cast u_int64_t to (unsigned long long) before printfing as (unsigned long long). 32-bit build on i386 didn't notice this. Whoops. Reported by: arved Tested by: sledge	2005-07-14 15:21:18 +00:00
Robert Watson	cd814b2692	Introduce a new sysctl, kern.malloc_stats, which exports kernel malloc statistics via a binary structure stream: - Add structure 'malloc_type_stream_header', which defines a stream version, definition of MAXCPUS used in the stream, and a number of malloc_type records in the stream. - Add structure 'malloc_type_header', which defines the name of the malloc type being reported on. - When the sysctl is queried, return a stream header, followed by a series of type descriptions, each consisting of a type header followed by a series of MAXCPUS malloc_type_stats structures holding per-CPU allocation information. Typical values of MAXCPUS will be 1 (UP compiled kernel) and 16 (SMP compiled kernel). This query mechanism allows user space monitoring tools to extract memory allocation statistics in a machine-readable form, and to do so at a per-CPU granularity, allowing monitoring of allocation patterns across CPUs in order to better understand the distribution of work and memory flow over multiple CPUs. While here: - Bump statistics width to uint64_t, and hard code using fixed-width type in order to be more sure about structure layout in the stream. We allocate and free a lot of memory. - Add kmemcount, a counter of the number of registered malloc types, in order to avoid excessive manual counting of types. Export via a new sysctl to allow user-space code to better size buffers. - De-XXX comment on no longer maintaining the high watermark in old sysctl monitoring code. A follow-up commit of libmemstat(3), a library to monitor kernel memory allocation, will occur in the next few days. Likewise, similar changes to UMA.	2005-07-14 11:52:06 +00:00
Ken Smith	c0cac8dc20	Remove a variable that became unused as a result of changes made in v1.139. This was only exposed if MALLOC_PROFILE was defined. Submitted by: Gary Jennejohn Pointy hat: rwatson Approved by: re (scottl)	2005-06-16 16:01:46 +00:00
Joseph Koshy	8c61b21927	Fix typo. Reviewed by: rwatson, sam	2005-06-10 18:06:59 +00:00
Robert Watson	63a7e0a3f9	Kernel malloc layers malloc_type allocation over one of two underlying allocators: a set of power-of-two UMA zones for small allocations, and the VM page allocator for large allocations. In order to maintain unified statistics for specific malloc types, kernel malloc maintains a separate per-type statistics pool, which can be monitored using vmstat -m. Prior to this commit, each pool of per-type statistics was protected using a per-type mutex associated with the malloc type. This change modifies kernel malloc to maintain per-CPU statistics pools for each malloc type, and protects writing those statistics using critical sections. It also moves to unsynchronized reads of per-CPU statistics when generating coalesced statistics. To do this, several changes are implemented: - In the previous world order, the statistics memory was allocated by the owner of the malloc type structure, allocated statically using MALLOC_DEFINE(). This embedded the definition of the malloc_type structure into all kernel modules. Move to a model in which a pointer within struct malloc_type points at a UMA-allocated malloc_type_internal data structure owned and maintained by kern_malloc.c, and not part of the exported ABI/API to the rest of the kernel. For the purposes of easing a possible MFC, re-use an existing pointer in 'struct malloc_type', and maintain the current malloc_type structure size, as well as layout with respect to the fields reused outside of the malloc subsystem (such as ks_shortdesc). There are several unused fields as a result of no longer requiring the mutex in malloc_type. - Struct malloc_type_internal contains an array of malloc_type_stats, of size MAXCPU. The structure defined above avoids hard-coding a kernel compile-time value of MAXCPU into kernel modules that interact with malloc. - When accessing per-cpu statistics for a malloc type, surround read - modify - update requests with critical_enter()/critical_exit() in order to avoid races during write. The per-CPU fields are written only from the CPU that owns them. - Per-CPU stats now maintained "allocated" and "freed" counters for number of allocations/frees and bytes allocated/freed, since there is no longer a coherent global notion of the totals. When coalescing malloc stats, accept a slight race between reading stats across CPUs, and avoid showing the user a negative allocation count for the type in the event of a race. The global high watermark is no longer maintained for a malloc type, as there is no global notion of the number of allocations. - While tearing up the sysctl() path, also switch to using sbufs. The current "export as text" sysctl format is retained with the same syntax. We may want to change this in the future to export more per-CPU information, such as how allocations and frees are balanced across CPUs. This change results in a substantial speedup of kernel malloc and free paths on SMP, as critical sections (where usable) out-perform mutexes due to avoiding atomic/bus-locked operations. There is also a minor improvement on UP due to the slightly lower cost of critical sections there. The cost of the change to this approach is the loss of a continuous notion of total allocations that can be exploited to track per-type high watermarks, as well as increased complexity when monitoring statistics. Due to carefully avoiding changing the ABI, as well as hardening the ABI against future changes, it is not necessary to recompile kernel modules for this change. However, MFC'ing this change to RELENG_5 will require also MFC'ing optimizations for soft critical sections, which may modify exposed kernel ABIs. The internal malloc API is changed, and modifications to vmstat in order to restore "vmstat -m" on core dumps will follow shortly. Several improvements from: bde Statistics approach discussed with: ups Tested by: scottl, others	2005-05-29 13:38:07 +00:00
Robert Watson	87efd4d58a	Consistently style function declarations in kern_malloc.c. MFC after: 3 days	2005-04-12 23:54:34 +00:00
Bosko Milekic	e4eb384b47	Bring in MemGuard, a very simple and small replacement allocator designed to help detect tamper-after-free scenarios, a problem more and more common and likely with multithreaded kernels where race conditions are more prevalent. Currently MemGuard can only take over malloc()/realloc()/free() for particular (a) malloc type(s) and the code brought in with this change manually instruments it to take over M_SUBPROC allocations as an example. If you are planning to use it, for now you must: 1) Put "options DEBUG_MEMGUARD" in your kernel config. 2) Edit src/sys/kern/kern_malloc.c manually, look for "XXX CHANGEME" and replace the M_SUBPROC comparison with the appropriate malloc type (this might require additional but small/simple code modification if, say, the malloc type is declared out of scope). 3) Build and install your kernel. Tune vm.memguard_divisor boot-time tunable which is used to scale how much of kmem_map you want to allott for MemGuard's use. The default is 10, so kmem_size/10. ToDo: 1) Bring in a memguard(9) man page. 2) Better instrumentation (e.g., boot-time) of MemGuard taking over malloc types. 3) Teach UMA about MemGuard to allow MemGuard to override zone allocations too. 4) Improve MemGuard if necessary. This work is partly based on some old patches from Ian Dowse.	2005-01-21 18:09:17 +00:00
Warner Losh	9454b2d864	/* -> /*- for copyright notices, minor format tweaks as necessary	2005-01-06 23:35:40 +00:00
Dag-Erling Smørgrav	479439b4fe	Turn VM_KMEM_SIZE_MAX and VM_KMEM_SIZE_SCALE into tunables. MFC after: 3 days	2004-09-29 14:21:40 +00:00
Brian Feldman	4362fada8f	Reimplement contigmalloc(9) with an algorithm which stands a greatly- improved chance of working despite pressure from running programs. Instead of trying to throw a bunch of pages out to swap and hope for the best, only a range that can potentially fulfill contigmalloc(9)'s request will have its contents paged out (potentially, not forcibly) at a time. The new contigmalloc operation still operates in three passes, but it could potentially be tuned to more or less. The first pass only looks at pages in the cache and free pages, so they would be thrown out without having to block. If this is not enough, the subsequent passes page out any unwired memory. To combat memory pressure refragmenting the section of memory being laundered, each page is removed from the systems' free memory queue once it has been freed so that blocking later doesn't cause the memory laundered so far to get reallocated. The page-out operations are now blocking, as it would make little sense to try to push out a page, then get its status immediately afterward to remove it from the available free pages queue, if it's unlikely to have been freed. Another change is that if KVA allocation fails, the allocated memory segment will be freed and not leaked. There is a sysctl/tunable, defaulting to on, which causes the old contigmalloc() algorithm to be used. Nonetheless, I have been using vm.old_contigmalloc=0 for over a month. It is safe to switch at run-time to see the difference it makes. A new interface has been used which does not require mapping the allocated pages into KVA: vm_page.h functions vm_page_alloc_contig() and vm_page_release_contig(). These are what vm.old_contigmalloc=0 uses internally, so the sysctl/tunable does not affect their operation. When using the contigmalloc(9) and contigfree(9) interfaces, memory is now tracked with malloc(9) stats. Several functions have been exported from kern_malloc.c to allow other subsystems to use these statistics, as well. This invalidates the BUGS section of the contigmalloc(9) manpage.	2004-07-19 06:21:27 +00:00

1 2 3 4 5 ...

284 Commits