Commit Graph

162 Commits

Jason Evans
baad859d16 Track dirty unused pages so that they can be purged if they exceed a
threshold, according to the 'F' MALLOC_OPTIONS flag.  This obsoletes the
'H' flag.

Try to realloc() large objects in place.  This substantially speeds up
incremental large reallocations in the common case.

Fix a bug in arena_ralloc() that caused relocation of sub-page objects
even if the old and new sizes were in the same size class.

Maintain trees of runs and simplify the per-chunk page map.  This allows
logarithmic-time searching for sufficiently large runs in
arena_run_alloc(), whereas the previous algorithm required linear time
in the worst case.

Break various large functions into smaller sub-functions, and inline
only the functions that are in the fast path for small object
allocation/deallocation.

Remove an unnecessary check in base_pages_alloc_mmap().

Avoid integer division in choose_arena() for the NO_TLS case on
single-CPU systems.
2008-02-06 02:59:54 +00:00
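
Below is a minimal sketch of the threshold-based purging idea described in the commit above. The arena_t fields, the dirty_max knob (derived from the 'F' option), and arena_maybe_purge() are illustrative names, not the actual malloc.c internals; the real code tracks and purges whole dirty runs rather than a single region.

    /* Hypothetical sketch: purge dirty pages once they exceed a threshold. */
    #include <sys/mman.h>
    #include <stddef.h>

    typedef struct {
        size_t ndirty;     /* dirty, unused pages accumulated so far */
        size_t dirty_max;  /* threshold derived from the 'F' option */
    } arena_t;

    static void
    arena_maybe_purge(arena_t *arena, void *pages, size_t npages, size_t pagesize)
    {
        arena->ndirty += npages;
        if (arena->ndirty <= arena->dirty_max)
            return;                 /* below threshold: keep the pages dirty */

        /* Tell the VM system the contents are disposable but keep the mapping.
         * The real code walks all dirty runs in the arena, not just this one. */
        madvise(pages, npages * pagesize, MADV_FREE);
        arena->ndirty -= npages;
    }
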
Jason Evans
f38512f4af Enable both sbrk(2)- and mmap(2)-based memory acquisition methods by
default.  This has the disadvantage of rendering the datasize resource
limit irrelevant, but without this change, legitimate uses of more
memory than will fit in the data segment are thwarted by default.

Fix chunk_alloc_mmap() to work correctly if initial mapping is not
chunk-aligned and mapping extension fails.
2008-01-03 23:22:13 +00:00
Jason Evans
36ac4cc502 Fix a major chunk-related memory leak in chunk_dealloc_dss_record(). [1]
Clean up DSS-related locking and protect all pertinent variables with
dss_mtx (remove dss_chunks_mtx).  This fixes race conditions that could
cause chunk leaks.

Reported by:	[1] kris
2007-12-31 06:19:48 +00:00
Jason Evans
07aa172f11 Fix a bug related to sbrk() calls that could cause address space leaks.
This is a long-standing bug, but until recent changes it was difficult
to trigger, and even then its impact was non-catastrophic, with the
exception of revision 1.157.

Optimize chunk_alloc_mmap() to avoid the need for unmapping pages in the
common case.  Thanks go to Kris Kennaway for a patch that inspired this
change.

Do not maintain a record of previously mmap'ed chunk address ranges.
The original intent was to avoid the extra system call overhead in
chunk_alloc_mmap(), which is no longer a concern.  This also allows some
simplifications for the tree of unused DSS chunks.

Introduce huge_mtx and dss_chunks_mtx to replace chunks_mtx.  There was
no compelling reason to use the same mutex for these disjoint purposes.

Avoid memset() for huge allocations when possible.

Maintain two trees instead of one for tracking unused DSS address
ranges.  This allows scalable allocation of multi-chunk huge objects in
the DSS.  Previously, multi-chunk huge allocation requests failed if the
DSS could not be extended.
2007-12-31 00:59:16 +00:00
Jason Evans
14a7e7b5e1 Back out premature commit of previous version. 2007-12-28 09:21:12 +00:00
Jason Evans
03947063d0 Maintain two trees instead of one (old_chunks --> old_chunks_{ad,szad}) in
order to support re-use of multi-chunk unused regions within the DSS for
huge allocations.  This generalization is important for correct operation
when mmap-based allocation is disabled.

Avoid zeroing re-used memory in the DSS unless it really needs to be
zeroed.
2007-12-28 07:24:19 +00:00
Jason Evans
3762647250 Release chunks_mtx for all paths through chunk_dealloc().
Reported by:	kris
2007-12-28 02:15:08 +00:00
Jason Evans
ebc87e7e0b Add the 'D' and 'M' run time options, and use them to control whether
memory is acquired from the system via sbrk(2) and/or mmap(2).  By default,
use sbrk(2) only, in order to support traditional use of resource limits.
Additionally, when both options are enabled, prefer the data segment to
anonymous mappings, in order to coexist better with large file mappings
in applications on 32-bit platforms.  This change has the potential to
increase memory fragmentation due to the linear nature of the data
segment, but from a performance perspective this is mitigated by the use
of madvise(2). [1]

Add the ability to interpret integer prefixes in MALLOC_OPTIONS
processing.  For example, MALLOC_OPTIONS=lllllllll can now be specified as
MALLOC_OPTIONS=9l.

Reported by:	[1] rwatson
Design review:	[1] alc, peter, rwatson
2007-12-27 23:29:44 +00:00
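
A sketch of how an integer prefix can be folded into option parsing so that "9l" behaves like "lllllllll"; parse_options(), handle_option(), and the opt_lazy_free counter are stand-ins, not the actual malloc_options handling.

    #include <ctype.h>

    static unsigned opt_lazy_free;      /* example knob bumped by each 'l' */

    static void
    handle_option(char opt)
    {
        if (opt == 'l')
            opt_lazy_free++;            /* stand-in for the real flag handling */
    }

    static void
    parse_options(const char *opts)
    {
        while (*opts != '\0') {
            unsigned long nreps = 0;

            while (isdigit((unsigned char)*opts)) {   /* optional integer prefix */
                nreps = nreps * 10 + (unsigned long)(*opts - '0');
                opts++;
            }
            if (nreps == 0)
                nreps = 1;              /* no prefix: apply the option once */
            if (*opts == '\0')
                break;                  /* trailing digits with no option letter */
            while (nreps-- > 0)
                handle_option(*opts);
            opts++;
        }
    }
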
Jason Evans
a0a474aed6 Use fixed point integer math instead of floating point math when
calculating run sizes.  Use of the floating point unit was a potential
pessimization to context switching for applications that do not otherwise
use floating point math. [1]

Reformat cpp macro-related comments to improve consistency.

Submitted by:	das
2007-12-18 05:27:57 +00:00
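
The following is an illustrative fixed-point replacement for a floating-point overhead test of the kind the run-size calculation needs; the 12-bit binary fixed point and the 0.5% bound (a figure quoted elsewhere in this log) are assumptions, not the constants actually committed.

    #include <stddef.h>

    #define BFP       12   /* binary fixed point: 12 fractional bits */
    #define MAX_OVRHD 20   /* ~0.5% in 12-bit fixed point (0.005 * 4096, rounded down) */

    static int
    run_overhead_ok(size_t hdr_size, size_t run_size)
    {
        /* Equivalent to (double)hdr_size / run_size <= 0.005, without the FPU. */
        return ((hdr_size << BFP) / run_size) <= MAX_OVRHD;
    }
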
Jason Evans
d55bd6236f Refactor features a bit in order to make it possible to disable lazy
deallocation and dynamic load balancing via the MALLOC_LAZY_FREE and
MALLOC_BALANCE knobs.  This is a non-functional change, since these
features are still enabled when possible.

Clean up a few things that more pedantic compiler settings would cause
complaints over.
2007-12-17 01:20:04 +00:00
Jason Evans
7e42e29b9b Only zero large allocations when necessary (for calloc()). 2007-11-28 00:17:34 +00:00
Jason Evans
5ea8413d0a Implement dynamic load balancing of thread-->arena mapping, based on lock
contention.  The intent is to dynamically adjust to load imbalances, which
can cause severe contention.

Use pthread mutexes where possible instead of libc "spinlocks" (they aren't
actually spin locks).  Conceptually, this change is meant only to support
the dynamic load balancing code by enabling the use of spin locks, but it
has the added apparent benefit of substantially improving performance due to
reduced context switches when there is moderate arena lock contention.

Proper tuning parameter configuration for this change is a finicky business,
and it is very much machine-dependent.  One seemingly promising solution
would be to run a tuning program during operating system installation that
computes appropriate settings for load balancing.  (The pthreads adaptive
spin locks should probably be similarly tuned.)
2007-11-27 03:17:30 +00:00
Jason Evans
26b5e3a18e Implement lazy deallocation of small objects. For each arena, maintain a
vector of slots for lazily freed objects.  For each deallocation, before
doing the hard work of locking the arena and deallocating, try several times
to randomly insert the object into the vector using atomic operations.

This approach is particularly effective at reducing contention for
multi-threaded applications that use the producer-consumer model, wherein
one producer thread allocates objects, then multiple consumer threads
deallocate those objects.
2007-11-27 03:13:15 +00:00
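
A minimal sketch of the lazy-free fast path described above, using C11 atomics in place of the libc atomic primitives; the slot count, probe count, and rand() as the randomness source are assumptions.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>

    #define LAZY_FREE_NSLOTS  256
    #define LAZY_FREE_NPROBES 3

    static _Atomic(void *) lazy_free_slots[LAZY_FREE_NSLOTS];

    /* Returns true if the pointer was parked in a slot, false if the caller
     * must take the arena lock and deallocate it the hard way. */
    static bool
    lazy_free_try(void *ptr)
    {
        for (int i = 0; i < LAZY_FREE_NPROBES; i++) {
            unsigned slot = (unsigned)rand() % LAZY_FREE_NSLOTS;
            void *expected = NULL;

            if (atomic_compare_exchange_strong(&lazy_free_slots[slot],
                &expected, ptr))
                return true;    /* a later, locked pass flushes parked slots */
        }
        return false;
    }
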
Jason Evans
bcd3523138 Avoid re-zeroing memory in calloc() when possible. 2007-11-27 03:12:15 +00:00
Jason Evans
1bbd1b8613 Fix stats printing of the amount of memory currently consumed by huge
allocations. [1]

Fix calculation of the number of arenas when 'n' is specified via
MALLOC_OPTIONS.

Clean up various style inconsistencies.

Obtained from:	[1] NetBSD
2007-11-27 03:09:23 +00:00
Jason Evans
76507741ab Fix junk/zero filling for realloc(). Junk filling was missing in one case,
and zero filling was broken in a way that could cause memory corruption.

Update comments.
2007-06-15 22:00:16 +00:00
Jason Evans
d33f4690ba Use size_t instead of unsigned for pagesize-related values, in order to
avoid downcasting issues.  In particular, this change fixes
posix_memalign(3) for alignments greater than 2^31 on LP64 systems.

Make sure that NDEBUG is always set to be compatible with MALLOC_DEBUG. [1]

Reported by:	[1] Lee Hyo geol <hyogeollee@gmail.com>
2007-03-29 21:07:17 +00:00
Jason Evans
eaf8d73212 Remove the run promotion/demotion machinery. Replace it with red-black
trees that track all non-full runs for each bin.  Use the red-black
trees to be able to guarantee that each new allocation is placed in the
lowest address available in any non-full run.  This change completes the
transition to allocating from low addresses in order to reduce the
retention of sparsely used chunks.

If the run in current use by a bin becomes empty, deallocate the run
rather than retaining it for later use.  The previous behavior had the
tendency to spread empty runs across multiple chunks, thus preventing
the release of chunks that were completely unused.

Generalize base_chunk_alloc() (and rename it to base_pages_alloc()) to
handle allocation sizes larger than the chunk size, so that it is
possible to support chunk sizes that are smaller than an arena object.

Reduce the minimum chunk size from 64kB to 8kB.

Optimize tracking of addresses for deleted chunks.

Fix a statistics bug for huge allocations.
2007-03-28 19:55:07 +00:00
Jason Evans
12fbf47cfb Fix some subtle bugs for posix_memalign() having to do with integer
rounding and overflow.  Carefully document what the various overflow
tests actually detect.

The bugs mostly canceled out, such that the worst possible failure
cases resulted in non-fatal over-allocations.
2007-03-24 20:44:06 +00:00
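
A sketch of the kind of argument and overflow checks involved; the exact tests in malloc.c differ, this only illustrates the failure modes being guarded against.

    #include <errno.h>
    #include <stddef.h>

    static int
    memalign_check(size_t alignment, size_t size)
    {
        /* Alignment must be a power of two and at least sizeof(void *). */
        if (((alignment - 1) & alignment) != 0 || alignment < sizeof(void *))
            return EINVAL;

        /* Rounding size up to a multiple of alignment must not wrap to zero. */
        if (size + alignment < size)
            return ENOMEM;      /* request so large the rounding overflows */

        return 0;
    }
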
Jason Evans
e3da012f00 Fix posix_memalign() for large objects. Now that runs are extents rather
than binary buddies, the alignment guarantees are weaker, which requires
a more complex aligned allocation algorithm, similar to that used for
alignment greater than the chunk size.

Reported by:	matteo
2007-03-23 22:58:15 +00:00
Jason Evans
bb99793a2b Use extents rather than binary buddies to track free pages within
chunks.  This allows runs to be any multiple of the page size.  The
primary advantage is that large objects are no longer constrained to be
2^n pages, which can dramatically decrease internal fragmentation for
large objects.  This also allows the sizes for runs that back small
objects to be more finely tuned.

Free runs are searched for linearly using the chunk page map (with the
help of some heuristic optimizations).  This changes the allocation
policy from "first best fit" to "first fit".  A prototype red-black tree
implementation for tracking free runs that implemented "first best fit"
did not cause a measurable speed or memory usage difference for
realistic chunk sizes (though of course it is possible to construct
benchmarks that favor one allocation policy over another).

Refine the handling of fullness constraints for small runs to be more
tunable.

Restructure the per chunk page map to contain only two fields per entry,
rather than four.  Also, increase each entry from 4 to 8 bytes, since it
allows for 32-bit integers, without increasing the number of chunk
header pages.

Relax the maximum chunk size constraint.  This is of no practical
interest; it is merely fallout from the chunk page map restructuring.

Revamp statistics gathering and reporting to be faster, clearer and more
informative.  Statistics gathering is fast enough now to have little
to no impact on application speed, but it still requires approximately
two extra pages of memory per arena (per process).  This memory overhead
may be acceptable for most systems, but we still need to leave
statistics gathering disabled by default in RELENG branches.

Rename NO_MALLOC_EXTRAS to MALLOC_PRODUCTION in order to make its intent
clearer (i.e. it should be defined in RELENG branches).
2007-03-23 05:05:48 +00:00
Jason Evans
c9f0c8fd74 Avoid using vsnprintf(3) unless MALLOC_STATS is defined, in order to
avoid substantial potential bloat for static binaries that do not
otherwise use any printf(3)-family functions. [1]

Rearrange arena_run_t so that the region bitmask can be minimally sized
according to constraints related to each bin's size class.  Previously,
the region bitmask was the same size for all run headers, which wasted
a measurable amount of memory.

Rather than making runs for small objects as large as possible, make
runs as small as possible such that header overhead stays below a
certain bound.  There are two exceptions that override the header
overhead bound:

	1) If the bound is impossible to honor, it is relaxed on a
	   per-size-class basis.  Since there is one bit of header
	   overhead per object (plus a constant), it is impossible to
	   achieve a header overhead less than or equal to 1/(# of bits
	   per object).  For the current setting of maximum 0.5% header
	   overhead, this relaxation comes into play for {2, 4, 8,
	   16}-byte objects, for which header overhead is (on 64-bit
	   systems) {7.1, 4.3, 2.2, 1.2}%, respectively.

	2) There is still a cap on small run size, still set to 64kB.
	   This comes into play for {1024, 2048}-byte objects, for which
	   header overhead is {1.6, 3.1}%, respectively.

In practice, this reduces the run sizes, which lowers the worst case
low-water memory usage caused by fragmentation.  It also reduces
worst case high-water run fragmentation due to non-full runs, but this
is only a constant improvement (most important to small short-lived
processes).

Reduce the default chunk size from 2MB to 1MB.  Benchmarks indicate that
the external fragmentation reduction makes 1MB the new sweet spot (as
small as possible without adversely affecting performance).

Reported by:	[1] kientzle
2007-03-20 03:44:10 +00:00
Jason Evans
a326064e24 Modify chunk_alloc() to prefer mmap()ed memory over sbrk()ed memory.
This has no impact unless USE_BRK is defined (32-bit platforms), in
which case user allocations are allocated via mmap() if at all possible,
in order to avoid the possibility of unreclaimable chunks in the data
segment.

Fix an obscure bug in base_alloc() that could have allowed undefined
behavior if an application were to use sbrk() in conjunction with a
USE_BRK-enabled malloc.
2007-02-22 19:10:30 +00:00
Jason Evans
38cc6e0a82 Fix a utrace(2)-related bug in calloc(3).
Integrate various pedantic cleanups.

Submitted by:	Andrew Doran <ad@netbsd.org>
2007-01-31 22:54:19 +00:00
Jason Evans
ee0ab7cd86 Implement chunk allocation/deallocation hysteresis by caching one spare
chunk per arena, rather than immediately deallocating all unused chunks.
This fixes a potential performance issue when allocating/deallocating
an object of size (4kB..1MB] in a loop.

Reported by:	davidxu
2006-12-23 00:18:51 +00:00
Jason Evans
820e03699c Change the way base allocation is done for internal malloc data
structures, in order to avoid the possibility of attempted recursive
lock acquisition for chunks_mtx.

Reported by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
2006-09-08 17:52:15 +00:00
Marcel Moolenaar
ce2dfbd199 Enable TLS on PowerPC. 2006-09-01 19:14:14 +00:00
Marcel Moolenaar
bc14049e96 Enable TLS on ia64. 2006-09-01 06:18:43 +00:00
Colin Percival
e981a4e863 Correctly handle the case in calloc(num, size) where
(size_t)(num * size) == 0
but both num and size are nonzero.

Reported by:	Ilja van Sprundel
Approved by:	jasone
Security:	Integer overflow; calloc was allocating 1 byte in
		response to a request for a multiple of 2^32 (or 2^64)
		bytes instead of returning NULL.
2006-08-13 21:54:47 +00:00
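
A sketch of the overflow check the fix implies: detect num * size wrapping around before allocating. This is illustrative, not the exact test committed.

    #include <stddef.h>
    #include <stdint.h>

    /* Returns 1 and stores num * size in *total if the product fits in a
     * size_t; returns 0 on overflow so the caller can fail with NULL/ENOMEM. */
    static int
    calloc_size_ok(size_t num, size_t size, size_t *total)
    {
        if (num == 0 || size == 0) {
            *total = 0;
            return 1;
        }
        if (num > SIZE_MAX / size)
            return 0;           /* num * size would wrap around zero */
        *total = num * size;
        return 1;
    }
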
Marcel Moolenaar
5011eea82f Define NO_TLS on PowerPC.
See also: PR ia64/91846
2006-08-09 19:01:27 +00:00
Jason Evans
b3dcb52814 Conditionally expand the size_invs lookup table in arena_run_reg_dalloc()
so that architectures with a quantum of 8 (rather than 16) work.

Restore arm's quantum to 8.

Submitted by:	jmg
2006-07-27 19:09:32 +00:00
Olivier Houchard
4cfa5e0135 Use 4 as QUANTUM_2POW_MIN on arm as it is on any other architecture, to avoid
triggering an assertion later.
2006-07-27 14:36:28 +00:00
Jason Evans
b8f9774731 Fix cpp logic in arena_malloc() to adjust size when assertions are enabled,
even if stats gathering is disabled. [1]

Remove 'size' parameter from several functions that do not use it.

Reported by:	[1] ache
2006-07-27 04:00:12 +00:00
Jason Evans
5355c74026 Use some math tricks in arena_run_reg_dalloc() to avoid actual division, as
well as avoiding a switch statement.  This change has no significant impact
to performance when branch prediction is successful at predicting the sizes
of objects passed to free(), but in the case that the object sizes are
semi-random, this change has the potential to prevent many branch prediction
misses, thus improving performance substantially.

Take advantage of alignment guarantees in ipalloc(), and pad object sizes to
something less than a power of two when possible.  This has the potential
to substantially reduce internal fragmentation for objects allocated via
posix_memalign().

Avoid an unnecessary pow2_ceil() call in arena_ralloc().

Submitted by:	djam8193ah@hotmail.com
2006-07-01 16:51:10 +00:00
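
One standard form of the division-avoidance trick is reciprocal multiplication: region index = offset / size computed as (offset * inv) >> SHIFT with a precomputed, scaled inverse per size class. The shift width and helper names below are assumptions, not the actual size_invs table.

    #include <stdint.h>

    #define INV_SHIFT 21

    /* inv(size) = floor(2^INV_SHIFT / size) + 1; in practice these are
     * precomputed table entries, one per small size class. */
    static uint32_t
    size_inv(uint32_t size)
    {
        return ((uint32_t)1 << INV_SHIFT) / size + 1;
    }

    static uint32_t
    region_index(uint32_t offset, uint32_t inv)
    {
        /* Exact for offsets well below 2^INV_SHIFT bytes, i.e. any offset
         * within a run; avoids a hardware divide on the free() fast path. */
        return (uint32_t)(((uint64_t)offset * inv) >> INV_SHIFT);
    }

For example, region_index(3 * 48, size_inv(48)) evaluates to 3, matching 144 / 48.
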
Jason Evans
00d8242c2b Make the behavior of malloc(0) standards-compliant by getting rid of nil,
and instead creating a small allocation for each malloc(0) call.  The
optional SysV compatibility behavior remains unchanged.

Add a couple of assertions.

Fix a couple of typos in error message strings.
2006-06-30 20:54:15 +00:00
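
A sketch of the malloc(0) policy described above: under the SysV-compatibility option return NULL, otherwise bump the request to the smallest size class so every zero-byte allocation yields a distinct, freeable pointer. opt_sysv and imalloc() are stand-ins for the real internals.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    static bool opt_sysv = false;   /* SysV-compatibility knob (assumed name) */

    static void *
    imalloc(size_t size)            /* stand-in for the internal allocation path */
    {
        return malloc(size);
    }

    void *
    malloc_zero_aware(size_t size)
    {
        if (size == 0) {
            if (opt_sysv)
                return NULL;        /* optional SysV-compatible behavior */
            size = 1;               /* smallest size class: unique pointer */
        }
        return imalloc(size);
    }
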
Jason Evans
0fc8aff0c4 Add a missing case for the switch statement in arena_run_reg_dalloc(). [1]
Fix a leak in chunk_dealloc(). [2]

Reported by:	[1] djam8193ah@hotmail.com,
		[2] Ville-Pertti Keinonen <will@exomi.com>
2006-06-20 20:38:25 +00:00
Jason Evans
3212b810d8 Increase the minimum chunk size by a power of two (32kB --> 64kB, assuming
4kB pages), in order to avoid dangerous rounding error when calculating
fullness limits during run promotion/demotion.

Convert a structure bitfield to a normal field in arena_run_t.  This should
have been changed along with the other fields in revision 1.120.
2006-05-10 00:07:45 +00:00
Jason Evans
f7768b9f34 Change the semantics of brk_max to dynamically deal with data segment
bounds. [1]

Modify logic for utilizing the data segment, such that it is possible to
create huge allocations there.

Shrink the data segment when deallocating a chunk, if it is at the end of
the data segment.

Rename chunk_size to csize in huge_malloc(), in order to avoid masking a
static variable of the same name. [1]

Reported by:	Paul Allen <nospam@ugcs.caltech.edu>
2006-04-27 01:03:00 +00:00
Jason Evans
f90cbdf17f Add an unreachable return statement, in order to avoid a compiler warning
for non-standard optimization levels.

Reported by:	Michael Zach <zach@webges.com>
2006-04-05 18:46:24 +00:00
Jason Evans
50ff9670e2 Only initialize the first per-chunk page map element for free runs. This
makes run split/coalesce operations of complexity lg(n) rather than n.
2006-04-05 04:15:12 +00:00
Jason Evans
cf01f0d7c5 Add init_lock, and use it to protect against allocator initialization
races.  This isn't currently necessary for libpthread or libthr, but
without it external threads libraries like the linuxthreads port are
not safe to use.

Reported by:	ganbold@micom.mng.net
2006-04-04 19:46:28 +00:00
Jason Evans
1c6d5bde6c Refactor per-run bitmap manipulation functions so that bitmap offsets only
have to be calculated once per allocator operation.

Make nil const.

Update various comments.

Remove/avoid division where possible.

For the one division operation that remains in the critical path, add a
switch statement that has a case for each small size class, and do division
with a constant divisor in each case.  This allows the compiler to generate
optimized code that does not use hardware division [1].

Obtained from:	peter [1]
2006-04-04 03:51:47 +00:00
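
A sketch of the constant-divisor switch mentioned above: each case divides by a literal, so the compiler can strength-reduce the division to multiply/shift sequences. The size classes listed are illustrative, not the full set.

    #include <stddef.h>

    static size_t
    region_number(size_t offset, size_t region_size)
    {
        switch (region_size) {
        case 16:  return offset / 16;   /* compiles to a shift */
        case 32:  return offset / 32;
        case 48:  return offset / 48;   /* compiles to multiply + shift */
        case 64:  return offset / 64;
        case 96:  return offset / 96;
        default:  return offset / region_size;  /* rare path: hardware divide */
        }
    }
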
Jason Evans
cd70100e5d Optimize runtime performance, primarily using the following techniques:
  * Avoid choosing an arena until it's certain that an arena is needed
    for allocation.

  * Convert division/multiplication to bitshifting where possible.

  * Avoid accessing TLS variables in single-threaded code.

  * Reduce the amount of pointer dereferencing.

  * Move lock acquisition in critical paths to only protect the code
    that requires synchronization, and completely remove locking where
    possible.
2006-03-30 20:25:52 +00:00
Jason Evans
6b2c15da6a Add malloc_usable_size(3).
Discussed with:		arch@
2006-03-28 22:16:04 +00:00
Jason Evans
9f9bc9367c Allow the 'n' option to decrease the number of arenas below the default,
to as little as one arena.  Also, limit the number of arenas to avoid a
potential invariant violation in base_alloc().
2006-03-26 23:41:35 +00:00
Jason Evans
4328edf534 Add comments and reformat/rearrange code. There are no significant
functional changes in this commit.
2006-03-26 23:37:25 +00:00
Jason Evans
0c21f9eda7 Convert TINY_MIN_2POW from a cpp macro to tiny_min_2pow (a variable), and
determine its value at run time according to other relevant values.  This
avoids the creation of runs that are incompletely utilized, as long as
pagesize isn't too large (>32kB, given the current RUN_MIN_REGS_2POW
setting).

Increase the size of several structure bitfields in arena_run_t in order
to avoid integer overflow in the case that a run's header does not overlap
with the space that is usable as application allocation regions.  Given
the tiny_min_2pow change, this fix has no additional impact unless
pagesize is >32kB.

Reported by:	kris
2006-03-24 22:13:49 +00:00
Jason Evans
efafcfa7fb Add USE_BRK-specific code in malloc_init_hard() to allow the first
internally used chunk to start at the beginning of the heap, rather
than at a chunk-aligned address.  This reduces mapped memory somewhat
for 32-bit architectures.

Add the arena_run_link_t type and use it wherever a run object is only
used as a ring 'header'.  This saves approximately 40 kB of memory per
arena.

Remove an obsolete (no longer used) code path from base_alloc(), which
supported the internal allocation of objects larger than the chunk
size.

Enhance chunk_dealloc() to cache chunk addresses for all deallocated
chunks.  This has no impact for most programs, but has the potential
to reduce VM map fragmentation for programs that use huge
allocations.
2006-03-24 00:28:08 +00:00
Jason Evans
c07ee180bc Separate completely full runs from runs that are merely almost full, so
that no linear searching is necessary if we resort to allocating from a
run that is known to be mostly full.  There are pathological edge cases
that could have caused severely degraded performance, and this change
fixes that.
2006-03-20 04:05:05 +00:00
Jason Evans
bd6a7799c4 Optimize realloc() to reallocate in place if the old and new sizes are
close enough to each other that reallocation would allocate a new region
of the same size.  This improves the performance of repeated incremental
reallocations by up to three orders of magnitude. [1]

Fix arena_new() to properly constrain run size if a small chunk size was
specified during runtime configuration.

Suggested by:	se [1]
2006-03-19 18:28:06 +00:00
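
A minimal sketch of the in-place shortcut: if the old and new requests round to the same internal region size, keep the existing allocation. size_class_of() is a toy rounding function, and the real realloc() derives the old size from allocator metadata rather than taking it as a parameter.

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    static size_t
    size_class_of(size_t size)
    {
        /* Toy rounding: next multiple of 16; the real size classes are richer. */
        return (size + 15) & ~(size_t)15;
    }

    void *
    realloc_sketch(void *ptr, size_t old_size, size_t new_size)
    {
        if (ptr != NULL && size_class_of(old_size) == size_class_of(new_size))
            return ptr;             /* same region size: nothing to move */

        void *ret = malloc(new_size);
        if (ret != NULL && ptr != NULL) {
            memcpy(ret, ptr, old_size < new_size ? old_size : new_size);
            free(ptr);
        }
        return ret;
    }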