Add thread-specific caching for small size classes, based on magazines.

This caching allows completely lock-free allocation/deallocation in the
steady state, at the likely expense of increased memory use and
fragmentation.
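
As an illustration of the mechanism (not the code in malloc.c; mag_t, the
field names, and the helper functions below are a hypothetical sketch),
each thread keeps a per-size-class magazine of cached objects, and only an
empty or full magazine forces a locked exchange with the owning arena:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
            unsigned        nrounds;        /* Objects currently cached. */
            unsigned        nrounds_max;    /* Magazine capacity (see the R option). */
            void            *rounds[1];     /* Cached objects; allocated oversized. */
    } mag_t;

    /* Fast path: pop a cached object with no locking. */
    static void *
    mag_alloc(mag_t *mag)
    {
            if (mag->nrounds == 0)
                    return (NULL);  /* Empty; caller refills from the arena (locked). */
            mag->nrounds--;
            return (mag->rounds[mag->nrounds]);
    }

    /* Fast path: push a freed object with no locking. */
    static bool
    mag_dalloc(mag_t *mag, void *ptr)
    {
            if (mag->nrounds == mag->nrounds_max)
                    return (false); /* Full; caller flushes to the arena (locked). */
            mag->rounds[mag->nrounds] = ptr;
            mag->nrounds++;
            return (true);
    }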

Reduce the default number of arenas to 2*ncpus, since thread-specific
caching typically reduces arena contention.
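
Illustrative only: the default arena count described above, computed here
with sysconf() purely to keep the sketch portable (the allocator obtains
the CPU count through its own platform-specific mechanism).

    #include <unistd.h>

    static unsigned
    narenas_default(void)
    {
            long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

            /* Two arenas per CPU, or one on a single-CPU system. */
            return (ncpus > 1 ? (unsigned)ncpus * 2 : 1);
    }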

Modify size class spacing to include ranges of 2^n-spaced, quantum-spaced,
cacheline-spaced, and subpage-spaced size classes.  The advantages are:
fewer size classes, reduced false cacheline sharing, and reduced internal
fragmentation for allocations that are slightly over 512, 1024, etc.
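
A hypothetical helper showing the four spacing regimes with the default
boundaries documented in malloc.3 (quantum 16, quantum-spaced up to 128,
cacheline-spaced up to 512, subpage-spaced above that); the real size
classes are derived from run-time tunables, so treat this as a sketch only.

    #include <stddef.h>

    static size_t
    round_size(size_t size)
    {
            if (size <= 8)          /* Tiny: powers of two (2, 4, 8). */
                    return (size <= 2 ? 2 : size <= 4 ? 4 : 8);
            if (size <= 128)        /* Quantum-spaced (multiples of 16). */
                    return ((size + 15) & ~(size_t)15);
            if (size <= 512)        /* Cacheline-spaced (multiples of 64). */
                    return ((size + 63) & ~(size_t)63);
            /* Subpage-spaced (multiples of 256), up to just under one page. */
            return ((size + 255) & ~(size_t)255);
    }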

Increase RUN_MAX_SMALL, in order to limit fragmentation for the
subpage-spaced size classes.
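
To illustrate (assuming a 4 KiB page and ignoring run headers): a
hypothetical 3840-byte subpage-spaced class fits only one region per
single-page run and wastes 256 bytes (over 6%), whereas a 16-page run holds
17 such regions and wastes the same 256 bytes out of 64 KiB (well under 1%).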

Add a size-->bin lookup table for small sizes to simplify translating sizes
to size classes.  Include a hard-coded constant table that is used unless
custom size class spacing is specified at run time.
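
A sketch of how such a table can be laid out; the constants assume the
default quantum/cacheline spacing, and the run-time construction below is
only to show the mapping (the hard-coded constant table in malloc.c is not
reproduced here).

    #include <stddef.h>
    #include <stdint.h>

    #define TINY_MAX        8       /* Largest tiny (power-of-two) class. */
    #define QUANTUM         16      /* Quantum spacing. */
    #define QSPACE_MAX      128     /* Largest quantum-spaced class. */
    #define CACHELINE       64      /* Cacheline spacing. */
    #define CSPACE_MAX      512     /* Largest cacheline-spaced class. */

    static uint8_t size2bin[CSPACE_MAX + 1];

    static void
    size2bin_init(void)
    {
            size_t i, next = 2;     /* Smallest size class. */
            unsigned bin = 0;

            for (i = 1; i <= CSPACE_MAX; i++) {
                    if (i > next) {
                            bin++;
                            if (next < TINY_MAX)
                                    next <<= 1;             /* Tiny: 2, 4, 8. */
                            else if (next < QUANTUM)
                                    next = QUANTUM;         /* First quantum class. */
                            else if (next < QSPACE_MAX)
                                    next += QUANTUM;        /* Quantum-spaced. */
                            else
                                    next += CACHELINE;      /* Cacheline-spaced. */
                    }
                    size2bin[i] = (uint8_t)bin;
            }
    }

A lookup is then a single load, size2bin[size], for sizes up to CSPACE_MAX;
larger small sizes can fall through to subpage-spaced arithmetic.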

Add the ability to disable tiny size classes at compile time via
MALLOC_TINY.
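
Sketch of the compile-time switch; only the MALLOC_TINY name comes from
this commit, and QUANTUM plus the macro layout are illustrative.  With
MALLOC_TINY undefined, the smallest size class is the quantum, so
sub-quantum requests round up to it instead of to a power of two.

    #define QUANTUM         16
    #ifdef MALLOC_TINY
    #define SMALL_MIN       2               /* Tiny classes 2, 4, 8 exist. */
    #else
    #define SMALL_MIN       QUANTUM         /* Tiny classes compiled out. */
    #endif
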
Jason Evans 2008-08-27 02:00:53 +00:00
parent 55b418d32b
commit d6742bfbd3
Notes: svn2git 2020-12-20 02:59:44 +00:00
svn path=/head/; revision=182225
5 changed files with 1141 additions and 254 deletions

View File

@ -157,6 +157,12 @@ void _set_tp(void *tp);
*/
extern const char *__progname;
/*
* This function is used by the threading libraries to notify malloc that a
* thread is exiting.
*/
void _malloc_thread_cleanup(void);
/*
* These functions are used by the threading libraries in order to protect
* malloc across fork().

View File

@ -93,6 +93,7 @@ FBSD_1.0 {
};
FBSDprivate_1.0 {
_malloc_thread_cleanup;
_malloc_prefork;
_malloc_postfork;
__system;

View File

@ -32,7 +32,7 @@
.\" @(#)malloc.3 8.1 (Berkeley) 6/4/93
.\" $FreeBSD$
.\"
.Dd February 17, 2008
.Dd August 26, 2008
.Dt MALLOC 3
.Os
.Sh NAME
@ -154,7 +154,7 @@ should not be depended on, since such behavior is entirely
implementation-dependent.
.Sh TUNING
Once, when the first call is made to one of these memory allocation
routines, various flags will be set or reset, which affect the
routines, various flags will be set or reset, which affects the
workings of this allocator implementation.
.Pp
The
@ -196,6 +196,11 @@ it should be to contention over arenas.
Therefore, some applications may benefit from increasing or decreasing this
threshold parameter.
This option is not available for some configurations (non-PIC).
.It C
Double/halve the size of the maximum size class that is a multiple of the
cacheline size (64).
Above this size, subpage spacing (256 bytes) is used for size classes.
The default value is 512 bytes.
.It D
Use
.Xr sbrk 2
@ -214,6 +219,16 @@ physical memory becomes scarce and the pages remain unused.
The default is 512 pages per arena;
.Ev MALLOC_OPTIONS=10f
will prevent any dirty unused pages from accumulating.
.It G
When there are multiple threads, use thread-specific caching for objects that
are smaller than one page.
This option is enabled by default.
Thread-specific caching allows many allocations to be satisfied without
performing any thread synchronization, at the cost of increased memory use.
See the
.Dq R
option for related tuning information.
This option is not available for some configurations (non-PIC).
.It J
Each byte of new memory allocated by
.Fn malloc ,
@ -248,7 +263,7 @@ option is implicitly enabled in order to assure that there is a method for
acquiring memory.
.It N
Double/halve the number of arenas.
The default number of arenas is four times the number of CPUs, or one if there
The default number of arenas is two times the number of CPUs, or one if there
is a single CPU.
.It P
Various statistics are printed at program exit via an
@ -259,14 +274,18 @@ while one or more threads are executing in the memory allocation functions.
Therefore, this option should only be used with care; it is primarily intended
as a performance tuning aid during application development.
.It Q
Double/halve the size of the allocation quantum.
The default quantum is the minimum allowed by the architecture (typically 8 or
16 bytes).
.It S
Double/halve the size of the maximum size class that is a multiple of the
quantum.
Above this size, power-of-two spacing is used for size classes.
The default value is 512 bytes.
quantum (8 or 16 bytes, depending on architecture).
Above this size, cacheline spacing is used for size classes.
The default value is 128 bytes.
.It R
Double/halve magazine size, which approximately doubles/halves the number of
rounds in each magazine.
Magazines are used by the thread-specific caching machinery to acquire and
release objects in bulk.
Increasing the magazine size decreases locking overhead, at the expense of
increased memory usage.
This option is not available for some configurations (non-PIC).
.It U
Generate
.Dq utrace
@ -358,6 +377,13 @@ improve performance, mainly due to reduced cache performance.
However, it may make sense to reduce the number of arenas if an application
does not make much use of the allocation functions.
.Pp
In addition to multiple arenas, this allocator supports thread-specific
caching for small objects (smaller than one page), in order to make it
possible to completely avoid synchronization for most small allocation requests.
Such caching allows very fast allocation in the common case, but it increases
memory usage and fragmentation, since a bounded number of objects can remain
allocated in each thread cache.
.Pp
Memory is conceptually broken into equal-sized chunks, where the chunk size is
a power of two that is greater than the page size.
Chunks are always aligned to multiples of the chunk size.
@ -366,7 +392,7 @@ quickly.
.Pp
User objects are broken into three categories according to size: small, large,
and huge.
Small objects are no larger than one half of a page.
Small objects are smaller than one page.
Large objects are smaller than the chunk size.
Huge objects are a multiple of the chunk size.
Small and large objects are managed by arenas; huge objects are managed
@ -378,23 +404,24 @@ Each chunk that is managed by an arena tracks its contents as runs of
contiguous pages (unused, backing a set of small objects, or backing one large
object).
The combination of chunk alignment and chunk page maps makes it possible to
determine all metadata regarding small and large allocations in
constant and logarithmic time, respectively.
determine all metadata regarding small and large allocations in constant time.
.Pp
Small objects are managed in groups by page runs.
Each run maintains a bitmap that tracks which regions are in use.
Allocation requests that are no more than half the quantum (see the
.Dq Q
option) are rounded up to the nearest power of two (typically 2, 4, or 8).
Allocation requests that are no more than half the quantum (8 or 16, depending
on architecture) are rounded up to the nearest power of two.
Allocation requests that are more than half the quantum, but no more than the
maximum quantum-multiple size class (see the
.Dq S
minimum cacheline-multiple size class (see the
.Dq Q
option) are rounded up to the nearest multiple of the quantum.
Allocation requests that are larger than the maximum quantum-multiple size
class, but no larger than one half of a page, are rounded up to the nearest
power of two.
Allocation requests that are larger than half of a page, but small enough to
fit in an arena-managed chunk (see the
Allocation requests that are more than the minimum cacheline-multiple size
class, but no more than the minimum subpage-multiple size class (see the
.Dq C
option) are rounded up to the nearest multiple of the cacheline size (64).
Allocation requests that are more than the minimum subpage-multiple size class
are rounded up to the nearest multiple of the subpage size (256).
Allocation requests that are more than one page, but small enough to fit in
an arena-managed chunk (see the
.Dq K
option), are rounded up to the nearest run size.
Allocation requests that are too large to fit in an arena-managed chunk are
@ -402,8 +429,8 @@ rounded up to the nearest multiple of the chunk size.
.Pp
Allocations are packed tightly together, which can be an issue for
multi-threaded applications.
If you need to assure that allocations do not suffer from cache line sharing,
round your allocation requests up to the nearest multiple of the cache line
If you need to assure that allocations do not suffer from cacheline sharing,
round your allocation requests up to the nearest multiple of the cacheline
size.
.Sh DEBUGGING MALLOC PROBLEMS
The first thing to do is to set the

File diff suppressed because it is too large.

View File

@ -36,6 +36,7 @@
#include <pthread.h>
#include "un-namespace.h"
#include "libc_private.h"
#include "thr_private.h"
void _pthread_exit(void *status);
@ -95,6 +96,9 @@ _pthread_exit(void *status)
_thread_cleanupspecific();
}
/* Tell malloc that the thread is exiting. */
_malloc_thread_cleanup();
if (!_thr_isthreaded())
exit(0);