.\"
.\" ----------------------------------------------------------------------------
.\" "THE BEER-WARE LICENSE" (Revision 42):
.\" <phk@login.dknet.dk> wrote this file.  As long as you retain this notice you
.\" can do whatever you want with this stuff. If we meet some day, and you think
.\" this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
.\" ----------------------------------------------------------------------------
.\"
.\" $Id$
.\"
.ds RH Implementation
.NH
Implementation
.PP
A new malloc(3) implementation was written to meet the goals,
and to the extent possible to address the shortcomings listed previously.
.PP
The source is 1218 lines of C code, and can be found in FreeBSD 2.2
(and probably later versions as well) as src/lib/libc/stdlib/malloc.c.
.PP
The main data structure is the
.I page-directory
which contains a
.B void*
for each page we have control over.
The value can be one of:
.IP
.B MALLOC_NOT_MINE
Another part of the code may call brk(2) to get a piece of the cake.
Consequently we cannot rely on the memory we get from the kernel to
be one contiguous piece of memory and therefore we need a way to
mark such pages as "untouchable".
.IP
.B MALLOC_FREE
This is a free page.
.IP
.B MALLOC_FIRST
This is the first page in a (multi-)page allocation.
.IP
.B MALLOC_FOLLOW
This is a subsequent page in a multi-page allocation.
.IP
.B
struct pginfo*
.R
A pointer to a structure describing a partitioned page.
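.PP
As a rough illustration, the entry kinds listed above can be told
apart with plain pointer comparisons.
The numeric encodings and the names
.B classify
and
.B pages_in_use
are assumptions made for this sketch; the text only names the states.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encodings: small integers that can never be a valid pginfo address. */
#define MALLOC_NOT_MINE ((void *)0)
#define MALLOC_FREE     ((void *)1)
#define MALLOC_FIRST    ((void *)2)
#define MALLOC_FOLLOW   ((void *)3)

struct pginfo;                       /* layout is not needed for this sketch */

enum page_kind { PG_NOT_MINE, PG_FREE, PG_FIRST, PG_FOLLOW, PG_CHUNKS };

/* Classify one page-directory entry. */
enum page_kind classify(void *entry)
{
    if (entry == MALLOC_NOT_MINE) return PG_NOT_MINE;
    if (entry == MALLOC_FREE)     return PG_FREE;
    if (entry == MALLOC_FIRST)    return PG_FIRST;
    if (entry == MALLOC_FOLLOW)   return PG_FOLLOW;
    return PG_CHUNKS;                /* anything else is a struct pginfo* */
}

/* Count the directory entries that currently hold live data. */
size_t pages_in_use(void **dir, size_t n)
{
    size_t used = 0;

    assert(dir != NULL);
    for (size_t i = 0; i < n; i++) {
        enum page_kind k = classify(dir[i]);
        if (k == PG_FIRST || k == PG_FOLLOW || k == PG_CHUNKS)
            used++;
    }
    return used;
}
```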
.PP
In addition there exists a linked list of small data structures that
describe the free space as runs of free pages.
.PP
Notice that these structures are not part of the free pages themselves,
but are allocated with malloc, so that the free pages themselves
are never referenced while they are free.
.PP
When a request for storage comes in, it will be treated as a ``page''
allocation if it is bigger than half a page.
The free list is searched and the first run of free pages that
can satisfy the request is used.
The first page gets set to
.B MALLOC_FIRST
status; if more than one page is needed, the rest of them get
.B MALLOC_FOLLOW
status in the page-directory.
.PP
If there are no suitable pages on the free list, brk(2) is called, and
the new pages are added to the page-directory with status
.B MALLOC_FREE
and the search restarts.
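.PP
The first-fit search described above might look roughly like this.
The record layout
.B struct pgfree ,
the function name
.B take_pages
and the
.B NO_PAGES
return value are hypothetical; a real caller would fall back to
brk(2) and retry when nothing fits.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PAGES ((size_t)-1)        /* assumed "nothing fits" marker */

/* Hypothetical record describing one run of free pages. */
struct pgfree {
    struct pgfree *next;             /* next run in the list */
    size_t first;                    /* index of the first page in the run */
    size_t npages;                   /* number of pages in the run */
};

/*
 * First fit: carve `want` pages off the front of the first run that is
 * long enough.  Returns the index of the first page taken, or NO_PAGES.
 * A run that becomes empty is unlinked; its record stays with the caller.
 */
size_t take_pages(struct pgfree **head, size_t want)
{
    assert(head != NULL && want > 0);
    for (struct pgfree **pp = head; *pp != NULL; pp = &(*pp)->next) {
        struct pgfree *run = *pp;
        size_t first;

        if (run->npages < want)
            continue;                /* too short, try the next run */
        first = run->first;
        run->first += want;
        run->npages -= want;
        if (run->npages == 0)
            *pp = run->next;         /* run used up completely */
        return first;
    }
    return NO_PAGES;
}
```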
.PP
Freeing a number of pages is done by changing their state in the
page directory to MALLOC_FREE, and then traversing the free list to
find the right place for this run of pages, possibly collapsing it
with the two neighbouring runs into one run and, if possible,
releasing some memory back to the kernel by calling brk(2).
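.PP
A minimal sketch of the insertion step, assuming the free list is
kept sorted by page index.
The names
.B struct pgfree
and
.B insert_run
are made up for the example, and handing memory back through brk(2)
is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record describing one run of free pages. */
struct pgfree {
    struct pgfree *next;
    size_t first;                    /* index of the first page in the run */
    size_t npages;                   /* number of pages in the run */
};

/*
 * Insert run `r` into a list sorted by page index, merging it with the
 * run before and/or after it when they are adjacent.  Returns the record
 * that now covers the freed pages.  Records that get absorbed are simply
 * dropped from the list; releasing them is left to the caller.
 */
struct pgfree *insert_run(struct pgfree **head, struct pgfree *r)
{
    struct pgfree *prev = NULL, *cur = *head;

    assert(r != NULL && r->npages > 0);
    while (cur != NULL && cur->first < r->first) {
        prev = cur;
        cur = cur->next;
    }
    /* Merge with the following run if it starts where r ends. */
    if (cur != NULL && r->first + r->npages == cur->first) {
        r->npages += cur->npages;
        cur = cur->next;
    }
    r->next = cur;
    /* Merge with the preceding run if it ends where r starts. */
    if (prev != NULL && prev->first + prev->npages == r->first) {
        prev->npages += r->npages;
        prev->next = r->next;
        return prev;
    }
    if (prev != NULL)
        prev->next = r;
    else
        *head = r;
    return r;
}
```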
.PP
If the request is less than or equal to half of a page, its size is
rounded up to the next power of two before being processed,
and if the request is less than some minimum size, it is rounded up to
that size.
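.PP
The rounding rule can be written in a few lines.
The page size of 4096 bytes and the minimum chunk size of 16 bytes
are assumptions for the example, as is the name
.B chunk_size .

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u             /* assumed page size */
#define MIN_CHUNK  16u               /* assumed minimum chunk size */

/*
 * Chunk size used for a sub-page request: the smallest power of two that
 * is at least MIN_CHUNK and at least `request`.  Returns 0 when the
 * request is larger than half a page and must be a page allocation.
 */
size_t chunk_size(size_t request)
{
    size_t size = MIN_CHUNK;

    if (request > PAGE_BYTES / 2)
        return 0;
    while (size < request)
        size <<= 1;
    assert((size & (size - 1)) == 0);    /* result is a power of two */
    return size;
}
```
.PP
With these values a request for 100 bytes is served from a 128-byte
chunk, and a request for 2049 bytes is treated as a page allocation.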
.PP
These sub-page allocations are served from pages which are split up
into some number of equal size chunks.
For each of these pages a
.B
struct pginfo
.R
describes the size of the chunks on this page, how many there are,
how many are free and so on.
The description consists of a bitmap of used chunks, and various counters
and numbers used to keep track of the stuff in the page.
.PP
For each size of sub-page allocation, the pginfo structures for the
pages that have free chunks in them form a list.
The heads of these lists are stored in predetermined slots at
the beginning of the page directory to make access fast.
.PP
To allocate a chunk of some size, the head of the list for the
corresponding size is examined and a free chunk is found.
The number of free chunks on that page is decreased by one and,
if it reaches zero, the pginfo structure is unlinked from the list.
.PP
To free a chunk, the page is derived from the pointer.
The page-directory entry for that page contains a pointer to the
pginfo structure, where the free bit is set for the chunk and
the number of free chunks is increased by one.
If the count is now one, the pginfo structure is linked into the
proper place on the list for this size of chunks.
If the count increases to match the number of chunks on the page, the
pginfo structure is unlinked from the list and free(3)'ed and the
actual page itself is free(3)'ed too.
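.PP
The chunk bookkeeping described above can be sketched as follows.
The field names, the fixed-size bitmap and the constants are
assumptions for the example; following the description of freeing,
a set bit marks a free chunk.
Releasing the page and the pginfo structure is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u             /* assumed page size */
#define MIN_CHUNK  16u               /* assumed minimum chunk size */
#define MAX_CHUNKS (PAGE_BYTES / MIN_CHUNK)

struct pginfo {
    struct pginfo *next;             /* next page with free chunks of this size */
    unsigned char *page;             /* start of the partitioned page */
    unsigned size;                   /* chunk size in bytes */
    unsigned total;                  /* number of chunks on the page */
    unsigned nfree;                  /* number of chunks currently free */
    unsigned char freemap[MAX_CHUNKS / 8];   /* bit set = chunk is free */
};

/* Describe a fresh page split into chunks of `size` bytes, all free. */
void pginfo_init(struct pginfo *pi, unsigned char *page, unsigned size)
{
    assert(size >= MIN_CHUNK && size <= PAGE_BYTES / 2);
    pi->next = NULL;
    pi->page = page;
    pi->size = size;
    pi->total = pi->nfree = PAGE_BYTES / size;
    for (unsigned i = 0; i < sizeof pi->freemap; i++)
        pi->freemap[i] = 0;
    for (unsigned i = 0; i < pi->total; i++)
        pi->freemap[i / 8] |= (unsigned char)(1u << (i % 8));
}

/* Take the lowest free chunk from the head page; unlink the page when full. */
void *chunk_alloc(struct pginfo **head)
{
    struct pginfo *pi = *head;

    if (pi == NULL)
        return NULL;                 /* caller would set up a new page */
    for (unsigned i = 0; i < pi->total; i++) {
        if (pi->freemap[i / 8] & (1u << (i % 8))) {
            pi->freemap[i / 8] &= (unsigned char)~(1u << (i % 8));
            if (--pi->nfree == 0)
                *head = pi->next;    /* no free chunks left on this page */
            return pi->page + (size_t)i * pi->size;
        }
    }
    return NULL;                     /* not reached if nfree is consistent */
}

/* Free one chunk.  Returns 1 when the whole page has become free. */
int chunk_free(struct pginfo **head, struct pginfo *pi, void *ptr)
{
    size_t off = (size_t)((unsigned char *)ptr - pi->page);
    unsigned i = (unsigned)(off / pi->size);
    struct pginfo **pp;

    assert(off % pi->size == 0 && i < pi->total);
    assert(!(pi->freemap[i / 8] & (1u << (i % 8))));   /* not already free */
    pi->freemap[i / 8] |= (unsigned char)(1u << (i % 8));
    pi->nfree++;
    if (pi->nfree == 1) {            /* page was full: link in address order */
        pp = head;
        while (*pp != NULL && (uintptr_t)(*pp)->page < (uintptr_t)pi->page)
            pp = &(*pp)->next;
        pi->next = *pp;
        *pp = pi;
    }
    if (pi->nfree == pi->total) {    /* page is empty: take it off the list */
        pp = head;
        while (*pp != NULL && *pp != pi)
            pp = &(*pp)->next;
        if (*pp == pi)
            *pp = pi->next;
        return 1;
    }
    return 0;
}
```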
.PP
To be 100% correct performance-wise, these lists should be ordered
according to the number of recent accesses to each page.
This information is not available, and it would essentially mean a reordering
of the list on every memory reference to keep it up-to-date.
Instead they are ordered according to the address of the pages.
Interestingly enough, in practice this comes out to almost the same
thing performance-wise.
.PP
It's not that surprising after all: it's the difference between
following the crowd and actively directing where it can go; either
way you can end up in the middle of it all.
.PP
A side effect of this compromise is that it also uses less storage,
and the list never has to be reordered; all the ordering happens when
pages are added or deleted.
.PP
It is an interesting twist to the implementation that the
.B
struct pginfo
.R
is allocated with malloc.
That is, "as with malloc" to be painfully correct.
The code knows the special case where the first (couple of) allocations on
the page are actually the pginfo structure and deals with it accordingly.
This avoids some silly "chicken and egg" issues.
.ds RH Bells and whistles.
.NH
Bells and whistles.
.PP
brk(2) is actually not a very fast system call when you ask for storage.
This is mainly because the kernel needs to zero the pages before
handing them over, so this implementation does not release
heap pages until there is a large chunk to release back to the kernel.
Chances are pretty good that we will need it again pretty soon anyway.
Since these pages are not accessed at all, they will soon be paged out
and don't affect anything but swap-space usage.
.PP
The page directory is actually kept in a mmap(2)'ed piece of
anonymous memory.
This avoids some rather silly cases that
would otherwise have to be handled when the page directory
.PP
One particularly nice feature is that all pointers passed to free(3)
and realloc(3) can be checked conclusively for validity.
First the pointer is masked to find the page.
The page directory is then examined: it must contain either MALLOC_FIRST,
in which case the pointer must point exactly at the start of the page,
or a struct pginfo*, in which case the pointer must point to one of
the chunks described by that structure.
Warnings will be printed on stderr and nothing will be done with
the pointer in case it is found to be invalid.
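.PP
The validity check might be sketched like this.
The sentinel encodings, the trimmed-down
.B struct pginfo
and the way the directory is addressed (a page-aligned base address
plus a page count) are assumptions for the example; printing the
warning is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u             /* assumed page size */

/* Assumed encodings of the page-directory states. */
#define MALLOC_NOT_MINE ((void *)0)
#define MALLOC_FREE     ((void *)1)
#define MALLOC_FIRST    ((void *)2)
#define MALLOC_FOLLOW   ((void *)3)

/* Only the fields needed for the check. */
struct pginfo {
    unsigned size;                   /* chunk size in bytes */
    unsigned total;                  /* number of chunks on the page */
};

/*
 * Returns 1 if `ptr` is something the allocator could have handed out,
 * 0 otherwise.  `dir` has one entry per page, starting at address `base`.
 */
int pointer_ok(void **dir, uintptr_t base, size_t npages, const void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t off = addr & (PAGE_BYTES - 1);   /* offset within the page */
    const struct pginfo *pi;
    size_t idx;
    void *entry;

    assert(base % PAGE_BYTES == 0);
    if (addr < base)
        return 0;
    idx = (size_t)((addr - base) / PAGE_BYTES);
    if (idx >= npages)
        return 0;
    entry = dir[idx];
    if (entry == MALLOC_FIRST)
        return off == 0;             /* must point exactly at the page */
    if (entry == MALLOC_NOT_MINE || entry == MALLOC_FREE ||
        entry == MALLOC_FOLLOW)
        return 0;                    /* nothing here that can be freed */
    pi = entry;                      /* partitioned page */
    return off % pi->size == 0 && off / pi->size < pi->total;
}
```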
.PP
An environment variable
.B MALLOC_OPTIONS
allows the user some control over the behaviour of malloc.
Some of the more interesting options are:
.IP
.B Abort
If malloc fails to allocate storage, core-dump the process with
a message rather than expecting it to handle this correctly.
It's amazing how few programs actually handle this condition correctly,
and consequently the havoc they can create is all the more creative or
destructive.
.IP
.B Realloc
Always do a free and malloc when realloc(3) is called.  The default
is to leave things alone if the size of the allocation is still in
the same size-class.
For programs doing garbage collection using realloc(3), this makes the
heap collapse faster, since malloc will reallocate from the
lowest available address.
.IP
.B Junk
Explicitly fill the allocated area with a particular value
to try to detect if programs rely on it being zero.
.IP
.B Zero
Explicitly zero out the allocated chunk of memory, while any
space after the allocation in the chunk is filled with the
junk value to try to catch out-of-chunk references.
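.PP
A sketch of how such an option string could be turned into flags.
The single-letter encoding (upper case switches an option on, lower
case switches it off) and all names in the example are assumptions;
the text above only names the options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of switches, one per option named above. */
struct malloc_opts {
    int abort_on_fail;               /* Abort */
    int always_realloc;              /* Realloc */
    int junk;                        /* Junk */
    int zero;                        /* Zero */
};

/* Apply an option string to `o`; unknown letters are ignored. */
void parse_options(const char *s, struct malloc_opts *o)
{
    assert(o != NULL);
    for (; s != NULL && *s; s++) {
        switch (*s) {
        case 'A': o->abort_on_fail = 1;  break;
        case 'a': o->abort_on_fail = 0;  break;
        case 'R': o->always_realloc = 1; break;
        case 'r': o->always_realloc = 0; break;
        case 'J': o->junk = 1;           break;
        case 'j': o->junk = 0;           break;
        case 'Z': o->zero = 1;           break;
        case 'z': o->zero = 0;           break;
        default:                         break;
        }
    }
}
```
.PP
A caller would typically pass the result of getenv(3) for
.B MALLOC_OPTIONS ;
an unset variable (a null pointer) leaves the flags untouched.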
.ds RH The road not yet taken.
.NH
The road not yet taken.
.PP
A couple of avenues were explored that could be interesting in some
set of circumstances.
.PP
Using mmap(2) instead of brk(2) was actually slower, since brk(2)
knows a lot of the things that mmap has to find out first.
.PP
A system call where we could tell the kernel that "we don't
need the contents of this page anymore" would allow us to
return the pages on the free list to the kernel and to instruct
the kernel that it doesn't need to page them out or in.
It would save some page-out events, and the page-in would be replaced
by a zero-fill page.
This is, according to the VM gods in the FreeBSD camp, "easy",
and it will probably be attempted at some point in the future.
.PP
In general there is little room for further improvement of the
time-overhead of malloc; further improvements will have to
be in the area of improving paging behaviour.
.PP
It is still under consideration to add a feature such that
if realloc is called with two zero arguments, the internal
allocations will be reallocated to perform a garbage collection.
This could be used in certain types of programs to collapse
the memory use, but so far it doesn't seem to be worth the effort.
.PP
Malloc/Free can be a significant point of contention in multi-threaded
programs.
Fine-grained locking of the data structures inside the
implementation should be added to avoid excessive spin-waiting.
.\" A little paper about phkmalloc. 1996-04-13 08:30:21 +00:00