performance.
- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock; this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two (see the sketch after this list).
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree; this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.
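As a rough userland sketch of the header-space adjustment above (the header
struct and the size table are illustrative, not the real uma_bucket layout):

    #include <stdio.h>
    #include <stddef.h>

    /* Illustrative stand-in for the bucket header. */
    struct ub_header {
            int     ub_cnt;         /* free items currently in the bucket */
            int     ub_entries;     /* capacity of the item array */
    };

    int
    main(void)
    {
            /* Hypothetical bucket zone allocation sizes, in bytes. */
            const size_t sizes[] = { 32, 64, 128, 256 };

            for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                    /* Items per bucket after subtracting the header space. */
                    size_t entries = (sizes[i] - sizeof(struct ub_header)) /
                        sizeof(void *);
                    printf("%zu-byte bucket -> %zu entries\n",
                        sizes[i], entries);
            }
            return (0);
    }

Note how the resulting entry counts are "not quite powers of two" once the
header is subtracted.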
Sponsored by: EMC / Isilon Storage Division
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.
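As a rough illustration of the import/release paradigm for these keg-less
cache zones, a hedged sketch follows; the callback shapes and the backend
helpers (foo_backend_alloc/free) are assumptions, so check uma.h for the
exact prototypes.

    /*
     * Hypothetical cache-only zone of arbitrary pointer items.  The import
     * callback fills 'store' with up to 'count' items from some backing
     * store; the release callback hands them back.  No keg is involved, so
     * zone_first_keg(zone) returns NULL for such a zone.
     */
    static int
    foo_import(void *arg, void **store, int count, int flags)
    {
            int i;

            for (i = 0; i < count; i++) {
                    store[i] = foo_backend_alloc(arg, flags);  /* hypothetical */
                    if (store[i] == NULL)
                            break;
            }
            return (i);
    }

    static void
    foo_release(void *arg, void **store, int count)
    {
            int i;

            for (i = 0; i < count; i++)
                    foo_backend_free(arg, store[i]);           /* hypothetical */
    }

    zone = uma_zcache_create("foo cache", size, NULL, NULL, NULL, NULL,
        foo_import, foo_release, arg, 0);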
Sponsored by: EMC / Isilon Storage Division
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases (see the sketch after this list).
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.
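A small userland model of the free-item bitmap follows; the kernel uses the
BIT_SET/BIT_CLR/BIT_FFS macros from sys/bitset.h rather than the raw bit
operations shown here, and the second bitmap mentioned above plays the role
of the debug check in the comment.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uint64_t free_map = ~0ULL;      /* model: 64 items, all free */
            int idx;

            /* Allocate: find the first free item and clear its bit. */
            idx = __builtin_ffsll(free_map) - 1;
            free_map &= ~(1ULL << idx);
            printf("allocated item %d\n", idx);

            /*
             * Free: set the bit again.  With INVARIANTS, a second bitmap
             * can assert the item was not already marked free.
             */
            free_map |= 1ULL << idx;
            printf("item %d free again: %d\n", idx,
                (int)((free_map >> idx) & 1));
            return (0);
    }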
Sponsored by: EMC / Isilon Storage Division
for us_freecount.
This grows uma_slab_head on 32-bit arches, but the growth isn't
significant. Taking kmem zones as an example, only the 32 byte
zone is affected; ipers is reduced from 113 to 112.
In collaboration with: kib
These zones have slab size == sizeof(struct pcpu), but request enough pages
from the VM to fit (uk_slabsize * mp_ncpus). An item allocated from such a
zone has a separate twin for each CPU in the system, and these twins are at a
distance of sizeof(struct pcpu) from each other. This magic value of distance
allows us to make some optimizations later.
To address a private item from a particular CPU, simple arithmetic is used:
item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)
This arithmetic is available as the zpcpu_get() macro in pcpu.h.
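A minimal userland model of that layout; PCPU_STRIDE stands in for
sizeof(struct pcpu), NCPUS for mp_ncpus, and the macro mirrors the
zpcpu_get() arithmetic.

    #include <stdio.h>
    #include <stdlib.h>

    #define NCPUS           4       /* stands in for mp_ncpus */
    #define PCPU_STRIDE     256     /* stands in for sizeof(struct pcpu) */

    /* Mirrors zpcpu_get(): base + sizeof(struct pcpu) * cpu. */
    #define PCPU_PTR(base, cpu) \
            ((void *)((char *)(base) + (size_t)PCPU_STRIDE * (cpu)))

    int
    main(void)
    {
            /* One allocation holding a twin of the item for every CPU. */
            void *base = calloc(NCPUS, PCPU_STRIDE);

            for (int cpu = 0; cpu < NCPUS; cpu++) {
                    long *counter = PCPU_PTR(base, cpu);
                    *counter = cpu;         /* each CPU touches only its twin */
                    printf("cpu %d twin at %p = %ld\n",
                        cpu, (void *)counter, *counter);
            }
            free(base);
            return (0);
    }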
To introduce non-page-size slabs, a new field, uk_slabsize, has been added to
uma_keg. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, the uma_keg fields
were rearranged a bit and the least frequently used uk_name and uk_link moved
down to the fourth cache line. All other frequently dereferenced fields fit
into the first three cache lines.
Sponsored by: Nginx, Inc.
Replace the sub-optimal uma_zone_set_obj() primitive with the more modern
uma_zone_reserve_kva(). The new primitive reserves the necessary KVA
space beforehand to cater for zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater for the backend
allocator.
- uma_zone_reserve_kva() can cater for M_WAITOK requests, in order to
serve zones which also need to do uma_prealloc().
- When possible, uma_zone_reserve_kva() uses the direct mapping via
uma_small_alloc() rather than relying on the KVA / offset
combination.
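A hedged kernel-context sketch of the new primitive in use; the zone name,
item type and counts are made up, and error handling is omitted.

    /*
     * Create a zone and reserve KVA for at most 'maxitems' allocations up
     * front, replacing the old uma_zone_set_obj() object-backed approach.
     */
    zone = uma_zcreate("example zone", sizeof(struct example_item),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);

    /* Returns a success indication; checking is omitted in this sketch. */
    uma_zone_reserve_kva(zone, maxitems);

    /* Because M_WAITOK requests can be catered for, prealloc also works. */
    uma_prealloc(zone, maxitems);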
The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls are no longer homogeneous: there are now small
differences between the arguments passed to mtx_init().
Sponsored by: EMC / Isilon Storage Division
Reviewed by: alc (who also offered almost all the comments)
Tested by: pho, jhb, davide
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.
All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.
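For reference, a userland snippet that flips the knob via sysctlbyname();
writing 0 disables all UMA warnings, as described above.

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
            int off = 0;

            /* Globally turn off UMA zone-full warnings. */
            if (sysctlbyname("vm.zone_warnings", NULL, NULL,
                &off, sizeof(off)) != 0)
                    perror("sysctlbyname");
            return (0);
    }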
Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup() (see the sketch after this list).
2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.
3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.
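A sketch of the three-valued "booted" state described in [1]; the stage
names are illustrative only, not the identifiers used in uma_core.c.

    /* Illustrative names for the three initialization stages. */
    enum uma_boot_state {
            BOOT_COLD,      /* before uma_startup(): everything comes from
                               startup_alloc() and the boot pages */
            BOOT_STRAPPED,  /* after uma_startup(): single-page allocations
                               may use uma_small_alloc(); multipage
                               allocations still use startup_alloc() */
            BOOT_RUNNING    /* after uma_startup2(): normal operation */
    };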
Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno
of times the system was forced to sleep when requesting a new allocation.
Expand the debugger hook, db_show_uma, to display these results as well.
This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.
Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.
Sponsored by: Nokia
statistics via a binary structure stream:
- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.
- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.
- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.
- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).
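A hedged userland sketch of walking such a stream follows; the struct
layouts are simplified stand-ins for uma_stream_header, uma_type_header and
uma_percpu_stat, whose real definitions live in the kernel headers.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Simplified stand-ins; the real field names and sizes differ. */
    struct stream_header { uint32_t version, maxcpus, count; };
    struct type_header   { char name[32]; uint64_t size, limit, pages; };
    struct percpu_stat   { uint64_t allocs, frees, cache_free; };

    static void
    parse_stream(const char *buf, size_t len)
    {
            struct stream_header sh;
            size_t off = 0;

            memcpy(&sh, buf + off, sizeof(sh));
            off += sizeof(sh);

            /* One type header per zone, each followed by maxcpus records. */
            for (uint32_t z = 0; z < sh.count && off < len; z++) {
                    struct type_header th;

                    memcpy(&th, buf + off, sizeof(th));
                    off += sizeof(th);

                    for (uint32_t c = 0; c < sh.maxcpus; c++) {
                            struct percpu_stat ps;

                            memcpy(&ps, buf + off, sizeof(ps));
                            off += sizeof(ps);
                            printf("%s cpu%u: %ju allocs\n", th.name, c,
                                (uintmax_t)ps.allocs);
                    }
            }
    }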
This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.
While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user space buffers to
receive the stream.
A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.
MFC after: 1 week
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred or circumstances in the current cache have changed (a sketch of
this pattern follows below).
Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.
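As a simplified kernel-context sketch of that pattern (not the actual
uma_core.c code; field and macro names are taken from uma_int.h but the
details are elided):

    static void *
    cache_alloc_sketch(uma_zone_t zone)
    {
            uma_cache_t cache;
            uma_bucket_t bucket;
            void *item;

            critical_enter();
            cache = &zone->uz_cpu[curcpu];  /* no per-CPU mutex any more */
            bucket = cache->uc_allocbucket;
            if (bucket != NULL && bucket->ub_cnt > 0) {
                    item = bucket->ub_bucket[--bucket->ub_cnt];
                    critical_exit();
                    return (item);
            }

            /* Need the zone: leave the critical section before the lock. */
            critical_exit();
            ZONE_LOCK(zone);
            /* ... obtain or fill a bucket here ... */
            ZONE_UNLOCK(zone);

            /*
             * Re-enter the critical section and re-evaluate which per-CPU
             * cache we are using; the thread may have migrated meanwhile.
             */
            critical_enter();
            cache = &zone->uz_cpu[curcpu];
            critical_exit();
            return (NULL);                  /* sketch ends here */
    }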
Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:
sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h
- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportional to objectsize. This implies that
for objects up to or equal to a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers, we first find the largest objectsize for which
we are guaranteed not to go offpage and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers versus objectsize is inversely proportional (and
monotonically decreasing), the ipers computed is always >= what
we will ever need in offpage slab headers (see the sketch after
this list).
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.
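A userland sketch of the sizing arithmetic above; UMA_SLAB_SIZE,
UMA_MAX_WASTE and the in-page slab header size are stand-in values chosen
only to make the behaviour visible.

    #include <stdio.h>

    #define SLAB_SIZE   4096                /* stands in for UMA_SLAB_SIZE */
    #define MAX_WASTE   (SLAB_SIZE / 10)    /* stands in for UMA_MAX_WASTE */
    #define SLAB_HDR    64                  /* assumed in-page header size */

    int
    main(void)
    {
            int size, ipers, waste, max_offpage_ipers = 0;

            for (size = 16; size <= SLAB_SIZE - SLAB_HDR; size += 16) {
                    /* zone_small_init(): try to stash the header in-page. */
                    ipers = (SLAB_SIZE - SLAB_HDR) / size;
                    waste = SLAB_SIZE - SLAB_HDR - ipers * size;
                    if (waste < MAX_WASTE)
                            continue;       /* header stays in the page */
                    /*
                     * First size that may go OFFPAGE: compute ipers as if
                     * the whole page held items.  Larger sizes only yield
                     * fewer ipers, so this bounds what the offpage slab
                     * header zone must be able to describe.
                     */
                    max_offpage_ipers = SLAB_SIZE / size;
                    break;
            }
            printf("max offpage ipers: %d\n", max_offpage_ipers);
            return (0);
    }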
This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).
Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.
Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and looks up the corresponding refcnt.
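A brief kernel-context illustration of that lookup; the zone and item names
are hypothetical, and the exact prototype should be taken from uma.h.

    /* Bump the reference count stashed at the end of the item's slab. */
    uint32_t *refcnt;

    refcnt = uma_find_refcnt(cluster_zone, item);
    (*refcnt)++;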
mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change certain code paths that always used to do
m_get() + m_clget() to instead just use m_getcl(), taking
advantage of the newly defined secondary Packet zone (see
the sketch after this list).
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer how stats will work
within the modified framework.
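The before/after for that code-path change, as a rough sketch with error
handling trimmed:

    /* Before: mbuf and cluster allocated separately. */
    m = m_get(M_NOWAIT, MT_DATA);
    if (m != NULL) {
            m_clget(m, M_NOWAIT);
            if ((m->m_flags & M_EXT) == 0) {
                    m_freem(m);
                    m = NULL;
            }
    }

    /* After: a single allocation served from the secondary Packet zone. */
    m = m_getcl(M_NOWAIT, MT_DATA, 0);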
From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.
Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.
This change removes more code than it adds.
A paper detailing the change and considering various performance
issues is available; it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.
Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)
pmap_init(). Such a large preallocation is unnecessary and wastes
nearly eight megabytes of kernel virtual address space per gigabyte
of managed physical memory.
- Increase UMA_BOOT_PAGES by two. This enables the removal of
pmap_pv_allocf(). (Note: this function was only used during
initialization, specifically, after pmap_init() but before
pmap_init2(). During pmap_init2(), a new allocator is installed.)
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how now to shoot their feet.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destructed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.
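A userland model of the ub_cnt semantics follows; the struct is a simplified
stand-in for the real bucket layout.

    #include <assert.h>
    #include <stdio.h>

    /* Simplified stand-in for a bucket. */
    struct bucket {
            int     ub_cnt;         /* free items currently held */
            int     ub_entries;     /* how many items fit in the bucket */
            void    *ub_bucket[8];
    };

    int
    main(void)
    {
            struct bucket b = { .ub_cnt = 0, .ub_entries = 8 };
            int value = 42;
            void *item;

            /* Free to the bucket: store, then count up -- no "ptr - 1". */
            assert(b.ub_cnt < b.ub_entries);
            b.ub_bucket[b.ub_cnt++] = &value;

            /* Allocate from the bucket: count down, then load. */
            assert(b.ub_cnt > 0);
            item = b.ub_bucket[--b.ub_cnt];

            printf("got back %d\n", *(int *)item);
            return (0);
    }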
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
Add a KASSERT in zone_small_init to make sure that the computed
ipers (items per slab) for the zone is not zero, despite the addition
of the check, just to be sure (this part submitted by: silby)
- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't set up the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.
- In sysctl_vm_zone, use the per cpu locks to read the current cache
statistics; this makes them more accurate under heavy load.
Submitted by: tegge
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).
No Objections from jeff.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.
The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.
- Add a new zcreate option ZONE_VM, which sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.
- Use the ZONE_VM flag on vm map entries and pv entries on x86.
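A hedged kernel-context sketch of creating such a zone; the zone name is
made up and the flag spelling follows uma.h (UMA_ZONE_VM) rather than the
shorthand used above.

    /*
     * VM-internal zone: buckets for it are allocated with M_NOVM, i.e.
     * only from the bucket cache, never by recursing into the VM.
     */
    zone = uma_zcreate("MAP ENTRY", sizeof(struct vm_map_entry),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM);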
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.