Commit Graph

133 Commits

Author SHA1 Message Date
Justin Hibbits
12f691959f Align UMA data to 128 byte cacheline size
Suggested by:	mjg
2018-06-04 15:44:17 +00:00
Mateusz Guzik
782e38aa48 uma: increase alignment to 128 bytes on amd64
Current UMA internals are not suited for efficient operation in
multi-socket environments. In particular there is very common use of
MAXCPU arrays and other fields which are not always properly aligned and
are not local for target threads (apart from the first node of course).
Turns out the existing UMA_ALIGN macro can be used to mostly work around
the problem until the code get fixed. The current setting of 64 bytes
runs into trouble when adjacent cache line prefetcher gets to work.

An example 128-way benchmark doing a lot of malloc/frees has the following
instruction samples:

before:
kernel`lf_advlockasync+0x43b            32940
          kernel`malloc+0xe5            42380
           kernel`bzero+0x19            47798
   kernel`spinlock_exit+0x26            60423
         kernel`0xffffffff80            78238
                         0x0           136947
   kernel`uma_zfree_arg+0x46           159594
 kernel`uma_zalloc_arg+0x672           180556
   kernel`uma_zfree_arg+0x2a           459923
 kernel`uma_zalloc_arg+0x5ec           489910

after:
            kernel`bzero+0xd            46115
kernel`lf_advlockasync+0x25f            46134
kernel`lf_advlockasync+0x38a            49078
   kernel`fget_unlocked+0xd1            49942
kernel`lf_advlockasync+0x43b            55392
          kernel`copyin+0x4a            56963
           kernel`bzero+0x19            81983
   kernel`spinlock_exit+0x26            91889
         kernel`0xffffffff80           136357
                         0x0           239424

See the review for more details.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D15346
2018-05-11 07:04:57 +00:00
Gleb Smirnoff
5073a08328 Fix three miscalculations in amount of boot pages:
o Most of startup zones have struct uma_slab embedded into the slab,
  so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE,
  when calculating how many pages would certain kind of allocations
  require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
  need +1 when calculating amount of pages for kegs. [1]
o The zones of zones and zones of kegs have arbitrary alignment of 32,
  and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by:	pho [1], jtl [2]
2018-02-07 18:32:51 +00:00
Gleb Smirnoff
f4bef67c9c Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
  vm_page_startup() finishes. After this stage we don't yet enable buckets,
  but we can ask VM for pages. Rename stages to meaningful names while here.
  New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
  BOOT_RUNNING.
  Enabling page alloc earlier allows us to dramatically reduce number of
  boot pages required. What is more important number of zones becomes
  consistent across different machines, as no MD allocations are done before
  the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
  startup_alloc(), however that may change, so vm_page_startup() provides
  its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
  functions calculates sizes of zones zone and kegs zone, and calculates how
  many pages UMA will need to bootstrap.
  It counts not only of zone structures, but also of kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
  mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
  but also as args.size.

Reviewed by:		imp, gallatin (earlier version)
Differential Revision:	https://reviews.freebsd.org/D14054
2018-02-06 04:16:00 +00:00
Jeff Roberson
ab3185d15e Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
Jeff Roberson
ad5b0f5b51 Fix arc after r326347 broke various memory limit queries. Use UMA features
rather than kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR:		224330 (partial), 224080
Reviewed by:	markj, avg
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13494
2018-01-02 04:35:56 +00:00
Jeff Roberson
2e47807c21 Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.
The arena argument to kmem_*() is now only used in an assert.  A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA.  When
the soft limit is exceeded we periodically wakeup the UMA reclaim thread to
attempt to shrink KVA.  On 32bit architectures this should behave much more
gracefully as we exhaust KVA.  On 64bit the limits are likely never hit.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13187
2017-11-28 23:40:54 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Mark Johnston
e04223bf94 Include _bitset.h to get BITSET_DEFINE, used to define struct slabbits.
MFC after:	1 week
2017-09-15 14:59:35 +00:00
Mark Johnston
2d54d4bb9f Widen uk_pgoff, the slab header offset field.
16 bits is only wide enough for kegs with an item size of up to 64KB.
At that size or larger, slab headers are typically offpage because the
item size is a multiple of the page size, but there is no requirement
that this be the case.

We can widen the field without affecting the layout of struct uma_keg
since the removal of uk_slabsize in r315077 left an adjacent hole.

PR:		218911
MFC after:	2 weeks
2017-09-13 21:54:37 +00:00
Andriy Gapon
a55ebb7cd5 uma: eliminate uk_slabsize field
The field was not used beyond the initial keg setup stage anyway.

MFC after:	1 month (if ever)
2017-03-11 16:35:36 +00:00
Colin Percival
34caa842a4 Autotune the number of pages set aside for UMA startup based on the number
of CPUs present.  On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.

The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory:  UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg.  If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).

This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64).  Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger.  It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.

The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.

Tested on:	EC2 x1.32xlarge
Relnotes:	FreeBSD can now boot on 92+ CPU systems without requiring
		vm.boot_pages to be manually adjusted.
Reviewed by:	jeff, alc, adrian
Approved by:	re (kib)
2016-07-07 18:37:12 +00:00
Pedro F. Giffuni
763df3ec55 sys/vm: minor spelling fixes in comments.
No functional change.
2016-05-02 20:16:29 +00:00
Gleb Smirnoff
cfcae3f86f Remove UMA_ZONE_REFCNT feature, now unused.
Blessed by:	jeff
2016-03-01 00:33:32 +00:00
Gleb Smirnoff
b28cc462ad Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by:	jhb
2016-02-09 20:22:35 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
Scott Long
43ffa928f9 Revert r281451. It causes a panic/hang early in boot for a number of
users, myself included.  The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.

Obtained from:	Netflix, Inc.
MFC after:	immediately
2015-04-24 17:03:53 +00:00
Dmitry Chagin
51cfb0be84 Rework r281162. Indeed, the flexible array member is preferable here.
Suggested by:   Justin T. Gibbs

MFC after:	3 days
2015-04-12 06:21:58 +00:00
Ryan Stone
f2c2231e0c Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
Alexander Motin
ace66b568e Implement soft pressure on UMA cache bucket sizes.
Every time system detects low memory condition decrease bucket sizes for
each zone by one item.  As result, higher memory pressure will push to
smaller bucket sizes and so smaller per-CPU caches and so more efficient
memory use.

Before this change there was no force to oppose buckets growth as result
of practically inevitable zone lock conflicts, and after some run time
per-CPU caches could consume enough RAM to kill the system.
2013-11-19 10:05:53 +00:00
Konstantin Belousov
9eab548476 PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store pointer to slab. Remove it.

Reviewed by:    alc
Sponsored by:   The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-17 07:35:26 +00:00
Konstantin Belousov
c325e866f4 Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue.  Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked.  See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members.  Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-10 17:36:42 +00:00
Gleb Smirnoff
dab12c75bb Since r251709 a slab no longer use 8-bit indicies to manage items,
thus remove a stale comment.

Reviewed by:	jeff
2013-07-24 06:13:00 +00:00
Jeff Roberson
6fd34d6f67 - Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
 - Make some smaller buckets for large zones to further reduce memory
   waste.
 - Implement uma_zone_reserve().  This holds aside a number of items only
   for callers who specify M_USE_RESERVE.  buckets will never be filled
   from reserve allocations.

Sponsored by:	EMC / Isilon Storage Division
2013-06-26 00:57:38 +00:00
Jeff Roberson
af5263743c - Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking.  This functionally changes
   almost nothing.
 - Add a size parameter to uma_zcache_create() so we can size the buckets.
 - Pass the zone to bucket_alloc() so it can modify allocation flags
   as appropriate.
 - Fix a bug in zone_alloc_bucket() where I missed an address of operator
   in a failure case.  (Found by pho)

Sponsored by:	EMC / Isilon Storage Division
2013-06-20 19:08:12 +00:00
Jeff Roberson
fc03d22b17 Refine UMA bucket allocation to reduce space consumption and improve
performance.

 - Always free to the alloc bucket if there is space.  This gives LIFO
   allocation order to improve hot-cache performance.  This also allows
   for zones with a single bucket per-cpu rather than a pair if the entire
   working set fits in one bucket.
 - Enable per-cpu caches of buckets.  To prevent recursive bucket
   allocation one bucket zone still has per-cpu caches disabled.
 - Pick the initial bucket size based on a table driven maximum size
   per-bucket rather than the number of items per-page.  This gives
   more sane initial sizes.
 - Only grow the bucket size when we face contention on the zone lock, this
   causes bucket sizes to grow more slowly.
 - Adjust the number of items per-bucket to account for the header space.
   This packs the buckets more efficiently per-page while making them
   not quite powers of two.
 - Eliminate the per-zone free bucket list.  Always return buckets back
   to the bucket zone.  This ensures that as zones grow into larger
   bucket sizes they eventually discard the smaller sizes.  It persists
   fewer buckets in the system.  The locking is slightly trickier.
 - Only switch buckets in zalloc, not zfree, this eliminates pathological
   cases where we ping-pong between two buckets.
 - Ensure that the thread that fills a new bucket gets to allocate from
   it to give a better upper bound on allocation time.

Sponsored by:	EMC / Isilon Storage Division
2013-06-18 04:50:20 +00:00
Jeff Roberson
0095a78419 - Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
   pointer items.  These zones have no kegs.
 - Convert the regular keg based allocator to use the new import/release
   functions.
 - Move some stats to be atomics since they would require excessive zone
   locking/unlocking with the new import/release paradigm.  Make
   zone_free_item simpler now that callers can manage more stats.
 - Check for these cache-only zones in the public APIs and debugging
   code by checking zone_first_keg() against NULL.

Sponsored by:	EMC / Isilong Storage Division
2013-06-17 03:43:47 +00:00
Jeff Roberson
ef72505e6d - Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset.  This is much simpler, has lower space
   overhead and is cheaper in most cases.
 - Use a second bitmap for invariants asserts and improve the quality of
   the asserts as well as the number of erroneous conditions that we will
   catch.
 - Drastically simplify sizing code.  Special case refcnt zones since they
   will be going away.
 - Update stale comments.

Sponsored by:	EMC / Isilon Storage Division
2013-06-13 21:05:38 +00:00
Gleb Smirnoff
85dcf349c1 Convert UMA code to C99 uintXX_t types. 2013-04-09 17:43:48 +00:00
Gleb Smirnoff
04fc5741e0 Swap us_freecount and us_flags, achieving same structure size
as before previous commit.

Submitted by:	alc
2013-04-09 17:25:15 +00:00
Gleb Smirnoff
8cf455b8d9 Since now we support 256 items per slab, we need more bits
for us_freecount.

This grows uma_slab_head on 32-bit arches, but growth isn't
significant. Taking kmem zones as example, only the 32 byte
zone is affected, ipers is reduced from 113 to 112.

In collaboration with:	kib
2013-04-09 15:15:52 +00:00
Gleb Smirnoff
ad97af7ebd Merge from projects/counters: UMA_ZONE_PCPU zones.
These zones have slab size == sizeof(struct pcpu), but request from VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such
zone would have a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic value
of distance would allow us to make some optimizations later.

  To address private item from a CPU simple arithmetics should be used:

  item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

  These arithmetics are available as zpcpu_get() macro in pcpu.h.

  To introduce non-page size slabs a new field had been added to uma_keg
uk_slabsize. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, uma_keg fields
were a bit rearranged and least frequently used uk_name and uk_link moved
down to the fourth cache line. All other fields, that are dereferenced
frequently fit into first three cache lines.

Sponsored by:	Nginx, Inc.
2013-04-08 19:10:45 +00:00
Attilio Rao
a4915c21d9 Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva().  The new primitive reserves before hand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ.  More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
  allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
  serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
  by uma_small_alloc() rather than relying on the KVA / offset
  combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed.  This function is replaced by
   direct calls to mtx_init() as there is no need to export it anymore
   and the calls aren't either homogeneous anymore: there are now small
   differences between arguments passed to mtx_init().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (which also offered almost all the comments)
Tested by:	pho, jhb, davide
2013-02-26 23:35:27 +00:00
Gleb Smirnoff
936c747be0 Comment fix: there is no ub_ptr, instead explain meaning of uz_count
field verbally.
2012-12-21 10:09:45 +00:00
Pawel Jakub Dawidek
2f891cd504 Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.

Discussed on:	arch
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-12-07 22:27:13 +00:00
Matthew D Fleming
bb196eb480 Const-ify the zone name argument to uma_zcreate(9).
MFC after:	3 days
2012-10-26 17:51:05 +00:00
Alan Cox
342f1793ba 1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called.  Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc().  Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors.  Thus, a Boolean can no longer
effectively describe the state of the UMA allocator.  Instead, make "booted"
have three values to describe how far initialization has progressed.  This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM.  It has only been used between
r182028 and r204128.

Reviewed by:	attilio [1], nwhitehorn [3]
Tested by:	sbruno
2011-05-21 17:43:43 +00:00
Alan Cox
59d7277f4a Fix spelling errors. 2011-05-20 17:28:00 +00:00
Sean Bruno
bf96595915 Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by:	rwatson alc
Approved by:	scottl (mentor) peter
Obtained from:	Yahoo Inc.
2010-06-15 19:28:37 +00:00
Kip Macy
1a23373ceb - enable alignment on amd64 only
- only align pcpu caches and the volatile portion of uma_zone
2010-03-22 22:39:32 +00:00
Kip Macy
6b4391d786 turn 205266 in to a no-op until the problem can be properly diagnosed 2010-03-18 20:30:25 +00:00
Kip Macy
5e4bb93cca Cache line align various structures and move volatile counters to
not share a cache line with (mostly) immutable state

Reviewed by:	jeff@
MFC after:	7 days
2010-03-17 21:18:28 +00:00
Antoine Brodin
4e2d83fc4a Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.
MFC after:	1 month
2009-12-05 17:45:56 +00:00
Jeff Roberson
e20a199fd5 - Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
   This is useful for cases such as NUMA or different layouts for the same
   memory type.
 - Provide a new api for adding new backend kegs to secondary zones.
 - Provide a new flag for adjusting the layout of zones to stagger
   allocations better across cache lines.

Sponsored by:	Nokia
2009-01-25 09:11:24 +00:00
Robert Watson
6ab3b958fc Update stale comment on protecting UMA per-CPU caches: we now use
critical sections rather than mutexes.
2007-05-09 22:53:34 +00:00
Robert Watson
af17e9a922 Wrap inlines in uma_int.h in #ifdef _KERNEL so that uma_int.h can be
used from memstat_uma.c for the purposes of kvm access without lots
of additional unsafe includes.

MFC after:	3 days
2005-08-04 10:03:53 +00:00
Robert Watson
08ecce74bc Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by:	bde
MFC after:		1 week
2005-07-16 09:51:52 +00:00
Mike Silbersack
2018f30c01 Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.
2005-07-16 02:23:41 +00:00
Robert Watson
2019094a35 Track UMA(9) allocation failures by zone, and export via sysctl.
Requested by:	victor cruceru <victor dot cruceru at gmail dot com>
MFC after:	1 week
2005-07-15 23:34:39 +00:00
Robert Watson
7a52a97eb3 Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
  definition of MAXCPUs used in the stream, and the number of zone
  records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
  size, resource allocation limits, current pages allocated, preferred
  bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
  includes the number of allocations and frees, as well as the number
  of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
  series of type descriptions, each consisting of a type header
  followed by a series of MAXCPUs uma_percpu_stat structures holding
  per-CPU allocation information.  Typical values of MAXCPU will be
  1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user swpace buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days.  This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after:	1 week
2005-07-14 16:35:13 +00:00
Robert Watson
773df9ab16 In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after:	1 week
2005-07-14 16:17:21 +00:00
Alan Cox
eafc7b549a Increase UMA_BOOT_PAGES to prevent a crash during initialization. See
http://docs.FreeBSD.org/cgi/mid.cgi?42AD8270.8060906 for a detailed
description of the crash.

Reported by: Eric Anderson
Approved by: re (scottl)
MFC after: 3 days
2005-06-16 17:06:34 +00:00
Robert Watson
5d1ae027f0 Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP.  When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks.  We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access.  In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occured, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by:	bmilekic, jeff (somewhat)
Tested by:	rwatson, kris, gnn, scottl, mike at sentex dot net, others
2005-04-29 18:56:36 +00:00
Bosko Milekic
8076cb5289 Well, it seems that I pre-maturely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out.  This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h
2005-02-16 21:45:59 +00:00
Warner Losh
60727d8b86 /* -> /*- for license, minor formatting changes 2005-01-07 02:29:27 +00:00
Bosko Milekic
7b8712053c Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson
2004-12-26 00:35:12 +00:00
Olivier Houchard
6fc96493ac Remove useless casts. 2004-11-26 15:04:26 +00:00
Bosko Milekic
244f45548a Rework the way slab header storage space is calculated in UMA.
- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
  being allocated if the amount of calculated wasted space is less
  than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
  case).  If the amount of wasted space is >= UMA_MAX_WASTE, then
  UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
  separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
  (so that the offpage slab header zone(s) can be sized accordingly).
  The algorithm used to calculate this replaces the old calculation
  (which only happened to work coincidentally).  We now iterate over
  possible object sizes, starting from the smallest one, until we
  determine that wastedspace calculated in zone_small_init() might
  end up being greater than UMA_MAX_WASTE, at which point we use the
  found object size to compute the maximum possible ipers.  The
  reason this works is because:
      - wastedspace versus objectsize is a see-saw function with
        local minima all equal to zero and local maxima growing
        directly proportioned to objectsize.  This implies that
        for objects up to or equal a certain objectsize, the see-saw
        remains entirely below UMA_MAX_WASTE, so for those objectsizes
        it is impossible to ever go OFFPAGE for slab headers.
      - ipers (items-per-slab) versus objectsize is an inversely
        proportional function which falls off very quickly (very large
        for small objectsizes).
      - To determine the maximum ipers we'll ever need from OFFPAGE
        slab headers we first find the largest objectsize for which
        we are guaranteed to not go offpage for and use it to compute
        ipers (as though we were offpage).  Since the only objectsizes
        allowed to go offpage are bigger than the found objectsize,
        and since ipers vs objectsize is inversely proportional (and
        monotonically decreasing), then we are guaranteed that the
        ipers computed is always >= what we will ever need in offpage
        slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
  padded) size of each freelist index so that offset calculations are
  fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
2004-07-29 15:25:40 +00:00
Bosko Milekic
099a0e588c Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
Alan Cox
c19aa3402b Increase UMA_BOOT_PAGES because of changes to pv entry initialization in
revision 1.457 of i386/i386/pmap.c.
2004-01-18 05:51:06 +00:00
Alan Cox
925692caa5 - Significantly reduce the number of preallocated pv entries in
pmap_init().  Such a large preallocation is unnecessary and wastes
   nearly eight megabytes of kernel virtual address space per gigabyte
   of managed physical memory.
 - Increase UMA_BOOT_PAGES by two.  This enables the removal of
   pmap_pv_allocf().  (Note: this function was only used during
   initialization, specifically, after pmap_init() but before
   pmap_init2().  During pmap_init2(), a new allocator is installed.)
2003-12-22 01:01:32 +00:00
Jeff Roberson
9643769a3a - Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache.  This has several advantages.  Firstly, we never touch
   the per cpu queues now in the timeout handler.  This removes one more
   reason for having per cpu locks.  Secondly, it reduces the size of the zone
   by 8 bytes, bringing it under 200 bytes for a single proc x86 box.  This
   tidies up other logic as well.
 - The 'destroy' flag no longer needs to be passed to zone_drain() since it
   always frees everything in the zone's slabs.
 - cache_drain() is now only called from zone_dtor() and so it destroys by
   default.  It also does not need the destroy parameter now.
2003-09-19 23:27:46 +00:00
Jeff Roberson
3e0cab95c0 - Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
   address will be an even multiple of the size.
 - Remove disabled time delay code on uma_reclaim().  The comment there said
   it all.  It was not an effective strategy and it should not be left in
   #if 0'd for all eternity.
2003-09-19 23:04:44 +00:00
Jeff Roberson
b60f5b794e - Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly.  Previously this was not
   done so that flags for the same field would not be defined in two
   different files.  Add comments in each header instructing future
   developers on how now to shoot their feet.
 - Fix a test for !OFFPAGE which should have been a test for HASH.  This would
   have caused a panic if we had ever destructed a malloc zone.  This also
   opens up the possibility that other zones could use the vsetobj() method
   rather than a hash.
2003-09-19 08:37:44 +00:00
Jeff Roberson
cae33c1429 - Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
 - Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
   allocation.  These functions select the appropriate bucket zone to
   allocate from or free to.
 - Rename ub_ptr to ub_cnt to reflect a change in its use.  ub_cnt now reflects
   the count of free items in the bucket.  This gets rid of many unnatural
   subtractions by 1 throughout the code.
 - Add ub_entries which reflects the number of entries possibly held in a
   bucket.
2003-09-19 06:26:45 +00:00
Bosko Milekic
20e8e865bd - When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
  UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
  Add a KASSERT in zone_small_init to make sure that the computed
  ipers (items per slab) for the zone is not zero, despite the addition
  of the check, just to be sure (this part submitted by: silby)

- UMA_ZONE_VM used to imply BUCKETCACHE.  Now it implies
  CACHEONLY instead.  CACHEONLY is like BUCKETCACHE in the
  case of bucket allocations, but in addition to that also ensures that
  we don't setup the zone with OFFPAGE slab headers allocated from the
  slabzone.  This means that we're not allowed to have a UMA_ZONE_VM
  zone initialized for large items (zone_large_init) because it would
  require the slab headers to be allocated from slabzone, and hence
  kmem_map.  Some of the zones init'd with UMA_ZONE_VM are so init'd
  before kmem_map is suballoc'd from kernel_map, which is why this
  change is necessary.
2003-08-11 19:39:45 +00:00
Jeff Roberson
f828e5bedb - Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
   statistics this makes them more accurate while under heavy load.

Submitted by:	tegge
2003-07-30 05:59:17 +00:00
Bosko Milekic
d88797c2ba Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks.  This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.
2003-06-25 20:49:48 +00:00
Poul-Henning Kamp
c5d771b807 Prepend _ to internal union members to avoid ambiguity.
Found by:       FlexeLint
2003-05-31 19:52:15 +00:00
Jeff Roberson
48eea37508 - Add support for machine dependant page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by:	peter, jake
2002-11-01 01:01:27 +00:00
Jeff Roberson
f461cf2297 - Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.
2002-09-19 06:05:32 +00:00
Jeff Roberson
99571dc345 - Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
 - Stash the slab pointer in the vm page's object pointer when allocating from
   the kmem_obj.
 - Use the overloaded object pointer to find slabs for malloced memory.
2002-09-18 08:26:30 +00:00
Julian Elischer
e602ba25fd Part 1 of KSE-III
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by:	Almost everyone who counts
	(at various times, peter, jhb, matt, alfred, mini, bernd,
	and a cast of thousands)

	NOTE: this is still Beta code, and contains lots of debugging stuff.
	expect slight instability in signals..
2002-06-29 17:26:22 +00:00
Jeff Roberson
18aa2de5a7 - Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items.  It will not go ask the vm
  for pages.  This differs from M_NOWAIT in that it not only doesn't block, it
  doesn't even ask.

- Add a new zcreate option ZONE_VM, that sets the BUCKETCACHE zflag.  This
  tells uma that it should only allocate buckets out of the bucket cache, and
  not from the VM.  It does this by using the M_NOVM option to zalloc when
  getting a new bucket.  This is so that the VM doesn't recursively enter
  itself while trying to allocate buckets for vm_map_entry zones.  If there
  are already allocated buckets when we get here we'll still use them but
  otherwise we'll skip it.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
2002-06-17 22:02:41 +00:00
Jeff Roberson
28bc44195c Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in it's own
mutex class.  Currently this is only used for kmapentzone because kmapents
are are potentially allocated when freeing memory.  This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.
2002-04-29 23:45:41 +00:00
Jeff Roberson
af7f9b97b6 Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did.  The old zone
allocator didn't support WAITOK/NOWAIT though so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code.  This was needed to make the blocking
work properly anyway.
2002-04-14 01:56:25 +00:00
Jeff Roberson
1d4cb54ba8 Quiet witness warnings about acquiring several zone locks. In the case that
this happens it is OK.
2002-04-08 21:08:17 +00:00
Jeff Roberson
a553d4b8eb Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations.  Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket.  This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations.  This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.
2002-04-08 02:42:55 +00:00
Jeff Roberson
c235bfa551 Spelling correction; s/seperate/separate/g
Submitted by:	eric
2002-04-07 22:56:48 +00:00
John Baldwin
6008862bc2 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
Jeff Roberson
f22a4b62f5 Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag.  Remove the dup_list and dup_ok code from subr_witness.  Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface.  The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by:	jhb
2002-03-27 09:23:41 +00:00
Jeff Roberson
8355f576a9 This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by:	arch@
2002-03-19 09:11:49 +00:00