Commit Graph

182 Commits

Author SHA1 Message Date
avg
2e73196837 add kmem_map_free sysctl: query largest contiguous free range in kmem_map
Suggested by:	alc
Reviewed by:	alc
MFC after:	1 week
2010-10-09 09:03:17 +00:00
avg
7010764d95 vm.kmem_map_size: a sysctl to query current kmem_map->size
Based on a patch from Sandvine Incorporated via emaste.

Reviewed by:	emaste
MFC after:	1 week
2010-10-07 18:11:33 +00:00
avg
54a87db1fb kmem_size* sysctls: hint that these are also tunables
MFC after:	1 week
2010-09-30 16:45:27 +00:00
mdf
5695ef4698 Re-add r212370 now that the LOR in powerpc64 has been resolved:
Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte.  This behaviour was preserved, though it should not be
necessary.

Reviewed by:    phk (original patch)
2010-09-16 16:13:12 +00:00
mdf
3ed6eac561 Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn
2010-09-13 18:48:23 +00:00
mdf
bc54684253 Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte.  This behaviour was preserved, though it should not be necessary.

Reviewed by:	phk
2010-09-09 18:33:46 +00:00
mdf
42170bf6d6 The realloc case for memguard(9) will copy too many bytes when
reallocating to a smaller-sized allocation.  Fix this issue.

Noticed by:     alc
Reviewed by:    alc
Approved by:    zml (mentor)
MFC after:      3 weeks
2010-08-31 16:57:58 +00:00
mdf
0737955344 Rework memguard(9) to reserve significantly more KVA to detect
use-after-free over a longer time.  Also release the backing pages of
a guarded allocation at free(9) time to reduce the overhead of using
memguard(9).  Allow setting and varying the malloc type at run-time.
Add knobs to allow:

 - randomly guarding memory
 - adding un-backed KVA guard pages to detect underflow and overflow
 - a lower limit on the size of allocations that are guarded

Reviewed by:    alc
Reviewed by:    brueffer, Ulrich Spörlein <uqs spoerlein net> (man page)
Silence from:   -arch
Approved by:    zml (mentor)
MFC after:      1 month
2010-08-11 22:10:37 +00:00
mdf
6857471cf3 Add MALLOC_DEBUG_MAXZONES debug malloc(9) option to use multiple uma
zones for each malloc bucket size.  The purpose is to isolate
different malloc types into hash classes, so that any buffer overruns
or use-after-free will usually only affect memory from malloc types in
that hash class.  This is purely a debugging tool; by varying the hash
function and tracking which hash class was corrupted, the intersection
of the hash classes from each instance will point to a single malloc
type that is being misused.  At this point inspection or memguard(9)
can be used to catch the offending code.

Add MALLOC_DEBUG_MAXZONES=8 to -current GENERIC configuration files.
The suggestion to have this on by default came from Kostik Belousov on
-arch.

This code is based on work by Ron Steinke at Isilon Systems.

Reviewed by:    -arch (mostly silence)
Reviewed by:    zml
Approved by:    zml (mentor)
2010-07-28 15:36:12 +00:00
ed
76489ac1ea Use ISO C99 integer types in sys/kern where possible.
There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.
2010-06-21 09:55:56 +00:00
brian
3f1308d2b5 If we're passed garbage in malloc_init(), panic() rather than expecting
a KASSERT to handle it.  People are likely to turn off INVARIANTS RSN
and loading an old module can cause garbage-in here.

I saw the issue with an older nvidia driver (x11/nvidia-driver) loading
into a new kernel - a crash wasn't seen 'till sysctl_kern_malloc_stats().
I was lucky that mtp->ks_shortdesc was NULL and not something horrible.

While I'm here, KASSERT that malloc_uninit() isn't passed something that's
not in kmemstatistics.

MFC after:	3 weeks
2009-06-05 09:16:52 +00:00
imp
66ca9cb573 Retire kern.vm.kmem.size. It was marked as obsolete prior to 5.2, so
it can go.
2009-05-09 19:00:47 +00:00
rwatson
367054e0a3 struct malloc_type has had a 'magic' field statically initialized to
M_MAGIC by MALLOC_DEFINE() for a long time; add assertions that
malloc_type's passed to malloc(), free(), etc have that magic set.

MFC after:	2 weeks
2009-04-19 12:41:37 +00:00
ed
b3ddcfe1f7 Remove even more unneeded variable assignments.
kern_time.c:
- Unused variable `p'.

kern_thr.c:
- Variable `error' is always caught immediately, so no reason to
  initialize it. There is no way that error != 0 at the end of
  create_thread().

kern_sig.c:
- Unused variable `code'.

kern_synch.c:
- `rval' is always assigned in all different cases.

kern_rwlock.c:
- `v' is always overwritten with RW_UNLOCKED further on.

kern_malloc.c:
- `size' is always initialized with the proper value before being used.

kern_exit.c:
- `error' is always caught and returned immediately. abort2() never
  returns a non-zero value.

kern_exec.c:
- `len' is always assigned inside the if-statement right below it.

tty_info.c:
- `td' is always overwritten by FOREACH_THREAD_IN_PROC().

Found by:	LLVM's scan-build
2009-02-26 15:51:54 +00:00
jeff
69d1bd8670 - Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
   This is useful for cases such as NUMA or different layouts for the same
   memory type.
 - Provide a new api for adding new backend kegs to secondary zones.
 - Provide a new flag for adjusting the layout of zones to stagger
   allocations better across cache lines.

Sponsored by:	Nokia
2009-01-25 09:11:24 +00:00
alc
c016906f4e Enable the creation of a kmem map larger than 4GB.
Submitted by: Tz-Huan Huang

Make several variables related to kmem map auto-sizing static.
Found by: CScout
2008-07-05 19:34:33 +00:00
alc
b7d6153751 Correct an error in the comments for init_param3().
Discussed with: silby
2008-07-04 19:36:58 +00:00
jb
8c4eed9aad Add support for the DTrace malloc provider which can enable probes
on a per-malloc type basis.
2008-05-23 00:43:36 +00:00
alc
c251140c26 Introduce a new parameter "superpage_align" to kmem_suballoc() that is
used to request superpage alignment for the submap.

Request superpage alignment for the kmem_map.

Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find().  (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)

Remove a stale comment from kmem_malloc().
2008-05-10 21:46:20 +00:00
rwatson
877d7c65ba In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
rwatson
e09b4894c7 Use vm_offset_t for kmembase and kmemlimit rather than char *, avoiding
unnecessary casts, and making it possible to compile kern_malloc.c with
strict aliasing.

Submitted by:	rdivacky
Approved by:	re (kensmith)
2007-06-27 13:39:38 +00:00
rwatson
680f293154 Spell statistics more correctly in comments. 2007-06-14 03:02:33 +00:00
attilio
7dd8ed88a9 Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
rwatson
048e4018c8 Remove #if 0'd check for 0-size allocations, which if enabled, called
kdb_enter().
2007-05-27 13:13:46 +00:00
jeff
e1996cb960 - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
sepotvin
a1e73b1eaf Add support for specifying a minimal size for vm.kmem_size in the loader via
vm.kmem_size_min. Useful when using ZFS to make sure that vm.kmem size will
be at least 256mb (for example) without forcing a particular value via vm.kmem_size.

Approved by: njl (mentor)
Reviewed by: alc
2007-04-21 01:14:48 +00:00
rwatson
55067e6cf7 Increase usefulness of "show malloc" by moving from displaying the basic
counters of allocs/frees/use for each malloc type to calculating InUse,
MemUse, and Requests as displayed by the userspace vmstat -m.  This is
more useful when debugging malloc(9)-related memory leaks, where the
count of allocs/frees may not usefully reflect that current memory
allocation (i.e., when highly variable size allocations occur with the
same malloc type, such as with contigmalloc).

MFC after:			3 days
Limitations observed by:	scottl
2006-10-26 10:17:13 +00:00
rwatson
711726cd79 Remove old kern.malloc sysctl, which generated a text representation of
the kernel malloc(9) state for vmstat -m.  libmemstat is now used to
generate a machine-readable version which is converged by vmstat -m
into a human-readable version.

Not for MFC.
2006-07-23 19:55:41 +00:00
rwatson
91cb1c84be Expand comments for malloc(9) to better describe the design and
statistics / memory types model.
2006-07-23 19:51:39 +00:00
ps
00f6401a91 Fix bug in malloc_uninit():
Releasing items from the mt_zone can not be done by a simple
uma_zfree() call since mt_zone is allocated with the UMA_ZONE_MALLOC
flag. Use uma_zfree_arg instead and supply the slab.

This bug caused panics in low memory situations on unloading kernel
modules containing MALLOC_DEFINE(..) statements.

Submitted by:	ups
2006-03-03 22:36:52 +00:00
pjd
645dd7b662 Add buffer corruption protection (RedZone) for kernel's malloc(9).
It detects both: buffer underflows and buffer overflows bugs at runtime
(on free(9) and realloc(9)) and prints backtraces from where memory was
allocated and from where it was freed.

Tested by:	kris
2006-01-31 11:09:21 +00:00
pjd
3cc29e6ebf Improve memguard a bit:
- Provide tunable vm.memguard.desc, so one can specify memory type without
  changing the code and recompiling the kernel.
- Allow to use memguard for kernel modules by providing sysctl
  vm.memguard.desc, which can be changed to short description of memory
  type before module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctl there.
- Add malloc_desc2type() function for finding memory type based on its
  short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
  doesn't exist (will be loaded later) or when no memory is allocated yet.
  If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of memory types comparsion and make safer/slower the
  default.
2005-12-30 11:45:07 +00:00
pjd
6617f46a01 In realloc(9), determine size of the original block based on
UMA_SLAB_MALLOC flag.
In some circumstances (I observed it when I was doing a lot of reallocs)
UMA_SLAB_MALLOC can be set even if us_keg != NULL.

If this is the case we have wonderful, silent data corruption, because less
data is copied to the newly allocated region than should be.

I'm not sure when this bug was introduced, it could be there undetected
for years now, as we don't have a lot of realloc(9) consumers and it was
hard to reproduce it...
...but what I know for sure, is that I don't want to know who introduce
the bug:) It took me two/three days to track it down (of course most of
the time I was looking for the bug in my own code).
2005-12-28 01:53:13 +00:00
pjd
445c73f269 Detect memory leaks when memory type is being destroyed.
This is very helpful for detecting kernel modules memory leaks on unload.

Discussed and reviewed by:	rwatson
2005-11-03 13:48:59 +00:00
rwatson
f604e28c4e Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by:	pjd
2005-10-20 21:28:31 +00:00
rwatson
dc9d32cb6e Add a "show malloc" command to DDB, which prints out the current stats for
available kernel malloc types.  Quite useful for post-mortem debugging of
memory leaks without a dump device configured on a panicked box.

MFC after:	2 weeks
2005-10-20 17:41:47 +00:00
ru
c7ff05fb06 Long overdue, keep up with mbuf.h,v 1.148. 2005-08-02 20:03:23 +00:00
pjd
9c0918e899 Fix the way how "InUse" column in 'vmstat -m' output works:
- increase number of allocations count only on successfull malloc(9),
  so it doesn't confuse people;
- because we need to check if 'size > 0', hide 'mtsp->mts_memalloced += size;'
  under the check as well, as for size=0 it is of course a no-op;
- avoid critical_enter()/critical_exit() in case of failure in
  malloc_type_allocated() as there will be nothing to do.

OK'ed by:	rwatson
MFC after:	2 days
2005-07-27 23:17:31 +00:00
rwatson
83059d9b07 Correct build on 64-bit: cast u_int64_t to (unsigned long long) before
printfing as (unsigned long long).  32-bit build on i386 didn't notice
this.  Whoops.

Reported by:	arved
Tested by:	sledge
2005-07-14 15:21:18 +00:00
rwatson
7a35a61825 Introduce a new sysctl, kern.malloc_stats, which exports kernel malloc
statistics via a binary structure stream:

- Add structure 'malloc_type_stream_header', which defines a stream
  version, definition of MAXCPUS used in the stream, and a number of
  malloc_type records in the stream.

- Add structure 'malloc_type_header', which defines the name of the
  malloc type being reported on.

- When the sysctl is queried, return a stream header, followed by a
  series of type descriptions, each consisting of a type header
  followed by a series of MAXCPUS malloc_type_stats structures holding
  per-CPU allocation information.  Typical values of MAXCPUS will be 1
  (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here:

- Bump statistics width to uint64_t, and hard code using fixed-width
  type in order to be more sure about structure layout in the stream.
  We allocate and free a lot of memory.

- Add kmemcount, a counter of the number of registered malloc types,
  in order to avoid excessive manual counting of types.  Export via a
  new sysctl to allow user-space code to better size buffers.

- De-XXX comment on no longer maintaining the high watermark in old
  sysctl monitoring code.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days.  Likewise, similar changes
to UMA.
2005-07-14 11:52:06 +00:00
kensmith
82cf72b8bb Remove a variable that became unused as a result of changes made
in v1.139.  This was only exposed if MALLOC_PROFILE was defined.

Submitted by:	Gary Jennejohn
Pointy hat:	rwatson
Approved by:	re (scottl)
2005-06-16 16:01:46 +00:00
jkoshy
b195d18520 Fix typo.
Reviewed by:	rwatson, sam
2005-06-10 18:06:59 +00:00
rwatson
7035f9f56a Kernel malloc layers malloc_type allocation over one of two underlying
allocators: a set of power-of-two UMA zones for small allocations, and the
VM page allocator for large allocations.  In order to maintain unified
statistics for specific malloc types, kernel malloc maintains a separate
per-type statistics pool, which can be monitored using vmstat -m.  Prior
to this commit, each pool of per-type statistics was protected using a
per-type mutex associated with the malloc type.

This change modifies kernel malloc to maintain per-CPU statistics pools
for each malloc type, and protects writing those statistics using critical
sections.  It also moves to unsynchronized reads of per-CPU statistics
when generating coalesced statistics.  To do this, several changes are
implemented:

- In the previous world order, the statistics memory was allocated by
  the owner of the malloc type structure, allocated statically using
  MALLOC_DEFINE().  This embedded the definition of the malloc_type
  structure into all kernel modules.  Move to a model in which a pointer
  within struct malloc_type points at a UMA-allocated
  malloc_type_internal data structure owned and maintained by
  kern_malloc.c, and not part of the exported ABI/API to the rest of
  the kernel.  For the purposes of easing a possible MFC, re-use an
  existing pointer in 'struct malloc_type', and maintain the current
  malloc_type structure size, as well as layout with respect to the
  fields reused outside of the malloc subsystem (such as ks_shortdesc).
  There are several unused fields as a result of no longer requiring
  the mutex in malloc_type.

- Struct malloc_type_internal contains an array of malloc_type_stats,
  of size MAXCPU.  The structure defined above avoids hard-coding a
  kernel compile-time value of MAXCPU into kernel modules that interact
  with malloc.

- When accessing per-cpu statistics for a malloc type, surround read -
  modify - update requests with critical_enter()/critical_exit() in
  order to avoid races during write.  The per-CPU fields are written
  only from the CPU that owns them.

- Per-CPU stats now maintained "allocated" and "freed" counters for
  number of allocations/frees and bytes allocated/freed, since there is
  no longer a coherent global notion of the totals.  When coalescing
  malloc stats, accept a slight race between reading stats across CPUs,
  and avoid showing the user a negative allocation count for the type
  in the event of a race.  The global high watermark is no longer
  maintained for a malloc type, as there is no global notion of the
  number of allocations.

- While tearing up the sysctl() path, also switch to using sbufs.  The
  current "export as text" sysctl format is retained with the same
  syntax.  We may want to change this in the future to export more
  per-CPU information, such as how allocations and frees are balanced
  across CPUs.

This change results in a substantial speedup of kernel malloc and free
paths on SMP, as critical sections (where usable) out-perform mutexes
due to avoiding atomic/bus-locked operations.  There is also a minor
improvement on UP due to the slightly lower cost of critical sections
there.  The cost of the change to this approach is the loss of a
continuous notion of total allocations that can be exploited to track
per-type high watermarks, as well as increased complexity when
monitoring statistics.

Due to carefully avoiding changing the ABI, as well as hardening the ABI
against future changes, it is not necessary to recompile kernel modules
for this change.  However, MFC'ing this change to RELENG_5 will require
also MFC'ing optimizations for soft critical sections, which may modify
exposed kernel ABIs.  The internal malloc API is changed, and
modifications to vmstat in order to restore "vmstat -m" on core dumps will
follow shortly.

Several improvements from:		bde
Statistics approach discussed with:	ups
Tested by:				scottl, others
2005-05-29 13:38:07 +00:00
rwatson
005d2e3810 Consistently style function declarations in kern_malloc.c.
MFC after:	3 days
2005-04-12 23:54:34 +00:00
bmilekic
da7116f3ac Bring in MemGuard, a very simple and small replacement allocator
designed to help detect tamper-after-free scenarios, a problem more
and more common and likely with multithreaded kernels where race
conditions are more prevalent.

Currently MemGuard can only take over malloc()/realloc()/free() for
particular (a) malloc type(s) and the code brought in with this
change manually instruments it to take over M_SUBPROC allocations
as an example.  If you are planning to use it, for now you must:

	1) Put "options DEBUG_MEMGUARD" in your kernel config.
	2) Edit src/sys/kern/kern_malloc.c manually, look for
	   "XXX CHANGEME" and replace the M_SUBPROC comparison with
	   the appropriate malloc type (this might require additional
	   but small/simple code modification if, say, the malloc type
	   is declared out of scope).
	3) Build and install your kernel.  Tune vm.memguard_divisor
	   boot-time tunable which is used to scale how much of kmem_map
	   you want to allott for MemGuard's use.  The default is 10,
	   so kmem_size/10.

ToDo:
	1) Bring in a memguard(9) man page.
	2) Better instrumentation (e.g., boot-time) of MemGuard taking
	   over malloc types.
	3) Teach UMA about MemGuard to allow MemGuard to override zone
	   allocations too.
	4) Improve MemGuard if necessary.

This work is partly based on some old patches from Ian Dowse.
2005-01-21 18:09:17 +00:00
imp
20280f1431 /* -> /*- for copyright notices, minor format tweaks as necessary 2005-01-06 23:35:40 +00:00
des
f665f60342 Turn VM_KMEM_SIZE_MAX and VM_KMEM_SIZE_SCALE into tunables.
MFC after:	3 days
2004-09-29 14:21:40 +00:00
green
c4b4a8f048 Reimplement contigmalloc(9) with an algorithm which stands a greatly-
improved chance of working despite pressure from running programs.
Instead of trying to throw a bunch of pages out to swap and hope for
the best, only a range that can potentially fulfill contigmalloc(9)'s
request will have its contents paged out (potentially, not forcibly)
at a time.

The new contigmalloc operation still operates in three passes, but it
could potentially be tuned to more or less.  The first pass only looks
at pages in the cache and free pages, so they would be thrown out
without having to block.  If this is not enough, the subsequent passes
page out any unwired memory.  To combat memory pressure refragmenting
the section of memory being laundered, each page is removed from the
systems' free memory queue once it has been freed so that blocking
later doesn't cause the memory laundered so far to get reallocated.

The page-out operations are now blocking, as it would make little sense
to try to push out a page, then get its status immediately afterward
to remove it from the available free pages queue, if it's unlikely to
have been freed.  Another change is that if KVA allocation fails, the
allocated memory segment will be freed and not leaked.

There is a sysctl/tunable, defaulting to on, which causes the old
contigmalloc() algorithm to be used.  Nonetheless, I have been using
vm.old_contigmalloc=0 for over a month.  It is safe to switch at
run-time to see the difference it makes.

A new interface has been used which does not require mapping the
allocated pages into KVA: vm_page.h functions vm_page_alloc_contig()
and vm_page_release_contig().  These are what vm.old_contigmalloc=0
uses internally, so the sysctl/tunable does not affect their operation.

When using the contigmalloc(9) and contigfree(9) interfaces, memory
is now tracked with malloc(9) stats.  Several functions have been
exported from kern_malloc.c to allow other subsystems to use these
statistics, as well.  This invalidates the BUGS section of the
contigmalloc(9) manpage.
2004-07-19 06:21:27 +00:00
marcel
a9ad69d5af Update for the KDB framework:
o  Make debugging code conditional upon KDB instead of DDB.
o  Call kdb_enter() instead of Debugger().
o  Call kdb_backtrace() instead of db_print_backtrace() or backtrace().

kern_mutex.c:
o  Replace checks for db_active with checks for kdb_active and make
   them unconditional.

kern_shutdown.c:
o  s/DDB_UNATTENDED/KDB_UNATTENDED/g
o  s/DDB_TRACE/KDB_TRACE/g
o  Save the TID of the thread doing the kernel dump so the debugger
   knows which thread to select as the current when debugging the
   kernel core file.
o  Clear kdb_active instead of db_active and do so unconditionally.
o  Remove backtrace() implementation.

kern_synch.c:
o  Call kdb_reenter() instead of db_error().
2004-07-10 21:36:01 +00:00
bmilekic
f7574a2276 Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00