mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.
Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.
mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.
From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.
Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.
This change removes more code than it adds.
A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.
Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)
<netinet/tcp_var.h>'s prerequisites. Prerequistes should not grow for
userland headers, and <netinet/tcp_var.h> is unfortunately still needed
in userland.
where MB/s and tps statistics would always be zero, presumably because
they were being averaged out over the time between now and when the
system booted instead of a few seconds.
PR: 58683
Kernel:
Change statistics to use the *uptime() timescale (ie: relative to
boottime) rather than the UTC aligned timescale. This makes the
device statistics code oblivious to clock steps.
Change timestamps to bintime format, they are cheaper.
Remove the "busy_count", and replace it with two counter fields:
"start_count" and "end_count", which are updated in the down and
up paths respectively. This removes the locking constraint on
devstat.
Add a timestamp argument to devstat_start_transaction(), this will
normally be a timestamp set by the *_bio() function in bp->bio_t0.
Use this field to calculate duration of I/O operations.
Add two timestamp arguments to devstat_end_transaction(), one is
the current time, a NULL pointer means "take timestamp yourself",
the other is the timestamp of when this transaction started (see
above).
Change calculation of busy_time to operate on "the salami principle":
Only when we are idle, which we can determine by the start+end
counts being identical, do we update the "busy_from" field in the
down path. In the up path we accumulate the timeslice in busy_time
and update busy_from.
Change the byte_* and num_* fields into two arrays: bytes[] and
operations[].
Userland:
Change the misleading "busy_time" name to be called "snap_time" and
make the time long double since that is what most users need anyway,
fill it using clock_gettime(CLOCK_MONOTONIC) to put it on the same
timescale as the kernel fields.
Change devstat_compute_etime() to operate on struct bintime.
Remove the version 2 legacy interface: the change to bintime makes
compatibility far too expensive.
Fix a bug in systat's "vm" page where boot relative busy times would
be bogus.
Bump __FreeBSD_version to 500107
Review & Collaboration by: ken
ifstat Display the network traffic going through active interfaces
on the system. Idle interfaces will not be displayed until
they receive some traffic.
For each interface being displayed, the current, peak and
total statistics are displayed for incoming and outgoing
traffic. By default, the ifstat display will automatically
scale the units being used so that they are in a human-read-
able format. The scaling units used for the current and peak
traffic columns can be altered by the scale command.
Submitted by: Trent Nelson <trent@arpa.com>
non-default but reasonable values of hz this member overflowed,
breaking NFS over UDP.
Also, as long as I'm plowing up struct sockbuf ... Change certain
members from u_long/long to u_int/int in order to reduce wasted
space on 64-bit machines. This change was requested by Andrew
Gallatin.
Netstat and systat need to be rebuilt. I am incrementing
__FreeBSD_version in case any ports need to change.
when I changed the allocator bits. This implements per-CPU mbtypes
stats by keeping net number of decrements/increments of a given mbtype
per-CPU and then summing all of the per-CPU mbtypes to produce the total
net number of allocated mbufs of the given mbtype.
Counters are carefully balanced to avoid/prevent underflows/overflows.
mbtypes stats are re-enabled with the idea that we may occasionally
(although very rarely) observe slight inconsistencies in the stat
reporting. Most of the time, we should be fine, though.
Also make appropriate modifications to netstat(1) and systat(1) to do
the necessary reporting.
Submitted by: Jiangyi Liu <jyliu@163.net>
were indices in a dense array. The cpuids are a sparse set and treat
them as such, setting up containers only for CPUs activated during
mb_init().
- Fix netstat(1) and systat(1) to treat the per-CPU stats area as a sparse
map, in accordance with the above.
This allows us to properly boot with certain CPUs disactivated. However, if
we later decide to re-activate said CPUs, we will barf until we decide to
implement CPU spinon/spinoff callback hooks to allow for said CPUs' per-CPU
containers to get configured on their activation.
Reported by: mjacob
Partially (sys/ diffs) Submitted by: mjacob
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of resource
reclamation to be implemented in the future. Notable advantages:
o Reduce contention for SMP by offering per-CPU pools and locks.
o Better use of data cache due to per-CPU pools.
o Much less code cache pollution due to excessively large allocation macros.
o Framework for `grouping' objects from same page together so as to be able
to possibly free wired-down pages back to the system if they are no longer
needed by the network stacks.
Additional things changed with this addition:
- Moved some mbuf specific declarations and initializations from
sys/conf/param.c into mbuf-specific code where they belong.
- m_getclr() has been renamed to m_get_clrd() because the old name is really
confusing. m_getclr() HAS been preserved though and is defined to the new
name. No tree sweep has been done "to change the interface," as the old
name will continue to be supported and is not depracated. The change was
merely done because m_getclr() sounds too much like "m_get a cluster."
- TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and
systat(1) (see TODO below).
- Fixed systat(1) to display number of "free mbufs" based on new per-CPU
stat structures.
- Fixed netstat(1) to display new per-CPU stats based on sysctl-exported
per-CPU stat structures. All infos are fetched via sysctl.
TODO (in order of priority):
- Re-enable mbtypes statistics in both netstat(1) and systat(1) after
introducing an SMP friendly way to collect the mbtypes stats under the
already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
seems too costly for a mere stat update, especially when other locks are
already present).
- Optionally have systat(1) display not only "total free mbufs" but also
"total free mbufs per CPU pool."
- Fix minor length-fetching issues in netstat(1) related to recently
re-enabled option to read mbuf stats from a core file.
- Move reference counters at least for mbuf clusters into an unused portion
of the cluster itself, to save space and need to allocate a counter.
- Look into introducing resource freeing possibly from a kproc.
Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/
and compiler warnings.
The data for network statistics are still obtained via the kvm interface
if systat was started with the needed privileges, otherwise sysctls are
used. The reason for this is that with really many open sockets, the
sysctl method is probably slower, but it systat -netstat is probably not
really usable in either mode under these conditions.
Approved by: rwatson
base system, but not in BruceBSD.
o Fix up style violations of various sorts.
o Remove redundant normalization of hertz variable, as the sysctl handler
does this work (unlike when kread was used).
Submitted by: bde
no longer contains kernel specific data structures, but rather
only scalar values and structures that are already part of the
kernel/user interface, specifically rusage and rtprio. It no
longer contains proc, session, pcred, ucred, procsig, vmspace,
pstats, mtx, sigiolst, klist, callout, pasleep, or mdproc. If
any of these changed in size, ps, w, fstat, gcore, systat, and
top would all stop working. The new structure has over 200 bytes
of unassigned space for future values to be added, yet is nearly
100 bytes smaller per entry than the structure that it replaced.
and numvnodes are longs in the kernel. They should remain longs in systat,
what really needs to change is that they should be using SYSCTL_LONG rather
than SYSCTL_INT. I also changed wantfreevnodes to SYSCTL_LONG because I
happened to notice it.
I wish there was a way to find all of these automatically..
Pointed out by: bde
maxvnodes, numvnodes, freevnodes, nchstats, and numdirtybuffers.
o Make the hw.ncpu error checking code a little more rigorous by
sanity checking the returned data size.
o Didn't fix machine-dependent non-sysctl-exported variables:
intrnames, eintrnames, intrcnt, eintrcnt, as these variables are
defined and exported from machine-dependent kernel code in
assembly. This should probably be fixed somehow.
structure member that doesn't exist anymore.
Use getsysctlbyname for kern.ipc.mbstat instead of sysctl.
Use netstat's method of displaying values from mtnames.
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
Missed by PR: 19809
track.
The $Id$ line is normally at the bottom of the main comment block in the
man page, separated from the rest of the manpage by an empty comment,
like so;
.\" $Id$
.\"
If the immediately preceding comment is a @(#) format ID marker than the
the $Id$ will line up underneath it with no intervening blank lines.
Otherwise, an additional blank line is inserted.
Approved by: bde
generation was causing unaligned access faults on the Alpha.
I have incremented the devstat version number, since this is an interface
change. You'll need to recompile libdevstat, systat, iostat, vmstat and
rpc.rstatd along with your kernel.
Partially Submitted by: Andrew Gallatin <gallatin@cs.duke.edu>
o Add more checks for buffer overflows
o Use snprintf rather than strcat/cpy and have better checks for max
length exceeded.
Most of these changes are not exploitable buffer overruns, but it never
hurts to be safe.
Inspired by and obtained from: OpenBSD
in netstat-mode to avoid a conflict with tcp-mode. Also
while documenting this new feature in the manpage, fix a
minor display nit.
PR: 5159
Submitted by: Sergei Chechetkin <csl@whale.sunbay.crimea.ua>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
date: 1994/10/09 07:37:18; author: davidg; state: Exp; lines: +7 -1
#if 0'd out the meat of the swap code until I get a chance to rewrite it.
...mainly by stealing the code from pstat(8).
of mbufs in use. If the number reached, e.g., 4 digits, then later
decreased to 3 digits, the last digit of the 4-digit number was
not erased. This caused the display to show a wildly high number of
mbufs in use.
isn't used in systat or in the kernel (it was replaced by a sysctl()
call involving VM_METER) and will go away when I clean up bogus
common variables in the kernel.
I got irritated with not seeing the interrupt numbers in a (crowded)
"systat -vmstat" display, so I fixed it. Here is a patch to please be
applied in src/usr.bin/systat
Use the correct value of hz (stathz if it is nonzero) for
interpretion of dk_time[] and cp_time[] in iostat.c. Avoid
multiple conversions of this value in iostat.c and vmstat.c
iostat.c:
Implement the display of cp_time[CP_INTR]. Fix the display
of cp_time[CP_IDLE] (the display was always null because
cp_time[CP_INTR] == 0 was displayed instead).
systat.1:
Document the display of cp_time[CP_INTR].
vmstat.c:
Implement the display of cp_time[CP_INTR].
DANGER WILL ROBINSON!
_PATH_UNIX is currently defined as the literal string "don't use this".
I am of two minds about this myself, but wanted to get something into the
tree as quickly as possible.