I had commented out the #ifdef INVARIANTS checks to make sure I ran this
code in all kernels, and I forgot to put the #ifdefs back in before I
committed.
Spotted by: bmilekic
[1] PHCC = Pointy Hat Correction Commit
compile-time constants). That is, a "bucket" is no longer necessarily
a page worth of mbufs or clusters; it is MBUF_BUCK_SZ worth of mbufs
or CLUS_BUCK_SZ worth of clusters (see the sketch after this list).
o Rename {mbuf,clust}_limit to {mbuf,clust}_hiwm and introduce
{mbuf,clust}_lowm, which currently has no effect but will be used
to set the low watermarks.
o Fix netstat so that it can deal with the differently-sized buckets
and teach it about the low watermarks too.
o Make sure the per-cpu stats for an absent CPU explicitly have
mb_active set to 0.
o Get rid of the allocate-refcounts-from-mbuf-map mess. Instead,
just malloc() the refcounts in one shot from mbuf_init().
o Clean up / update comments in subr_mbuf.c
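A minimal sketch of the first item above, with assumed values (the real
constants are defined in subr_mbuf.c):

    /* Sketch only: the values below are assumptions, not the real ones. */
    #define MBUF_BUCK_SZ    (PAGE_SIZE * 2)             /* assumed */
    #define CLUS_BUCK_SZ    (PAGE_SIZE * 4)             /* assumed */
    #define NMB_MBUF_BUCK   (MBUF_BUCK_SZ / MSIZE)      /* mbufs/bucket */
    #define NMB_CLUS_BUCK   (CLUS_BUCK_SZ / MCLBYTES)   /* clusters/bucket */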
reference counter array for mbuf clusters. I don't know
how this got past early testing or how it survived so long
without getting caught. If anyone was seeing really, really
bizarre memory corruption in a few mbufs, this would be why.
opposed to returning the top of the old chain when there was one and
the top of the newly allocated chain if there was no old chain.
Actually, it should be noted that prior to this fix, although the
comment above m_getm() advertised that m_getm() would return the
top of the old chain (if an old chain was being passed in), it
actually [wrongly] returned the tail mbuf of the old chain
instead. This was a bug, but since the one use of m_getm() in
the tree luckily did not depend on that behavior, it happened
to work out without notice.
Harti Brandt pointed out that the advertised behavior did not
match the real behavior, and so this change makes m_getm() ALWAYS
return the newly allocated chain (and fixes the comment). This
is less confusing and is the best course of action, as the
caller is then always able to have both a reference to the top of
the original chain (because it passes it in the call) and
a reference to the newly attached chain. Although the API is
slightly modified, I don't think that any third-party code uses
m_getm(), and if it does, it surely can't be working properly,
because the old behavior was bogus.
API bug pointed out by: Harti Brandt <brandt@fokus.fraunhofer.de>
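A hedged fragment sketch of the corrected contract; the variable names
and error handling are illustrative, not from the commit:

    int len = 2048;                 /* bytes needed (illustrative) */
    struct mbuf *top = NULL;        /* existing chain; may be NULL */
    struct mbuf *newm;

    newm = m_getm(top, len, M_DONTWAIT, MT_DATA);
    if (newm == NULL)
            return (ENOBUFS);
    /*
     * 'top' (when non-NULL) still references the head of the original
     * chain; 'newm' is the head of the newly allocated chain that
     * m_getm() attached to it.
     */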
o Allow callers of m_extadd() to allocate their own reference
m_ext.ref_cnt pointer, rather than having the mbuf system allocate it
with a malloc() in the critical path. This speeds m_extadd() up, and
also simplifies locking (malloc() may need Giant).
A driver or subsystem wishing to use its own ref counter must
initialize m_ext.ref_cnt to point to its ref counter prior to
calling m_extadd(), and it must use EXT_EXTREF as its external type.
Eg:
m->m_ext.ref_cnt = my_ref_cnt_ptr;
m_extadd(.....,EXT_EXTREF);
Reviewed by: bosko
on this.
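A slightly fuller fragment sketch of the pattern; my_buf, my_size,
my_arg and my_free() are hypothetical names, and the argument order
follows the m_extadd() prototype of this era:

    static u_int my_ref_cnt;            /* caller-owned reference count */

    static void
    my_free(void *buf, void *arg)       /* called when the last ref drops */
    {
            /* return 'buf' to its owner */
    }

    /* before attaching the external buffer: */
    m->m_ext.ref_cnt = &my_ref_cnt;
    m_extadd(m, my_buf, my_size, my_free, my_arg, 0, EXT_EXTREF);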
o Update the `cur' pointer in the cluster loop in m_getm() to avoid
incorrect truncation and leaked mbufs.
Reviewed by: bmilekic
Approved by: re
contiguous space was being allocated from the clust_map
instead of from the mbuf_map, as the comments indicated it should be.
This resulted in some address space wastage in mbuf_map.
Submitted by: Rohit Jalan <rohjal@yahoo.co.in>
o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
(see the sketch after this entry)
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version
Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month
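A hedged fragment sketch of the tag API described above; MTAG_EXAMPLE
and struct my_meta are invented for illustration, while the m_tag_*
routines are the ones this change brings in:

    #define MTAG_EXAMPLE    0x12345678          /* hypothetical ABI cookie */
    struct my_meta { u_int32_t value; };        /* hypothetical payload */

    struct m_tag *mt;

    mt = m_tag_alloc(MTAG_EXAMPLE, 0, sizeof(struct my_meta), M_NOWAIT);
    if (mt != NULL) {
            ((struct my_meta *)(mt + 1))->value = 42;   /* data follows tag */
            m_tag_prepend(m, mt);       /* attach to the packet header mbuf */
    }
    /* openbsd-compatible lookup under the well-known MTAG_ABI_COMPAT: */
    mt = m_tag_find(m, 0, NULL);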
type of the 'flags' argument m_getcl() was using anyway; m_extadd()
needed to be changed to accept an int instead of a short for 'flags.'
This makes things more consistent and also gives us more bits to
use for m_flags in the future (we have almost run out).
Requested by: sam (Sam Leffler)
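The prototypes implied by the text, as a sketch (argument names and
exact pointer types assumed):

    struct mbuf *m_getcl(int how, short type, int flags);
    void m_extadd(struct mbuf *mb, caddr_t buf, u_int size,
        void (*freef)(void *, void *), void *args, int flags, int type);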
apologize to those of you who may have been seeing crashes in
code that uses sendfile(2) or other types of external buffers
with mbufs.
Pointed out by, and provided trace:
Niels Chr. Bank-Pedersen <ncbp at bank-pedersen.dk>
argument, not the 'type' argument. As a result of the bug, the
MAC label on some packet header mbufs might not be set in mbufs
allocated using m_getcl(), resulting in a page fault.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
the inits/destroys are done without the cache locks held even in the
persistent-lock calls. I may be cheating a little by using the MAC
"already initialized" flag for now.
kernel access control.
Invoke the necessary MAC entry points to maintain labels on header
mbufs. In particular, invoke entry points during the two mbuf
header allocation cases, and the mbuf freeing case. Pass the "how"
argument at allocation time to the MAC framework so that it can
determine if it is permitted to block (as with policy modules),
and permit the initialization entry point to fail if it needs to
allocate memory but is not permitted to, failing the mbuf
allocation.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
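A hedged fragment sketch of the allocation-time hook described above;
mac_init_mbuf() is the entry point of this era, but the exact call site
and flag handling are assumptions:

    /* in the pkthdr-mbuf allocation path: */
    if (mac_init_mbuf(m, how) != 0) {
            /* the policy needed memory but was not permitted to block */
            m_free(m);
            return (NULL);
    }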
While I don't think this is the best solution, it certainly is the
fastest, and in trying to find bottlenecks in network-related code
I want this out of the way so that I don't have to think about it.
What this means, for mbuf clusters anyway, is:
- one less malloc() for every cluster allocation (replaced with
a relatively quick calculation + assignment)
- no more free() in the cluster free case (replaced with empty space) :-)
This can offer a substantial throughput improvement, but it may not for
all cases. It is particularly noticeable for larger buffer sends/recvs.
See http://people.freebsd.org/~bmilekic/code/measure2.txt for a rough
idea.
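A hedged fragment sketch of the "calculation + assignment"; the counter
array and base names are assumptions, but the idea is to index a
preallocated counter array by the cluster's offset within its map:

    m->m_ext.ref_cnt = &cl_refcntmap[
        (m->m_ext.ext_buf - clust_base) / MCLBYTES];
    *(m->m_ext.ref_cnt) = 1;            /* fresh cluster, one reference */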
of the inlines, like its cousin, m_free(). Also, make a small (first
step?) optimisation of m_free() to use the MBP_PERSIST{,ENT} interface
to hold the lock across frees when possible. The thing is that right
now, we can only do this easily across at most one mbuf + one
cluster free, as the comment mentions (it also explains why). Anyway,
some basic tests revealed a 5-10% overall improvement. Some of the
results can be found here:
http://people.freebsd.org/~bmilekic/code/measure.txt
is that grouped frees will be done as often as possible without
dropping the cache lock in between. So, for the most part, they'll be
done without the lock being dropped. This is particularly true if you
have something that does a grouped m_getm() or m_getcl() (a cluster and
mbuf at the same time) - most likely getting the buffers from the
same per-CPU cache - and then frees them with m_free{,m}(). Unless
the buffers' underlying buckets were moved, the free will be done without
the lock getting dropped in between. So far, only m_free() has been
shown how to do this, and m_freem() will shortly follow.
Since I'm here, I also fixed a small (but mostly harmless) type-mismatch
introduced in the last commit.
and a cluster in one shot.
o Introduce MBP_PERSIST and MBP_PERSISTENT control bits to mb_alloc();
MBP_PERSIST means "if you can allocate, then keep the cache lock
held on exit," and MBP_PERSISTENT means "a cache lock is already held
on entry, so allocate from the specified (already locked) cache."
They may be used in combination.
o m_getcl() uses the MBP_PERSIST/MBP_PERSISTENT interface so that it
doesn't drop the cache lock in between the mbuf and cluster allocations.
o m_getm(), which takes a size and allocates an mbuf + cluster "best fit"
chain, has been moved from uipc_mbuf.c to subr_mbuf.c and shown how to
use MBP_PERSIST/MBP_PERSISTENT to attempt to do a grouped allocation
without dropping the cache lock in between.
Why this is good: far fewer bus-locked lock acquires/releases when
they're not needed. Also, the prototype for m_getcl():
struct mbuf * m_getcl(int how, short type, int flags);
"how" and "type" are self-explanatory. "flags" may be M_PKTHDR, in
which case m_getcl() will make the mbuf a pkthdr-mbuf.
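For contrast, a sketch of the old two-step pattern next to the new
grouped call; MGETHDR/MCLGET are the long-standing macros:

    /* before: two allocations, each taking and dropping a cache lock */
    MGETHDR(m, M_DONTWAIT, MT_DATA);
    if (m != NULL) {
            MCLGET(m, M_DONTWAIT);
            if ((m->m_flags & M_EXT) == 0) {
                    m_freem(m);
                    m = NULL;
            }
    }

    /* after: one grouped allocation under a single cache lock */
    m = m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR);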
While I'm in subr_mbuf.c:
o Every exported routine now has a nice comment with a description of
the expected arguments. Eventually, mbuf(9) needs to be revamped,
but there's still more code to write/finalize before I get to that.
o Internal macros have been changed a bit.
o Consistently use 'short' for "type"; this somehow slipped through
before ('type' was sometimes declared as an int).
Alfred has been pushing for the MBP_PERSIST{,ENT} thing for almost a
year now. Luigi asked for m_getcl(), and will probably MFC that
part of this commit.
TODO [Related]: teach mb_free() about MBP_PERSIST{,ENT}.
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
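A sketch of the two cases mentioned; the lock, softc and device
variables are illustrative, while MTX_NETWORK_LOCK is the real macro:

    /* the common case: no separate type string, so NULL is passed */
    mtx_init(&foo_mtx, "foo", NULL, MTX_DEF);

    /* network drivers: per-device name plus the shared type string */
    mtx_init(&sc->sc_mtx, device_get_nameunit(dev), MTX_NETWORK_LOCK,
        MTX_DEF);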
PAGE_SIZE / MCLBYTES == 1 crash. Fix them by changing the
appropriate "allocate new page and bucket" code in mb_alloc to use
the macro for properly grabbing an allocated object from a bucket,
the one that checks whether the bucket is empty.
This should allow ken to continue testing zero-copy stuff on -CURRENT.
Noticed and provided debug info: ken
A [hopefully] conforming style(9) revamp of mb_alloc and related code.
(This was possible due to bde's remarkable patience.)
Submitted by: (in large part) bde
Reviewed by: (the other part) bde
you run out of mbuf address space.
kern/subr_mbuf.c: print a warning message when mb_alloc fails, again
rate-limited to at most once per second. This covers other
cases of mbuf allocation failures. Probably it also overlaps the
one handled in vm/vm_kern.c, so maybe the latter should go away.
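A hedged sketch of the rate limiting described, using the kernel's
ppsratecheck(); the state variable names and message text are assumed:

    static struct timeval mb_lasterr;
    static int mb_curerr;

    if (ppsratecheck(&mb_lasterr, &mb_curerr, 1))
            printf("mb_alloc: mbuf allocation failed, see tuning(7)\n");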
This warning will let us gradually remove the printfs that are scattered
across most network drivers to report mbuf allocation failures.
Those are potentially dangerous, in that they are not rate-limited and
can easily cause systems to panic.
Unless there is disagreement (which does not seem to be the case
judging from the discussion on -net so far), and because this is
sort of a safety bugfix, I plan to commit a similar change to STABLE
during the weekend (it affects kern/uipc_mbuf.c there).
Discussed with: jlemon, silby and -net
For an object type, we maintain a variable mb_mapfull. It is 0 by default
and is only raised to 1 in one place: when an mb_pop_cont() fails for
the first time, on the assumption that the reason for the failure is
due to the underlying map for the object (e.g. clust_map, mbuf_map) being
exhausted.
Problem and Changes:
Change how we define "mb_mapfull." It now means: "set to 1 when the first
mb_pop_cont() fails only in the kmem_malloc()-ing of the object, and
only if the call was with the M_TRYWAIT flag." This is a more conservative
definition and should prevent odd [but theoretically possible] situations
from occurring, i.e., setting mb_mapfull to 1 thinking the map for the
object was actually exhausted when we had _actually_ failed to malloc()
the space for the bucket structure managing the objects in the page
we're allocating.
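A hedged fragment sketch of the narrowed condition; the list and field
names are assumed, while kmem_malloc() is the allocator named in the
text:

    caddr_t p;

    p = (caddr_t)kmem_malloc(mb_list->ml_map, PAGE_SIZE,
        (how == M_TRYWAIT) ? M_WAITOK : M_NOWAIT);
    if (p == NULL) {
            if (how == M_TRYWAIT)
                    mb_list->ml_mapfull = 1;    /* map truly exhausted */
            return (NULL);
    }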
when I changed the allocator bits. This implements per-CPU mbtypes
stats by keeping the net number of decrements/increments of a given mbtype
per CPU and then summing all of the per-CPU mbtypes to produce the total
net number of allocated mbufs of the given mbtype.
Counters are carefully balanced to avoid/prevent underflows/overflows.
mbtypes stats are re-enabled with the idea that we may occasionally
(although very rarely) observe slight inconsistencies in the stat
reporting. Most of the time, we should be fine, though.
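A hedged sketch of the summation; the structure and field names are
assumed. Each CPU keeps a signed net count per type, so an individual
per-CPU value may dip below zero while the sum stays correct:

    long total;
    int i;

    total = 0;
    for (i = 0; i < MAXCPU; i++)        /* absent CPUs contribute zeroes */
            total += mb_statpcpu[i].mb_mbtypes[MT_DATA];
    /* 'total' is the net number of allocated MT_DATA mbufs */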
Also make appropriate modifications to netstat(1) and systat(1) to do
the necessary reporting.
Submitted by: Jiangyi Liu <jyliu@163.net>
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doozy!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
order to avoid a namespace collision with subr_mchain.c's mb_init(). This
wasn't "fatal," as the mbuf initialization routine mb_init() was local to
subr_mbuf.c, which in turn didn't pull in subr_mchain.c's mb_init()
declaration, but it should definitely be changed now before it creates
headaches.
defined to 0 in the non-SMP case, which very much makes sense as it
permits its usage in per-CPU initialization loops (for an example, check
out subr_mbuf.c).
Further, on a UP system, make mb_alloc always use the first per-CPU
container, regardless of cpuid (i.e. remove the reliance on cpuid in the
UP case).
Requested by: alfred
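A sketch of such a loop, assuming the macro in question is CPU_ABSENT()
(its name lies outside the quoted text above):

    int i;

    for (i = 0; i < MAXCPU; i++) {
            if (CPU_ABSENT(i))          /* constant 0 on UP */
                    continue;
            /* set up the per-CPU container for CPU i here */
    }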
initialize in the right order so that derivative settings work right.
E.g.: at compile time, nmbufs was double nmbclusters. For POLA this should
work the same at runtime.
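A one-line sketch of the derivation described; the nmbclusters base
formula is an assumption, while the doubling comes from the text:

    nmbclusters = 1024 + maxusers * 64;     /* assumed base formula */
    nmbufs = nmbclusters * 2;               /* derived: double, per above */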
were indices in a dense array. The cpuids are a sparse set, so treat
them as such, setting up containers only for CPUs activated during
mb_init().
- Fix netstat(1) and systat(1) to treat the per-CPU stats area as a sparse
map, in accordance with the above.
This allows us to properly boot with certain CPUs deactivated. However, if
we later decide to re-activate said CPUs, we will barf until we decide to
implement CPU spinon/spinoff callback hooks to allow for said CPUs' per-CPU
containers to get configured on their activation.
Reported by: mjacob
Partially (sys/ diffs) Submitted by: mjacob
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of implementing
resource reclamation in the future. Notable advantages:
o Reduce contention for SMP by offering per-CPU pools and locks.
o Better use of data cache due to per-CPU pools.
o Much less code cache pollution now that the excessively large allocation
macros are gone.
o Framework for `grouping' objects from same page together so as to be able
to possibly free wired-down pages back to the system if they are no longer
needed by the network stacks.
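A structural sketch of the per-CPU pools described above; every name
here is invented, and the real declarations live in subr_mbuf.c:

    struct mb_bucket;                   /* group of objects from one page */

    struct mb_pcpu_list {
            struct mtx      mb_lock;    /* per-CPU pool lock */
            SLIST_HEAD(, mb_bucket) mb_bhead;   /* buckets w/ free objects */
            u_int           mb_free;    /* free-object count, for stats */
    };

    static struct mb_pcpu_list mb_cpu_pools[MAXCPU];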
Additional things changed with this addition:
- Moved some mbuf specific declarations and initializations from
sys/conf/param.c into mbuf-specific code where they belong.
- m_getclr() has been renamed to m_get_clrd() because the old name is really
confusing. m_getclr() HAS been preserved, though, and is defined to the new
name. No tree sweep has been done "to change the interface," as the old
name will continue to be supported and is not deprecated. The change was
merely done because m_getclr() sounds too much like "m_get a cluster."
- TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and
systat(1) (see TODO below).
- Fixed systat(1) to display number of "free mbufs" based on new per-CPU
stat structures.
- Fixed netstat(1) to display the new per-CPU stats based on sysctl-exported
per-CPU stat structures. All info is fetched via sysctl.
TODO (in order of priority):
- Re-enable mbtypes statistics in both netstat(1) and systat(1) after
introducing an SMP friendly way to collect the mbtypes stats under the
already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
seems too costly for a mere stat update, especially when other locks are
already present).
- Optionally have systat(1) display not only "total free mbufs" but also
"total free mbufs per CPU pool."
- Fix minor length-fetching issues in netstat(1) related to the recently
re-enabled option to read mbuf stats from a core file.
- Move reference counters, at least for mbuf clusters, into an unused
portion of the cluster itself, to save space and avoid having to
allocate a counter separately.
- Look into introducing resource freeing possibly from a kproc.
Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/