Commit Graph

113 Commits

Author SHA1 Message Date
John Baldwin
afa60c068e Invoke ext_free function when freeing an unmapped mbuf.
Fix a mis-merge when extracting the unmapped mbuf changes from
Netflix's in-kernel TLS changes where the call to the function that
freed the backing pages from an unmapped mbuf was missed.

Sponsored by:	Chelsio Communications
2019-07-02 22:58:21 +00:00
John Baldwin
cec06a3edc Add support for using unmapped mbufs with sendfile(2).
This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:49:35 +00:00
John Baldwin
82334850ea Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Gleb Smirnoff
eec189c70b Add new m_ext type for data for M_NOFREE mbufs, which doesn't actually do
anything except several assertions.  This type is going to be used for
temporary on stack mbufs, that point into data in receive ring of a NIC,
that shall not be freed.  Such mbuf can not be stored or reallocated, its
life time is current context.
2019-01-31 22:37:28 +00:00
Andriy Voskoboinyk
58c43838f7 m_getm2: correct a comment.
The comment states that function always return a top of allocated mbuf;
however, the function actually return the overall mbuf chain top pointer.

Since there are already existing users of it (via m_getm(4) macro),
rephrase the comment and leave behavior unchanged.

PR:		134335
MFC after:	12 days
2019-01-27 16:44:27 +00:00
Conrad Meyer
0d1467b199 netdump: Fix netdumping with INVARIANTS kernels
Correct boneheaded assertion I added in r339501.  Mea culpa.

The intent is to notice when an M_WAITOK zone allocation would fail during
netdump, not to prevent all use of mbufs during netdump.

Reviewed by:	markj
X-MFC-With:	r339501
Differential Revision:	https://reviews.freebsd.org/D17957
2018-11-12 05:24:20 +00:00
Mark Johnston
9978bd996b Add malloc_domainset(9) and _domainset variants to other allocator KPIs.
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain.  The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted.  Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with:	gallatin, jeff
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17418
2018-10-30 18:26:34 +00:00
Conrad Meyer
3937ee7557 netdump: Zone mbufs should be allocated before dump
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17306
2018-10-20 22:24:58 +00:00
Mark Johnston
5475ca5aca Add an mbuf allocator for netdump.
The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.

Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.

Reviewed by:	cem (earlier version), julian, sbruno
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15251
2018-05-06 00:19:48 +00:00
Mark Johnston
c2ba2d1b0e Style.
MFC after:	3 days
2018-05-06 00:11:30 +00:00
Jeff Roberson
ab3185d15e Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
Pedro F. Giffuni
8a36da99de sys/kern: adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:20:12 +00:00
Gleb Smirnoff
9c82bec42d Improvements to sendfile(2) mbuf free routine.
o Fall back to default m_ext free mech, using function pointer in
  m_ext_free, and remove sf_ext_free() called directly from mbuf code.
  Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
  SF_SYNC flag.  Lack of the flag allows us not to dereference
  ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
  multi-page mbufs.  For now compiler will inline it into
  sendfile_free_mext().

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 21:06:16 +00:00
Gleb Smirnoff
07e87a1d55 In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively,
in mb_free_ext() always use fields from the original refcount holding
mbuf (see. r296242) mbuf.  Cuts another cache miss from mb_free_ext().

However, treat EXT_EXTREF mbufs differently, since they are different -
they don't have a refcount holding mbuf.

Provide longer comments in m_ext declaration to explain this change
and change from r296242.

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:51:58 +00:00
Gleb Smirnoff
e8fd18f306 Shorten list of arguments to mbuf external storage freeing function.
All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list.  Not all functions need the second
argument, some don't even need the first one.  The second argument
lives in next cache line, so not dereferencing it is a performance
gain.  This was discovered in sendfile(2), which will be covered by
next commits.

The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.

Reviewed by:	gallatin, kbowling
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:35:31 +00:00
Scott Long
4c7070db25 Import the 'iflib' API library for network drivers. From the author:
"iflib is a library to eliminate the need for frequently duplicated device
independent logic propagated (poorly) across many network drivers."

Participation is purely optional.  The IFLIB kernel config option is
provided for drivers that want to transition between legacy and iflib
modes of operation.  ixl and ixgbe driver conversions will be committed
shortly.  We hope to see participation from the Broadcom and maybe
Chelsio drivers in the near future.

Submitted by:   mmacy@nextbsd.org
Reviewed by:    gallatin
Differential Revision:  D5211
2016-05-18 04:35:58 +00:00
Gleb Smirnoff
17cd649f4a Add a comment and KASSERT that a M_NOFREE mbuf has always EXT_EXTREF ext.
Submitted by:	kmacy
2016-05-17 23:15:16 +00:00
Pedro F. Giffuni
e3043798aa sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
Navdeep Parhar
fddd4f6273 Plug leak in m_unshare.
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE.  The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl.  M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.

Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.

Update netmap_get_mbuf to not pass M_NOFREE to m_getcl.  It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.

Reviewed by:	gnn@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5698
2016-03-26 23:39:53 +00:00
Gleb Smirnoff
c8f59118be Space and style(9) corrections for recent mbuf changes. 2016-03-24 20:06:52 +00:00
George V. Neville-Neil
480f4e946d Add an mbuf provider to DTrace.
The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed.  This can be
used to inspect the buffers or for a simplified mbuf leak detector.

New tracepoints are:

mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem

There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.

Reviewed by:	bz, markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5682
2016-03-22 13:16:52 +00:00
Gleb Smirnoff
0ea37a86ba Fix regression in r296242 affecting several drivers. For EXT_NET_DRV,
EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free
callback, then free the mbuf, otherwise we will derefernce memory that
was just freed.

Reported and tested:	jhibbits
2016-03-02 02:12:01 +00:00
Gleb Smirnoff
56a5f52e80 New way to manage reference counting of mbuf external storage.
The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with:		rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D5396
Sponsored by:		Netflix
2016-03-01 00:17:14 +00:00
Gleb Smirnoff
5e4bc63b7c o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all
mbuf(9) manipulation functions into uipc_mbuf.c.  This looks like
  the initial intent, but had diffused in the last decade.

o Gather all declarations in mbuf.h in one place and sort them.

o Uninline m_clget() and m_cljget().

There are no functional changes in this patch.

The patch comes from a larger version, where all mbuf(9) allocation was
uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c.
The performance impact of the total uninlining is still unclear, so we
are holding on now with larger version.

Together with:	melifaro, olivier
2016-02-11 21:32:23 +00:00
Gleb Smirnoff
b4b12e52fb Garbage collect unused arguments of m_init(). 2016-02-10 18:54:18 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Gleb Smirnoff
9542ea7b80 Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
Ryan Stone
f2c2231e0c Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
Navdeep Parhar
a9fa76f27e Test for absence of M_NOFREE before attempting to purge the mbuf's tags.
This will leave more state intact should the assertion go off.

MFC after:	1 month
2014-09-30 23:16:26 +00:00
Gleb Smirnoff
1fbe6a82f4 Improve reference counting of EXT_SFBUF pages attached to mbufs.
o Do not use UMA refcount zone. The problem with this zone is that
  several refcounting words (16 on amd64) share the same cache line,
  and issueing atomic(9) updates on them creates cache line contention.
  Also, allocating and freeing them is extra CPU cycles.
  Instead, refcount the page directly via vm_page_wire() and the sfbuf
  via sf_buf_alloc(sf_buf_page(sf)) [1].

o Call refcounting/freeing function for EXT_SFBUF via direct function
  call, instead of function pointer. This removes barrier for CPU
  branch predictor.

o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to
  satisfy assertion in mb_dtor_mbuf(). Remove the assertion from
  mb_dtor_mbuf(). Use bcopy() instead of manual assignments to
  copy m_ext in mb_dupcl().

[1] This has some problems for now. Using sf_buf_alloc() merely to
    increase refcount is expensive, and is broken on sparc64. To be
    fixed.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-07-11 19:40:50 +00:00
Gleb Smirnoff
fcc34a238c Fix style bug: rename the refcount field of m_ext to ext_cnt, to match
other members.

Sponsored by:	Nginx, Inc.
2014-07-11 14:34:29 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Ed Maste
b7bd677fe1 Initialise m_pkthdr via bzero instead of explicitly zeroing each member
Sponsored by:	The FreeBSD Foundation
2014-04-04 21:09:06 +00:00
John Baldwin
d251e7006b Ignore attempts to set the nmbcluster sysctls to their current value
rather than failing with an error.

Reviewed by:	andre
Approved by:	re (delphij)
MFC after:	2 weeks
2013-10-10 16:11:34 +00:00
Hiren Panchasara
b6f49c23a3 Fixing a small typo.
Reviewed by:	gjb
Approved by:	sbruno (mentor)
2013-09-05 18:18:23 +00:00
Andre Oppermann
ce28636bcf After r254779 "error" must always be present in mb_ctor_pack(),
not only when MAC is defined.

Reported by:	gjb / tinderbox
Sponsored by:	The FreeBSD Foundation
2013-08-24 21:25:53 +00:00
Andre Oppermann
ce6169e715 Remove unused m_free_fast(). The difference to m_free() is only
2 predictable branches nowadays.  However as a pre-condition the
caller had to ensure that the mbuf pkthdr did not have any mtags
attached to it, costing some potential branches again.

Sponsored by:	The FreeBSD Foundation
2013-08-24 21:09:57 +00:00
Andre Oppermann
1b4381afbb Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
Andre Oppermann
894734cbd6 dd a 24 bits wide ext_flags field to m_ext by reducing ext_type
to 8 bits.  ext_type is an enumerator and the number of types we
have is a mere dozen.

A couple of ext_types are renumbered to fit within 8 bits.

EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and
experimental local mapping.

The ext_flags field is currently unused but has a couple of flags
already defined for future use.  Again vendor and experimental
flags are provided for local mapping.

EXT_FLAG_BITS is provided for the printf(9) %b identifier.

Initialize and copy ext_flags in the relevant mbuf functions.

Improve alignment and packing of struct m_ext on 32 and 64 archs
by carefully sorting the fields.
2013-08-24 13:15:42 +00:00
Andre Oppermann
afb295cc9a Avoid code duplication for mbuf initialization and use m_init() instead
in mb_ctor_mbuf() and mb_ctor_pack().
2013-08-24 12:24:58 +00:00
Andre Oppermann
1227f20d26 Revert r254520 and resurrect the M_NOFREE mbuf flag and functionality.
Requested by:	np, grehan
2013-08-21 18:12:04 +00:00
Andre Oppermann
aa3cb8fb64 Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users
for a very long time, if ever.

Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING external mbuf
type together with a dedicated *ext_free function.

Discussed with:	trociny, glebius
2013-08-19 11:16:53 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Gleb Smirnoff
59b9c4f289 Nuke mbstat. It wasn't used for mbuf statistics since FreeBSD 5.
Now that r253351 moved sendfile() stats to a separate struct, the
last field used in mbstat is m_mcfail, which is updated, but never
read or obtained from userland.
2013-07-15 12:18:36 +00:00
Andrey V. Elsukov
05d1f5bce0 Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for update these statistics.

Discussed with:	glebius
2013-07-15 06:16:57 +00:00
Andre Oppermann
bc4a1b8ccd Make use of the fact that uma_zone_set_max(9) already returns the
rounded limit making a call to uma_zone_get_max(9) unnecessary.

MFC after:	1 day
2013-07-11 12:53:13 +00:00