Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).
It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).
The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.
Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.
The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.
No other functional change intended.
Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421
Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.
When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
where part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.
This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.
Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21796
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.
With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.
Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.
Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
Fix a mis-merge when extracting the unmapped mbuf changes from
Netflix's in-kernel TLS changes where the call to the function that
freed the backing pages from an unmapped mbuf was missed.
Sponsored by: Chelsio Communications
This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
anything except several assertions. This type is going to be used for
temporary on stack mbufs, that point into data in receive ring of a NIC,
that shall not be freed. Such mbuf can not be stored or reallocated, its
life time is current context.
The comment states that function always return a top of allocated mbuf;
however, the function actually return the overall mbuf chain top pointer.
Since there are already existing users of it (via m_getm(4) macro),
rephrase the comment and leave behavior unchanged.
PR: 134335
MFC after: 12 days
Correct boneheaded assertion I added in r339501. Mea culpa.
The intent is to notice when an M_WAITOK zone allocation would fail during
netdump, not to prevent all use of mbufs during netdump.
Reviewed by: markj
X-MFC-With: r339501
Differential Revision: https://reviews.freebsd.org/D17957
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.
This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.
Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.
Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.
Reviewed by: cem (earlier version), julian, sbruno
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15251
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
o Fall back to default m_ext free mech, using function pointer in
m_ext_free, and remove sf_ext_free() called directly from mbuf code.
Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
SF_SYNC flag. Lack of the flag allows us not to dereference
ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
multi-page mbufs. For now compiler will inline it into
sendfile_free_mext().
In collaboration with: gallatin
Differential Revision: https://reviews.freebsd.org/D12615
in mb_free_ext() always use fields from the original refcount holding
mbuf (see. r296242) mbuf. Cuts another cache miss from mb_free_ext().
However, treat EXT_EXTREF mbufs differently, since they are different -
they don't have a refcount holding mbuf.
Provide longer comments in m_ext declaration to explain this change
and change from r296242.
In collaboration with: gallatin
Differential Revision: https://reviews.freebsd.org/D12615
All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list. Not all functions need the second
argument, some don't even need the first one. The second argument
lives in next cache line, so not dereferencing it is a performance
gain. This was discovered in sendfile(2), which will be covered by
next commits.
The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.
Reviewed by: gallatin, kbowling
Differential Revision: https://reviews.freebsd.org/D12615
"iflib is a library to eliminate the need for frequently duplicated device
independent logic propagated (poorly) across many network drivers."
Participation is purely optional. The IFLIB kernel config option is
provided for drivers that want to transition between legacy and iflib
modes of operation. ixl and ixgbe driver conversions will be committed
shortly. We hope to see participation from the Broadcom and maybe
Chelsio drivers in the near future.
Submitted by: mmacy@nextbsd.org
Reviewed by: gallatin
Differential Revision: D5211
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE. The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl. M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.
Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.
Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.
Reviewed by: gnn@
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5698
The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed. This can be
used to inspect the buffers or for a simplified mbuf leak detector.
New tracepoints are:
mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem
There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.
Reviewed by: bz, markj
MFC after: 2 weeks
Sponsored by: Rubicon Communications (Netgate)
Differential Revision: https://reviews.freebsd.org/D5682
EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free
callback, then free the mbuf, otherwise we will derefernce memory that
was just freed.
Reported and tested: jhibbits
The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.
The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.
For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.
For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.
Discussed with: rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D5396
Sponsored by: Netflix
mbuf(9) manipulation functions into uipc_mbuf.c. This looks like
the initial intent, but had diffused in the last decade.
o Gather all declarations in mbuf.h in one place and sort them.
o Uninline m_clget() and m_cljget().
There are no functional changes in this patch.
The patch comes from a larger version, where all mbuf(9) allocation was
uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c.
The performance impact of the total uninlining is still unclear, so we
are holding on now with larger version.
Together with: melifaro, olivier
exhausted.
It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.
While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.
This patch makes two changes:
a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.
b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().
Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.
Note to self: When this is MFCed, sparc64 needs the same fix.
Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks
o Do not use UMA refcount zone. The problem with this zone is that
several refcounting words (16 on amd64) share the same cache line,
and issueing atomic(9) updates on them creates cache line contention.
Also, allocating and freeing them is extra CPU cycles.
Instead, refcount the page directly via vm_page_wire() and the sfbuf
via sf_buf_alloc(sf_buf_page(sf)) [1].
o Call refcounting/freeing function for EXT_SFBUF via direct function
call, instead of function pointer. This removes barrier for CPU
branch predictor.
o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to
satisfy assertion in mb_dtor_mbuf(). Remove the assertion from
mb_dtor_mbuf(). Use bcopy() instead of manual assignments to
copy m_ext in mb_dupcl().
[1] This has some problems for now. Using sf_buf_alloc() merely to
increase refcount is expensive, and is broken on sparc64. To be
fixed.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
2 predictable branches nowadays. However as a pre-condition the
caller had to ensure that the mbuf pkthdr did not have any mtags
attached to it, costing some potential branches again.
Sponsored by: The FreeBSD Foundation
features. The changes in particular are:
o Remove rarely used "header" pointer and replace it with a 64bit protocol/
layer specific union PH_loc for local use. Protocols can flexibly overlay
their own 8 to 64 bit fields to store information while the packet is
worked on.
o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
instead of pkthdr.header.
o Extend csum_flags to 64bits to allow for additional future offload
information to be carried (e.g. iSCSI, IPsec offload, and others).
o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
rsstype field. Adjust accessor macros.
o Add cosqos field to store Class of Service / Quality of Service information
with the packet. It is not yet supported in any drivers but allows us to
get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
a modernized ALTQ.
o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
from the start of the packet. This is important for various offload
capabilities and to relieve the drivers from having to parse the packet
and protocol headers to find out location of checksums and other
information. Header parsing in drivers is a lot of copy-paste and
unhandled corner cases which we want to avoid.
o Add another flexible 64bit union to map various additional persistent
packet information, like ether_vtag, tso_segsz and csum fields.
Depending on the csum_flags settings some fields may have different usage
making it very flexible and adaptable to future capabilities.
o Restructure the CSUM flags to better signify their outbound (down the
stack) and inbound (up the stack) use. The CSUM flags used to be a bit
chaotic and rather poorly documented leading to incorrect use in many
places. Bring clarity into their use through better naming.
Compatibility mappings are provided to preserve the API. The drivers
can be corrected one by one and MFC'd without issue.
o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).
Sponsored by: The FreeBSD Foundation
to 8 bits. ext_type is an enumerator and the number of types we
have is a mere dozen.
A couple of ext_types are renumbered to fit within 8 bits.
EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and
experimental local mapping.
The ext_flags field is currently unused but has a couple of flags
already defined for future use. Again vendor and experimental
flags are provided for local mapping.
EXT_FLAG_BITS is provided for the printf(9) %b identifier.
Initialize and copy ext_flags in the relevant mbuf functions.
Improve alignment and packing of struct m_ext on 32 and 64 archs
by carefully sorting the fields.
for a very long time, if ever.
Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING external mbuf
type together with a dedicated *ext_free function.
Discussed with: trociny, glebius