Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.
Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting
Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26409
This change is based on the nexthop objects landed in D24232.
The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.
Simular to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.
With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presense of
NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.
User-visible changes:
The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.
All routes now comes with weight, default weight is 1, maximum is 2^24-1.
Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.
Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1
route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20
netstat -6On
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
1 ------- ------- ------- --------------------------------------- --------- 1
13 10 1 2001:db8::2 vlan2
14 20 2 2001:db8::3 vlan2
Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default
Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
that rtentry can contain multiple nexthops.
Differential Revision: https://reviews.freebsd.org/D26497
Move the initialization of these variables to the beginning of their
respective functions.
On our end this creates a small amount of unneeded churn, as these
variables are properly initialized before their first use in all cases.
However, changing this benefits at least one downstream consumer
(NetApp) by allowing local and future modifications to these functions
to be made without worrying about where the initialization occurs.
Reviewed by: melifaro, rscheff
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26454
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.
A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.
On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.
An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.
On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.
Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.
Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.
Reviewed by: kib@
Relnotes: Yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25873
This is in preparation for enabling a loadable SCTP stack. Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.
Discussed with: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
fib4_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24976
Currently NH_IS_VALID() simly aliases to RT_LINK_IS_UP(), so we're
checking the same thing twice.
In the near future the implementation of this check will be simpler,
as there are plans to introduce control-plane interface status monitoring
similar to ipfw interface tracker.
The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.
When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped. Failure
to stamp a NIC TLS packet will result in crypto
issues.
Reviewed by: hselasky, rrs
Sponsored by: Netflix, Mellanox
New fib[46]_lookup() functions support multipath transparently.
Given that, switch the last rtalloc_mpath_fib() calls to
dib4_lookup() and eliminate the function itself.
Note: proper flowid generation (especially for the outbound traffic) is a
bigger topic and will be handled in a separate review.
This change leaves flowid generation intact.
Differential Revision: https://reviews.freebsd.org/D24595
This change is build on top of nexthop objects introduced in r359823.
Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.
Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.
Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.
Differential Revision: https://reviews.freebsd.org/D24340
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.
Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:
https://queue.acm.org/detail.cfm?id=3022184
or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on implementing V2
of BBR to see if it is an improvement over V1.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D21582
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
If we take the slow path for forwarding we should still tell our
firewalls (hooked through pfil(9)) that we're forwarding. Pass the
ip_output() flags to ip_output_pfil() so it can set the PFIL_FWD flag
when we're forwarding.
MFC after: 1 week
Sponsored by: Axiado
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use fast KPI, gaining
performance advantage.
Typical case when ip_output() is called without a PCB pointer is a
sendto(2) on a not connected UDP socket. In practice DNS servers do
this.
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D19804
Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.
Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.
Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.
Reviewed by: karels, kristof
Approved by: bde (mentor)
MFC after: 3 weeks
Sponsored by: John Gilmore
Differential Revision: https://reviews.freebsd.org/D19317
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
r336616 copies inp->inp_options using the m_dup() function.
However, this function expects an mbuf packet header at the beginning,
which is not true in this case.
Therefore, use m_copym() instead of m_dup().
This issue was found by syzkaller.
Reviewed by: mmacy@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D18753
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
locally generated SCTP packets sent over IPv4. This make
the behaviour consistent with IPv6.
Reviewed by: ae@, bz@, jtl@
Approved by: re (kib@)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17406
When getting the number of bytes to checksum make sure to convert the UDP
length to host byte order when the entire header is not in the first mbuf.
Reviewed by: jtl, tuexen, ae
Approved by: re (gjb), jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D17357
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).
This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
Use m_copyback() function to write checksum when it isn't located
in the first mbuf of the chain. Handmade analog doesn't handle the
case when parts of checksum are located in different mbufs.
Also in case when mbuf is too short, m_copyback() will allocate new
mbuf in the chain instead of making out of bounds write.
Also wrap long line and remove now useless KASSERTs.
X-MFC after: r334705
The length of the IP payload is normally equal to the UDP length, UDP Options
(draft-ietf-tsvwg-udp-options-02) suggests using the difference between IP
length and UDP length to create space for trailing data.
Correct checksum length calculation to use the UDP length rather than the IP
length when not offloading UDP checksums.
Approved by: jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D15222
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).
Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.
Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.
Reviewed by: ae, kevans
Differential Revision: https://reviews.freebsd.org/D13715
Add a new macro to clear both the L3 and L2 route caches, to
hopefully prevent future instances where only the L3 cache was
cleared when both should have been.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13989
Reviewed by: karels
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
flowtable anymore (as flowtable was never considered to be useful in
the forwarding path).
Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D11448
ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache
entry because they used their own routes on the stack, not in_pcb routes.
The original model for route caching was callers that provided a route
structure to ip{,6}input() would keep the route, and this model was used
for L2 caching as well. Instead, change L2 caching to be done by default
only when using a route structure in the in_pcb; the pcb deallocation
code frees L2 as well as L3 cacches. A separate change will add route
caching to TCP/IPv6.
Another suggestion was to have the transport protocols indicate willingness
to use L2 caching, but this approach keeps the changes in the network
level
Reviewed by: ae gnn
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10059