* Fix bug with /32 aliases introduced in 81728a538d.
* Explicitly document business logic for IPv4 ifa routes.
* Remove remnants of rtinit()
* Deduplicate ifa->route prefix code by moving it into ia_getrtprefix()
* Deduplicate conditional check for ifa_maintain_loopback_route() by
moving into ia_need_loopback_route()
* Remove now-unused flags argument from in_addprefix().
Reviewed by: donner
PR: 252883
Differential Revision: https://reviews.freebsd.org/D28246
Summary:
When using the base stack in conjunction with RACK, it appears that
infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh.
This leaves the recover flightsize (sackhint.recover_fs) uninitialized,
leading to a div/0 panic.
Address this by properly initializing the variable just prior to first
use, if it is not properly initialized.
In order to prevent stale information from a prior recovery to
negatively impact the PRR calculations in this event, also clear
recover_fs once loss recovery is finished.
Finally, improve the readability of the initialization of recover_fs
when t_dupacks == tcprexmtthresh by adjusting the indentation and
using the max(1, snd_nxt - snd_una) macro.
Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages
Reviewed By: rrs, kbowling, #transport
Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro
Differential Revision: https://reviews.freebsd.org/D28114
There are many casts of this struct to uint32_t, so we also need to ensure
that it is sufficiently aligned to safely perform this cast on architectures
that don't allow unaligned accesses. This fixes lots of -Wcast-align warnings.
Reviewed By: ae
Differential Revision: https://reviews.freebsd.org/D27879
This fixes -Wcast-align warnings caused by the underaligned `struct ip`.
This also silences them in the public functions by changing the function
signature from char * to void *. This is source and binary compatible and
avoids the -Wcast-align warning.
Reviewed By: ae, gbe (manpages)
Differential Revision: https://reviews.freebsd.org/D27882
rtinit[1]() is a function used to add or remove interface address prefix routes,
similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
in reality.
1) IPv6 code does not use it for the ifa routes. There is a separate layer,
nd6_prelist_(), providing interface for maintaining interface routes. Its part,
responsible for the actual route table interaction, mimics rtenty() code.
2) rtinit tries to combine multiple actions in the same function: constructing
proper route attributes and handling iterations over multiple fibs, for the
non-zero net.add_addr_allfibs use case. It notably increases the code complexity.
3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
for p2p connections, host routes and p2p routes are handled in the same way.
Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
aliases.
4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
complicating rib_action() implementation.
5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
ifa messages in certain corner cases.
To address all these points, the following has been done:
* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
responsible for the actual route notifications.
Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses
Differential revision: https://reviews.freebsd.org/D28186
When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.
Reviewed by: jtl@, rscheff@
Differential Revision: https://reviews.freebsd.org/D28142
Sponsored by: Netflix, Inc.
PR: 252449
MFC after: 1 week
A TCP RST segment should be processed even it is missing TCP
timestamps.
Reported by: dmgk@, kevans@
Reviewed by: rscheff@, dmgk@
Sponsored by: Netflix, Inc.
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28143
Relevant inet/inet6 code has the control over deciding what
the RIB lookup function currently is. With that in mind,
explicitly set it to the current value (rn_match) in the
datapath lookups. This avoids cost on indirect call.
Differential Revision: https://reviews.freebsd.org/D28066
Currently default behaviour is to keep only 1 packet per unresolved entry.
Ability to queue more than one packet was added 10 years ago, in r215207,
though the default value was kep intact.
Things have changed since that time. Systems tend to initiate multiple
connections at once for a variety of reasons.
For example, recent kern/252278 bug report describe happy-eyeball DNS
behaviour sending multiple requests to the DNS server.
The primary driver for upper value for the queue length determination is
memory consumption. Remote actors should not be able to easily exhaust
local memory by sending packets to unresolved arp/ND entries.
For now, bump value to 16 packets, to match Darwin implementation.
The proper approach would be to switch the limit to calculate memory
consumption instead of packet count and limit based on memory.
We should MFC this with a variation of D22447.
Reviewers: #manpages, #network, bz, emaste
Reviewed By: emaste, gbe(doc), jilles(doc)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28068
This change introduces framework that allows to dynamically
attach or detach longest prefix match (lpm) lookup algorithms
to speed up datapath route tables lookups.
Framework takes care of handling initial synchronisation,
route subscription, nhop/nhop groups reference and indexing,
dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
picking the best matching algorithm on-the-fly based on the
amount of routes in the routing table.
Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.
The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
for large-fib (D27412)
Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
radix4: ~20mpps
radix4_lockless: ~24.8mpps
bsearch4: ~69mpps
dpdk_lpm4: ~67 mpps
700k routes:
radix4_lockless: 3.3mpps
dpdk_lpm4: 46mpps
IPv6:
8 routes:
radix6_lockless: ~20mpps
dpdk_lpm6: ~70mpps
100k routes:
radix6_lockless: 13.9mpps
dpdk_lpm6: 57mpps
Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)
Control:
Framwork adds the following runtime sysctls:
List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless
Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.
Differential Revision: https://reviews.freebsd.org/D27401
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
collision. This avouds an out-of-bounce access in case the peer can
break the cookie signature. Thanks to Felix Wilhelm from Google for
reporting the issue.
MFC after: 1 week
a restart.
This fixes a use-after-free scenario, which was reported by Felix
Wilhelm from Google in case a peer is able to modify the cookie.
However, this can also be triggered by an assciation restart under
some specific conditions.
MFC after: 1 week
PRR improves loss recovery and avoids RTOs in a wide range
of scenarios (ACK thinning) over regular SACK loss recovery.
PRR is disabled by default, enable by net.inet.tcp.do_prr = 1.
Performance may be impeded by token bucket rate policers at
the bottleneck, where net.inet.tcp.do_prr_conservate = 1
should be enabled in addition.
Submitted by: Aris Angelogiannopoulos
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D18892
ROUTE_MPATH is the new config option controlling new multipath routing
implementation. Remove the last pieces of RADIX_MPATH-related code and
the config option.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D27244
No functional changes.
* Make lookup path of fib<4|6>_lookup_debugnet() separate functions
(fib<46>_lookup_rt()). These will be used in the control plane code
requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
This change simplifies the switch to alternative lookup implementations,
which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys
Differential Revision: https://reviews.freebsd.org/D27405
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_forach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
consistent and prepare for upcoming iterator optimizations.
Differential Revision: https://reviews.freebsd.org/D27219
with to == NULL for SYN segments. So don't assume tp != NULL.
Thanks to jhb@ for reporting and suggesting a fix.
PR: 250499
MFC after: 1 week
XMFC-with: r367530
Sponsored by: Netflix, Inc.
due to its lack of support for ICMP redirects. The following commit
adds redirects to the fastforward path, again allowing for decent
forwarding performance in the kernel.
Reviewed by: ae, melifaro
Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")
* TCP segments without timestamps should be dropped when support for
the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
for the timestamp option has not been negotiated.
This patch enforces the above.
PR: 250499
Reviewed by: gnn, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D27148
Currently there is no locking done to protect this structure. It is
likely okay due to the low-volume nature of IGMP, but allows for
the possibility of underflow. This appears to be one of the only
holdouts of the conversion to counter(9) which was done for most
protocol stat structures around 2013.
This also updates the visibility of this stats structure so that it can
be consumed from elsewhere in the kernel, consistent with the vast
majority of VNET_PCPUSTAT structures.
Reviewed by: kp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27023
Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.
Reported by: chengc_netapp.com
Reviewed by: rrs, jtl
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24237
This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
Send tags are refcounted and if_snd_tag_free() is called by
m_snd_tag_rele() when the last reference is dropped on a send tag.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26995
In r348254, if_snd_tag_alloc() routines were changed to bump the ifp
refcount via m_snd_tag_init(). This function wasn't in the tree at
the time and wasn't updated for the new semantics, so was still doing
a separate bump after if_snd_tag_alloc() returned.
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26999
r350501 added the 'st' parameter, but did not pass it down to
if_snd_tag_alloc().
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26997
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
This will pave the way of setting ssthresh differently in TCP CUBIC, according
to RFC8312 section 4.7.
No functional change, only code movement.
Submitted by: chengc_netapp.com
Reviewed by: rrs, tuexen, rscheff
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26807
Pad the icmp6stat structure so that we can add more counters in the
future without breaking compatibility again, last done in r358620.
Annotate the rarely executed error paths with __predict_false while
here.
Reviewed by: bz, melifaro
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26578
connections over multiple paths.
Multipath routing relies on mbuf flowid data for both transit
and outbound traffic. Current code fills mbuf flowid from inp_flowid
for connection-oriented sockets. However, inp_flowid is currently
not calculated for outbound connections.
This change creates simple hashing functions and starts calculating hashes
for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
system.
Reviewed by: glebius (previous version),ae
Differential Revision: https://reviews.freebsd.org/D26523
The staleness reported in an error cause is in us, not ms.
Enforce limits on the life time via sysct; and socket options
consistently. Update the description of the sysctl variable to
use the right unit. Also do some minor cleanups.
This also fixes an interger overflow issue if the peer can
modify the cookie. This was reported by Felix Weinrank by fuzz testing
the userland stack and in
https://oss-fuzz.com/testcase-detail/4800394024452096
MFC after: 3 days
It is lightweight way to check if an IPv4 address exists.
Submitted by: Roy Marples
Reviewed by: gnn, melifaro
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26636