While testing simultaneous open TCP with ECN, found that
negotiation fails to arrive at the expected final state.
Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23373
Otherwise accept filters compiled into the kernel do not preempt
preloaded accept filter modules. Then, the preloaded file registers its
accept filter module before the kernel, and the kernel's attempt fails
since duplicate accept filter list entries are not permitted. This
causes the preloaded file's module to be released, since
module_register_init() does a lookup by name, so the preloaded file is
unloaded, and the accept filter's callback points to random memory since
preload_delete_name() unmaps the file on x86 as of r336505.
Add a new ACCEPT_FILTER_DEFINE macro which wraps the accept filter and
module definitions, and ensures that a module version is defined.
PR: 245870
Reported by: Thomas von Dein <freebsd@daemon.de>
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
cem noted that on FreeBSD snprintf() can not fail and code should not
check for that.
A followup commit will replace the usage of snprintf() in the SCTP
sources with a variadic macro SCTP_SNPRINTF, which will simply map to
snprintf() on FreeBSD and do a checking similar to r361209 on
other platforms.
Note that in_pcb_lport and in_pcb_lport_dest can be called with a NULL
local address for IPv6 sockets; handle it. Found by syzkaller.
Reported by: cem
MFC after: 1 month
Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.
Reviewed by: tuexen (rscheff, rrs previous version)
MFC after: 1 month
Sponsored by: Forcepoint LLC
Differential Revision: https://reviews.freebsd.org/D24781
Currently NH_IS_VALID() simly aliases to RT_LINK_IS_UP(), so we're
checking the same thing twice.
In the near future the implementation of this check will be simpler,
as there are plans to introduce control-plane interface status monitoring
similar to ipfw interface tracker.
The CU-SeeMe videoconferencing client and associated protocol is at this
point a historical artifact; there is no need to retain support for this
protocol today.
Reviewed by: philip, markj, allanjude
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24790
This problem was found by looking at syzkaller reproducers for some other
problems.
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24831
help of Michael Tuexen. There was some accounting
errors with TCPFO for bbr and also for both rack
and bbr there was a FO case where we should be
jumping to the just_return_nolock label to
exit instead of returning 0. This of course
caused no timer to be running and thus the
stuck sessions.
Reported by: Michael Tuexen and Skyzaller
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24852
admbugs: 956
Submitted by: markj
Reported by: Vishnu Dev TJ working with Trend Micro Zero Day Initiative
Security: FreeBSD-SA-20:13.libalias
Security: CVE-2020-7455
Security: ZDI-CAN-10849
admbugs: 956
Submitted by: ae
Reported by: Lucas Leong (@_wmliang_) of Trend Micro Zero Day Initiative
Reported by: Vishnu working with Trend Micro Zero Day Initiative
Security: FreeBSD-SA-20:12.libalias
The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.
When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped. Failure
to stamp a NIC TLS packet will result in crypto
issues.
Reviewed by: hselasky, rrs
Sponsored by: Netflix, Mellanox
causes so_reuseport_lb_test to fail since it slows down how quickly the program runs until the timeout occurs
and fails the test
Sponsored by: Netflix inc.
Differential Revision: https://reviews.freebsd.org/D24747
a few extra arguments). Recently that changed to only have one arg extra so
that two ifdefs around the call are no longer needed. Lets take out the
extra ifdef and arg.
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D24736
look at when generating a SACK. This was wrong in case of sequence
numbers wrap arounds.
Thanks to Gwenael FOURRE for reporting the issue for the userland stack:
https://github.com/sctplab/usrsctp/issues/462
MFC after: 3 days
1) When BBR retransmits the syn it was messing up the snd_max
2) When we need to send a RST we might not send it when we should
Reported by: ankitraheja09@gmail.com
Sponsored by: Netflix.com
Differential Revision: https://reviews.freebsd.org/D24693
This was only triggered when setting the IPPROTO_TCP level socket
option TCP_DELACK.
This issue was found by runnning an instance of SYZKALLER.
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24690
have been made in rack and adds a few fixes in BBR. This also
removes any possibility of incorrectly doing OOB data the stacks
do not support it. Should fix the skyzaller crashes seen in the
past. Still to fix is the BBR issue just reported this weekend
with the SYN and on sending a RST. Note that this version of
rack can now do pacing as well.
Sponsored by:Netflix Inc
Differential Revision:https://reviews.freebsd.org/D24576
if it can support the PRUS option (OOB). And then have
the new function call that to validate and give the
correct error response if needed to the user (rack
and bbr do not support obsoleted OOB data).
Sponsoered by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D24574
After converting routing subsystem customers to use nexthop objects
defined in r359823, some fields in struct rtentry became unused.
This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
along with the code initializing and updating these fields.
Cleanup of the remaining fields will be addressed by D24669.
This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
nexthop and tries to update rte nhop pointer. Only last part is done under
WLOCK.
In the hypothetical scenarious where multiple rtsock clients
repeatedly issue RTM_CHANGE requests for the same route, route may get
updated between read and update operation. This is addressed by retrying
the operation multiple (3) times before returning failure back to the
caller.
Differential Revision: https://reviews.freebsd.org/D24666
They have more differencies than similarities. For now there is lots
of code that would check for M_EXT only and work correctly on M_EXTPG
buffers, so still carry M_EXT bit together with M_EXTPG. However,
prepare some code for explicit check for M_EXTPG.
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598
Running TCP Cubic together with ECN could end up reducing cwnd down to 1 byte, if the
receiver continously sets the ECE flag, resulting in very poor transmission speeds.
In line with RFC6582 App. B, a lower bound of 2 MSS is introduced, as well as a typecast
to prevent any potential integer overflows during intermediate calculation steps of the
adjusted cwnd.
Reported by: Cheng Cui
Reviewed by: tuexen (mentor)
Approved by: tuexen (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D23353
which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets.
Packets with old sequence numbers are ignored and not used to update the send window size.
This might cause the TCP session to hang indefinitely under some circumstances.
Reported by: Cui Cheng
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24515
by not including the SYN bit sequence space in cwnd related calculations.
Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead.
This fixes an off-by-one conformance issue with regular TCP sessions not
using Appropriate Byte Counting (RFC3465), sending one more packet during
the initial window than expected.
PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000
Nexthop objects implementation, defined in r359823,
introduced sys/net/route directory intended to hold all
routing-related code. Move recently-introduced route_temporal.c and
private route_var.h header there.
Differential Revision: https://reviews.freebsd.org/D24597
New fib[46]_lookup() functions support multipath transparently.
Given that, switch the last rtalloc_mpath_fib() calls to
dib4_lookup() and eliminate the function itself.
Note: proper flowid generation (especially for the outbound traffic) is a
bigger topic and will be handled in a separate review.
This change leaves flowid generation intact.
Differential Revision: https://reviews.freebsd.org/D24595
r360292 switched most of the remaining routing customers to a new KPI,
leaving a bunch of wrappers for old routing lookup functions unused.
Remove them from the tree as a part of routing cleanup.
Differential Revision: https://reviews.freebsd.org/D24569
- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
authentication algorithms and keys as well as the initial sequence
number.
- When reading from a socket using KTLS receive, applications must use
recvmsg(). Each successful call to recvmsg() will return a single
TLS record. A new TCP control message, TLS_GET_RECORD, will contain
the TLS record header of the decrypted record. The regular message
buffer passed to recvmsg() will receive the decrypted payload. This
is similar to the interface used by Linux's KTLS RX except that
Linux does not return the full TLS header in the control message.
- Add plumbing to the TOE KTLS interface to request either transmit
or receive KTLS sessions.
- When a socket is using receive KTLS, redirect reads from
soreceive_stream() into soreceive_generic().
- Note that this interface is currently only defined for TLS 1.1 and
1.2, though I believe we will be able to reuse the same interface
and structures for 1.3.
latest rack and bbr in from the NF repo. When those come
in the OOB data handling will be fixed where Skyzaller crashes.
Differential Revision: https://reviews.freebsd.org/D24575
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.
Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D24555
r360292 introduced the wrong order, resulting in returned
nhops not being referenced, despite the fact that references
were requested. That lead to random GPF after using SCTP sockets.
Special defined macro like IPV[46]_SCOPE_GLOBAL will be introduced
soon to reduce the chance of putting arguments in wrong order.
Reported-by: syzbot+5c813c01096363174684@syzkaller.appspotmail.com
This change is build on top of nexthop objects introduced in r359823.
Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.
Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.
Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.
Differential Revision: https://reviews.freebsd.org/D24340
Thanks to Natalie Silvanovich from Google for finding and reporting the
issue found by her in the SCTP userland stack.
MFC after: 3 days
X-MFC with: https://svnweb.freebsd.org/changeset/base/360193
in all cases, by adjust snd_una right after the
connection initialization, to include the one byte
in sequence space occupied by the SYN bit.
This does not change the regular ACK processing,
while making the BYTES_THIS_ACK macro to work properly.
PR: 235256
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D19000
mb_reclaim() calls the protocol drain routines for each protocol in each
domain. Some protocols exist in more than one domain and share drain
routines. In the case of SCTP, it also uses the same drain routine for
its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain.
On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls
sctp_drain() four times. On systems with INET and INET6 defined,
mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree
caller of the pr_drain protocol entry.
Eliminate this duplication by ensuring that each pr_drain routine is only
specified for one protocol entry in one domain.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24418
One of the goals of the new routing KPI defined in r359823 is to
entirely hide`struct rtentry` from the consumers. It will allow to
improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
struct rtentry field accesses.
Additionally, with the followup multipath changes, single rtentry can point
to multiple nexthops.
With that in mind, convert rti_filter callback used when traversing the
routing table to accept pair (rt, nhop) instead of nexthop.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24440
MSS in two steps and try each candidate two times. However, if two
candidates are the same (which is the case in TCP/IPv6), this candidate
was tested four times. This patch ensures that each candidate actually
reduced the MSS and is only tested 2 times. This reduces the time window
of missclassifying a temporary outage as an MTU issue.
Reviewed by: jtl
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24308
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213
Fix panics related to calling code which expects to be running inside
the NET_EPOCH from outside that epoch.
This leads to panics (with INVARIANTS) such as this one:
panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/if_ether.c:373
cpuid = 7
time = 1586095719
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0090819700
vpanic() at vpanic+0x182/frame 0xfffffe0090819750
panic() at panic+0x43/frame 0xfffffe00908197b0
arprequest_internal() at arprequest_internal+0x59e/frame 0xfffffe00908198c0
arp_announce_ifaddr() at arp_announce_ifaddr+0x20/frame 0xfffffe00908198e0
carp_master_down_locked() at carp_master_down_locked+0x10d/frame 0xfffffe0090819910
carp_master_down() at carp_master_down+0x79/frame 0xfffffe0090819940
softclock_call_cc() at softclock_call_cc+0x13f/frame 0xfffffe00908199f0
softclock() at softclock+0x7c/frame 0xfffffe0090819a20
ithread_loop() at ithread_loop+0x279/frame 0xfffffe0090819ab0
fork_exit() at fork_exit+0x80/frame 0xfffffe0090819af0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0090819af0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Widen the NET_EPOCH to cover the relevant (callback / task) code.
Differential Revision: https://reviews.freebsd.org/D24302
This is the foundational change for the routing subsytem rearchitecture.
More details and goals are available in https://reviews.freebsd.org/D24141 .
This patch introduces concept of nexthop objects and new nexthop-based
routing KPI.
Nexthops are objects, containing all necessary information for performing
the packet output decision. Output interface, mtu, flags, gw address goes
there. For most of the cases, these objects will serve the same role as
the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
multiple BGP full-views, as these objects will be shared between routing
entries. This allows to store more information in the nexthop.
New KPI:
struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
These 2 function are intended to replace all all flavours of
<in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous
fib[46]-generation functions.
Upon successful lookup, they return nexthop object which is guaranteed to
exist within current NET_EPOCH. If longer lifetime is desired, one can
specify NHR_REF as a flag and get a referenced version of the nexthop.
Reference semantic closely resembles rtentry one, allowing sed-style conversion.
Additionally, another 2 functions are introduced to support uRPF functionality
inside variety of our firewalls. Their primary goal is to hide the multipath
implementation details inside the routing subsystem, greatly simplifying
firewalls implementation:
int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
embedding and allowing to support IPv4 link-locals in the future.
Structure changes:
* rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
* rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.
Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.
More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232
Reviewed by: ae,glebius(initial version)
Differential Revision: https://reviews.freebsd.org/D24232
The intended change was
sp->next.tqe_next = NULL;
sp->next.tqe_prev = NULL;
which doesn't fix the issue I'm seeing and the committed fix is
not the intended fix due to copy-and-paste.
Thanks a lot to Conrad Meyer for making me aware of the problem.
Reported by: cem
Split their functionality by moving random seed allocation
to SYSINIT and calling (new) generic multipath function from
standard IPv4/IPv5 RIB init handlers.
Differential Revision: https://reviews.freebsd.org/D24356
Before the change, proxyarp checks for src and dst addresses
were performed using default fib, breaking multi-fib scenario.
PR: 245181
Submitted by: Scott Aitken (original version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24244
for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6.
The current blackhole detection might classify a temporary outage as
an MTU issue and reduces permanently the MSS. Since the consequences of
such a reduction due to a misclassification are much more drastically
for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only.
Reviewed by: bcr@ (man page), Richard Scheffenegger
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24219
parameters of the timer start and stop routines. Several inconsistencies
have been fixed in earlier commits. Now they will be catched when running
an INVARIANTS system.
MFC after: 1 week