Commit Graph

6992 Commits

Author SHA1 Message Date
Randall Stewart
b45daaea95 tcp: LRO timestamps have lost their previous precision
Recently we had a rewrite to tcp_lro.c that was tested but one subtle change
was the move to a less precise timestamp. This causes all kinds of chaos
in tcp's that do pacing and needs to be fixed to use the more precise
time that was there before.

Reviewed by: mtuexen, gallatin, hselasky
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30695
2021-06-09 13:58:54 -04:00
Randall Stewart
4747500dea tcp: A better fix for the previously attempted fix of the ack-war issue with tcp.
So it turns out that my fix before was not correct. It ended with us failing
some of the "improved" SYN tests, since we are not in the correct states.
With more digging I have figured out the root of the problem is that when
we receive a SYN|FIN the reassembly code made it so we create a segq entry
to hold the FIN. In the established state where we were not in order this
would be correct i.e. a 0 len with a FIN would need to be accepted. But
if you are in a front state we need to strip the FIN so we correctly handle
the ACK but ignore the FIN. This gets us into the proper states
and avoids the previous ack war.

I back out some of the previous changes but then add a new change
here in tcp_reass() that fixes the root cause of the issue. We still
leave the rack panic fixes in place however.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30627
2021-06-04 05:26:43 -04:00
Mark Johnston
f96603b56f tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY
Prior to commit f161d294b we only checked the sockaddr length, but now
we verify the address family as well.  This breaks at least ttcp.  Relax
the check to avoid breaking compatibility too much: permit AF_UNSPEC if
the address is INADDR_ANY.

Fixes:		f161d294b
Reported by:	Bakul Shah <bakul@iitbombay.org>
Reviewed by:	tuexen
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30539
2021-05-31 18:53:34 -04:00
Lutz Donnerhacke
bec0a5dca7 libalias: Remove LibAliasCheckNewLink
Finally drop the function in 14-CURRENT.

Discussed with: kp
Differential Revision: https://reviews.freebsd.org/D30275
2021-05-31 13:04:11 +02:00
Lutz Donnerhacke
bfd41ba1fe libalias: Remove unused function LibAliasCheckNewLink
The functionality to detect a newly created link after processing a
single packet is decoupled from the packet processing.  Every new
packet is processed asynchronously and will reset the indicator, hence
the function is unusable.  I made a Google search for third party code,
which uses the function, and failed to find one.

That's why the function should be removed: It unusable and unused.
A much simplified API/ABI will remain in anything below 14.

Discussed with:	kp
Reviewed by:	manpages (bcr)
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30275
2021-05-31 12:53:57 +02:00
Wojciech Macek
d40cd26a86 ip_mroute: rework ip_mroute
Approved by:     mw
Obtained from:   Semihalf
Sponsored by:    Stormshield
Differential Revision: https://reviews.freebsd.org/D30354

Changes:
1. add spinlock to bw_meter

If two contexts read and modify bw_meter values
it might happen that these are corrupted.
Guard only code fragments which do read-and-modify.
Context which only do "reads" are not done inside
spinlock block. The only sideffect that can happen is
an 1-p;acket outdated value reported back to userspace.

2. replace all locks with a single RWLOCK

Multiple locks caused a performance issue in routing
hot path, when two of them had to be taken. All locks
were replaced with single RWLOCK which makes the hot
path able to take only shared access to lock most of
the times.
All configuration routines have to take exclusive lock
(as it was done before) but these operation are very rare
compared to packet routing.

3. redesign MFC expire and UPCALL expire

Use generic kthread and cv_wait/cv_signal for deferring
work. Previously, upcalls could be sent from two contexts
which complicated the design. All upcall sending is now
done in a kthread which allows hot path to work more
efficient in some rare cases.

4. replace mutex-guarded linked list with lock free buf_ring

All message and data is now passed over lockless buf_ring.
This allowed to remove some heavy locking when linked
lists were used.
2021-05-31 05:48:15 +02:00
Lutz Donnerhacke
b03a41befe libalias: Fix nameing and initialization of a constant
The commit 189f8eea contains a refactorisation of a constant.  During
later review D30283 the naming of the constant was improved and the
initialization became explicit.  Put this into the tree, in order to
MFC the correct naming.
2021-05-30 15:47:29 +02:00
Randall Stewart
8c69d988a8 tcp: When we have an out-of-order FIN we do want to strip off the FIN bit.
The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr).
However there was a missing case, i.e. where we get an out-of-order FIN by itself.
In such a case we don't want to leave the FIN bit set, otherwise we will do the
wrong thing and ack the FIN incorrectly. Instead we need to go through the
tcp_reasm() code and that way the FIN will be stripped and all will be well.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30497
2021-05-27 10:50:32 -04:00
Richard Scheffenegger
c358f1857f tcp: Use local CC data only in the correct context
Most CC algos do use local data, and when calling
newreno_cong_signal from there, the latter misinterprets
the data as its own struct, leading to incorrect behavior.

Reported by:  chengc_netapp.com
Reviewed By:  chengc_netapp.com, tuexen, #transport
MFC after:    3 days
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30470
2021-05-26 20:15:53 +02:00
Randall Stewart
4f3addd94b tcp: Add a socket option to rack so we can test various changes to the slop value in timers.
Timer_slop, in TCP, has been 200ms for a long time. This value dates back
a long time when delayed ack timers were longer and links were slower. A
200ms timer slop allows 1 MSS to be sent over a 60kbps link. Its possible that
lowering this value to something more in line with todays delayed ack values (40ms)
might improve TCP. This bit of code makes it so rack can, via a socket option,
adjust the timer slop.

Reviewed by: mtuexen
Sponsered by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30249
2021-05-26 06:43:30 -04:00
Andrew Gallatin
086a35562f tcp: enter network epoch when calling tfb_tcp_fb_fini
We need to enter the network epoch when calling into
tfb_tcp_fb_fini.  I noticed this when I hit an assert
running the latest rack

Differential Revision: https://reviews.freebsd.org/D30407
Reviewed by: rrs, tuexen
Sponsored by: Netflix
2021-05-25 13:45:37 -04:00
Randall Stewart
13c0e198ca tcp: Fix bugs related to the PUSH bit and rack and an ack war
Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed
in the last commit. The problem is the left edge gets transmitted before the adjustments are done
to the send_map, this means that right edge bits must be considered to be added only if
the entire RSM is being retransmitted.

Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns
out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself.
After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically
what happens is we go into the reassembly code and lose the FIN bit. The trick here is we
should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything.
That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no
timers running. This is because the usrclosed function gets called and the FIN's and such have
already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get
stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly
recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE
before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2
timer can speed this up in testing.

Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30451
2021-05-25 13:23:31 -04:00
Michael Tuexen
9bbd1a8fcb tcp: fix a RACK socket buffer lock issue
Fix a missing socket buffer unlocking of the socket receive buffer.

Reviewed by:		gallatin, rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D30402
2021-05-24 20:31:23 +02:00
Randall Stewart
631449d5d0 tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's
The push bit itself was also not actually being properly moved to
the right edge. The FIN bit was incorrectly on the left edge. We
fix these two issues as well as plumb in the mtu_change for
alternate stacks.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30413
2021-05-24 14:42:15 -04:00
Zhenlei Huang
03b0505b8f ip_forward: Restore RFC reference
Add RFC reference lost in 3d846e4822

PR:		255388
Reviewed By:	rgrimes, donner, karels, marcus, emaste
MFC after:	27 days
Differential Revision: https://reviews.freebsd.org/D30374
2021-05-23 00:01:37 +02:00
Michael Tuexen
8923ce6304 tcp: Handle stack switch while processing socket options
Handle the case where during socket option processing, the user
switches a stack such that processing the stack specific socket
option does not make sense anymore. Return an error in this case.

MFC after:		1 week
Reviewed by:		markj
Reported by:		syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com
Sponsored by:		Netflix, Inc.
Differential revision:	https://reviews.freebsd.org/D30395
2021-05-22 14:39:36 +02:00
Richard Scheffenegger
3975688563 rack: honor prior socket buffer lock when doing the upcall
While partially reverting D24237 with D29690, due to introducing some
unintended effects for in-kernel TCP consumers, the preexisting lock
on the socket send buffer was not considered properly.

Found by: markj
MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30390
2021-05-22 00:09:59 +02:00
Mark Johnston
7d2608a5d2 tcp: Make error handling in tcp_usr_send() more consistent
- Free the input mbuf in a single place instead of in every error path.
- Handle PRUS_NOTREADY consistently.
- Flush the socket's send buffer if an implicit connect fails.  At that
  point the mbuf has already been enqueued but we don't want to keep it
  in the send buffer.

Reviewed by:	gallatin, tuexen
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30349
2021-05-21 17:45:18 -04:00
Richard Scheffenegger
032bf749fd [tcp] Keep socket buffer locked until upcall
r367492 would unlock the socket buffer before eventually calling the upcall.
This leads to problematic interaction with NFS kernel server/client components
(MP threads) accessing the socket buffer with potentially not correctly updated
state.

Reported by: rmacklem
Reviewed By: tuexen, #transport
Tested by: rmacklem, otis
MFC after: 2 weeks
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29690
2021-05-21 11:07:51 +02:00
Michael Tuexen
500eb6dd80 tcp: Fix sending of TCP segments with IP level options
When bringing in TCP over UDP support in
https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605,
the length of IP level options was considered when locating the
transport header. This was incorrect and is fixed by this patch.

X-MFC with:		https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605
MFC after:		3 days
Reviewed by:		markj, rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D30358
2021-05-21 09:49:45 +02:00
Wojciech Macek
eedbbec3fd ip_mroute: remove unused declarations
fix build for non-x86 targets
2021-05-21 08:01:26 +02:00
Wojciech Macek
741afc6233 ip_mroute: refactor bw_meter API
API should work as following:
- periodicaly report Lower-or-EQual bandwidth (LEQ) connections
  over kernel socket, if user application registered for such
  per-flow notifications
- report Grater-or-EQual (GEQ) bandwidth as soon as it reaches
  specified value in configured time window

Custom implementation of callouts was removed. There is no
point of doing calout-wheel here as generic callouts are
doing exactly the same. The performance is not critical
for such reporting, so the biggest concern should be
to have a code which can be easily maintained.

This is ia preparation for locking rework which is highly inefficient.

Approved by:    mw
Sponsored by:   Stormshield
Obtained from:  Semihalf
Differential Revision:  https://reviews.freebsd.org/D30210
2021-05-21 06:43:41 +02:00
Wojciech Macek
787845c0e8 Revert "ip_mroute: refactor bw_meter API"
This reverts commit d1cd99b147.
2021-05-20 12:14:58 +02:00
Wojciech Macek
d1cd99b147 ip_mroute: refactor bw_meter API
API should work as following:
- periodicaly report Lower-or-EQual bandwidth (LEQ) connections
  over kernel socket, if user application registered for such
  per-flow notifications
- report Grater-or-EQual (GEQ) bandwidth as soon as it reaches
  specified value in configured time window

Custom implementation of callouts was removed. There is no
point of doing calout-wheel here as generic callouts are
doing exactly the same. The performance is not critical
for such reporting, so the biggest concern should be
to have a code which can be easily maintained.

This is ia preparation for locking rework which is highly inefficient.

Approved by:    mw
Sponsored by:   Stormshield
Obtained from:  Semihalf
Differential Revision:  https://reviews.freebsd.org/D30210
2021-05-20 10:13:55 +02:00
Zhenlei Huang
3d846e4822 Do not forward datagrams originated by link-local addresses
The current implement of ip_input() reject packets destined for
169.254.0.0/16, but not those original from 169.254.0.0/16 link-local
addresses.

Fix to fully respect RFC 3927 section 2.7.

PR:		255388
Reviewed by:	donner, rgrimes, karels
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D29968
2021-05-18 22:59:46 +02:00
Lutz Donnerhacke
2e6b07866f libalias: Ensure ASSERT behind varable declarations
At some places the ASSERT was inserted before variable declarations are
finished.  This is fixed now.

Reported by:	kib
Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30282
2021-05-16 02:28:36 +02:00
Lutz Donnerhacke
189f8eea13 libalias: replace placeholder with static constant
The field nullAddress in struct libalias is never set and never used.
It exists as a placeholder for an unused argument only.

Reviewed by:	hselasky
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D30253
2021-05-15 09:05:30 +02:00
Lutz Donnerhacke
effc8e57fb libalias: Style cleanup
libalias is a convolut of various coding styles modified by a series
of different editors enforcing interesting convetions on spacing and
comments.

This patch is a baseline to start with a perfomance rework of
libalias.  Upcoming patches should be focus on the code, not on the
style.  That's why most annoying style errors should be fixed
beforehand.

Reviewed by:	hselasky
Discussed by:	emaste
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D30259
2021-05-15 08:57:55 +02:00
Randall Stewart
02cffbc250 tcp: Incorrect KASSERT causes a panic in rack
Skyzall found an interesting panic in rack. When a SYN and FIN are
both sent together a KASSERT gets tripped where it is validating that
a mbuf pointer is in the sendmap. But a SYN and FIN often will not
have a mbuf pointer. So the fix is two fold a) make sure that the
SYN and FIN split the right way when cloning an RSM SYN on left
edge and FIN on right. And also make sure the KASSERT properly
accounts for the case that we have a SYN or FIN so we don't
panic.

Reviewed by: mtuexen
Sponsored by: Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D30241
2021-05-13 07:36:04 -04:00
Michael Tuexen
eec6aed5b8 sctp: fix another locking bug in COOKIE handling
Thanks to Tolya Korniltsev for reporting the issue for
the userland stack and testing the fix.

MFC after:	3 days
2021-05-12 23:05:28 +02:00
Mark Johnston
d8acd2681b Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about
freeing mbufs in error paths.  In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.

This diff plugs various leaks in the pru_send implementations.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30151
2021-05-12 13:00:09 -04:00
Michael Tuexen
251842c639 tcp rack: improve initialisation of retransmit timeout
When the TCP is in the front states, don't take the slop variable
into account. This improves consistency with the base stack.

Reviewed by:		rrs@
Differential Revision:	https://reviews.freebsd.org/D30230
MFC after:		1 week
Sponsored by:		Netflix, Inc.
2021-05-12 18:02:21 +02:00
Michael Tuexen
12dda000ed sctp: fix locking in case of error handling during a restart
Thanks to Taylor Brandstetter for finding the issue and providing
a patch for the userland stack.

MFC after:	3 days
2021-05-12 15:29:06 +02:00
Randall Stewart
4b86a24a76 tcp: In rack, we must only convert restored rtt when the hostcache does restore them.
Rack now after the previous commit is very careful to translate any
value in the hostcache for srtt/rttvar into its proper format. However
there is a snafu here in that if tp->srtt is 0 is the only time that
the HC will actually restore the srtt. We need to then only convert
the srtt restored when it is actually restored. We do this by making
sure it was zero before the call to cc_conn_init and it is non-zero
afterwards.

Reviewed by:	Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30213
2021-05-11 08:15:05 -04:00
Wojciech Macek
0b103f7237 mrouter: do not loopback packets unconditionally
Looping back router multicast traffic signifficantly
stresses network stack. Add possibility to disable or enable
loopbacked based on sysctl value.

Reported by:    Daniel Deville
Reviewed by:	mw
Differential Revision:	https://reviews.freebsd.org/D29947
2021-05-11 12:36:07 +02:00
Wojciech Macek
65634ae748 mroute: fix race condition during mrouter shutting down
There is a race condition between V_ip_mrouter de-init
    and ip_mforward handling. It might happen that mrouted
    is cleaned up after V_ip_mrouter check and before
    processing packet in ip_mforward.
    Use epoch call aproach, similar to IPSec which also handles
    such case.

Reported by:    Damien Deville
Obtained from:	Stormshield
Reviewed by:	mw
Differential Revision:	https://reviews.freebsd.org/D29946
2021-05-11 12:34:20 +02:00
Richard Scheffenegger
0471a8c734 tcp: SACK Lost Retransmission Detection (LRD)
Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931
2021-05-10 19:06:20 +02:00
Randall Stewart
9867224bab tcp:Host cache and rack ending up with incorrect values.
The hostcache up to now as been updated in the discard callback
but without checking if we are all done (the race where there are
more than one calls and the counter has not yet reached zero). This
means that when the race occurs, we end up calling the hc_upate
more than once. Also alternate stacks can keep there srtt/rttvar
in different formats (example rack keeps its values in microseconds).
Since we call the hc_update *before* the stack fini() then the
values will be in the wrong format.

Rack on the other hand, needs to convert items pulled from the
hostcache into its internal format else it may end up with
very much incorrect values from the hostcache. In the process
lets commonize the update mechanism for srtt/rttvar since we
now have more than one place that needs to call it.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30172
2021-05-10 11:25:51 -04:00
Randall Stewart
5a4333a537 This takes Warners suggested approach to making it so that
platforms that for whatever reason cannot include the RATELIMIT option
can still work with rack. It adds two dummy functions that rack will
call and find out that the highest hw supported b/w is 0 (which
kinda makes sense and rack is already prepared to handle).

Reviewed by: Michael Tuexen, Warner Losh
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30163
2021-05-07 17:32:32 -04:00
Mark Johnston
a1fadf7de2 divert: Fix mbuf ownership confusion in div_output()
div_output_outbound() and div_output_inbound() relied on the caller to
free the mbuf if an error occurred.  However, this is contrary to the
semantics of their callees, ip_output(), ip6_output() and
netisr_queue_src(), which always consume the mbuf.  So, if one of these
functions returned an error, that would get propagated up to
div_output(), resulting in a double free.

Fix the problem by making div_output_outbound() and div_output_inbound()
responsible for freeing the mbuf in all cases.

Reported by:	Michael Schmiedgen <schmiedgen@gmx.net>
Tested by:	Michael Schmiedgen
Reviewed by:	donner
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30129
2021-05-07 14:31:08 -04:00
Randall Stewart
a16cee0218 Fix a UDP tunneling issue with rack. Basically there are two
issues.
A) Not enough hdrlen was being calculated when a UDP tunnel is
   in place.
and
B) Not enough memory is allocated in racks fsb. We need to
   overbook the fsb to include a udphdr just in case.

Submitted by: Peter Lei
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30157
2021-05-07 14:06:43 -04:00
Gleb Smirnoff
be578b67b5 tcp_twcheck(): use correct unlock macro.
This crippled in due to conflict between two last commits 1db08fbe3f
and 9e644c2300.

Submitted by:	Peter Lei
2021-05-06 10:19:21 -07:00
Randall Stewart
5d8fd932e4 This brings into sync FreeBSD with the netflix versions of rack and bbr.
This fixes several breakages (panics) since the tcp_lro code was
committed that have been reported. Quite a few new features are
now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents
comming soon on rack..

Sponsored by:           Netflix
Reviewed by:		rscheff, mtuexen
Differential Revision:	https://reviews.freebsd.org/D30036
2021-05-06 11:22:26 -04:00
Michael Tuexen
d1cb8d11b0 sctp: improve consistency when handling chunks of wrong size
MFC after:	3 days
2021-05-06 01:02:41 +02:00
Mark Johnston
6c34dde83e igmp: Avoid an out-of-bounds access when zeroing counters
When verifying, byte-by-byte, that the user-supplied counters are
zero-filled, sysctl_igmp_stat() would check for zero before checking the
loop bound.  Perform the checks in the correct order.

Reported by:	KASAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-05-05 17:12:51 -04:00
Marko Zec
2aca58e16f Introduce DXR as an IPv4 longest prefix matching / FIB module
DXR maintains compressed lookup structures with a trivial search
procedure.  A two-stage trie is indexed by the more significant bits of
the search key (IPv4 address), while the remaining bits are used for
finding the next hop in a sorted array.  The tradeoff between memory
footprint and search speed depends on the split between the trie and
the remaining binary search.  The default of 20 bits of the key being
used for trie indexing yields good performance (see below) with
footprints of around 2.5 Bytes per prefix with current BGP snapshots.

Rebuilding lookup structures takes some time, which is compensated for by
batching several RIB change requests into a single FIB update, i.e. FIB
synchronization with the RIB may be delayed for a fraction of a second.
RIB to FIB synchronization, next-hop table housekeeping, and lockless
lookup capability is provided by the FIB_ALGO infrastructure.

DXR works well on modern CPUs with several MBytes of caches, especially
in VMs, where is outperforms other currently available IPv4 FIB
algorithms by a large margin.

Synthetic single-thread LPM throughput test method:

kldload test_lookup; kldload dpdk_lpm4; kldload fib_dxr
sysctl net.route.test.run_lps_rnd=N
sysctl net.route.test.run_lps_seq=N

where N is the number of randomly generated keys (IPv4 addresses) which
should be chosen so that each test iteration runs for several seconds.

Each reported score represents the best of three runs, in million
lookups per second (MLPS), for two bechmarks (RND & SEQ) with two FIBs:

host: single interface address, local subnet route + default route
BGP: snapshot from linx.routeviews.org, 887957 prefixes, 496 next hops

Bhyve VM on an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             40.6         20.2         N/A         N/A
radix4                7.8          3.8         1.2         0.6
radix4_lockless      18.0          9.0         1.6         0.8
dpdk_lpm4            14.4          5.0        14.6         5.0
dxr                  70.3         34.7        43.0        19.5

Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             47.0         23.1         N/A         N/A
radix4                8.5          4.2         1.9         1.0
radix4_lockless      19.2          9.5         2.5         1.2
dpdk_lpm4            31.2          9.4        31.6         9.3
dxr                  84.9         41.4        51.7        23.6

Intel(R) Core(TM) i7-4771 CPU @ 3.50 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             59.5         29.4         N/A         N/A
radix4               10.8          5.5         2.5         1.3
radix4_lockless      24.7         12.0         3.1         1.6
dpdk_lpm4            29.1          9.0        30.2         9.1
dxr                 101.3         49.9        69.8        32.5

AMD Ryzen 7 3700X 8-Core Processor @ 3.60 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             70.8         35.4         N/A         N/A
radix4               14.4          7.2         2.8         1.4
radix4_lockless      30.2         15.1         3.7         1.8
dpdk_lpm4            29.9          9.0        30.0         8.9
dxr                 163.3         81.5        99.5        44.4

AMD Ryzen 5 5600X 6-Core Processor @ 3.70 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             93.6         46.7         N/A         N/A
radix4               18.9          9.3         4.3         2.1
radix4_lockless      37.2         18.6         5.3         2.7
dpdk_lpm4            51.8         15.1        51.6        14.9
dxr                 218.2        103.3       114.0        49.0

Reviewed by:	melifaro
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D29821
2021-05-05 13:45:52 +02:00
Michael Tuexen
b621fbb1bf sctp: drop packet with SHUTDOWN-ACK chunks with wrong vtags
MFC after:	3 days
2021-05-04 18:43:31 +02:00
Mark Johnston
f161d294b9 Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input.  In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur.  Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.

Reported by:	syzkaller+KASAN
Reviewed by:	tuexen, melifaro
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29519
2021-05-03 13:35:19 -04:00
Michael Tuexen
8b3d0f6439 sctp: improve address list scanning
If the alternate address has to be removed, force the stack to
find a new one, if it is still needed.

MFC after:	3 days
2021-05-03 02:50:05 +02:00
Michael Tuexen
a89481d328 sctp: improve restart handling
This fixes in particular a possible use after free bug reported
Anatoly Korniltsev and Taylor Brandstetter for the userland stack.

MFC after:	3 days
2021-05-03 02:20:24 +02:00
Alexander Motin
655c200cc8 Fix build after 5f2e183505. 2021-05-02 20:07:38 -04:00
Michael Tuexen
5f2e183505 sctp: improve error handling in INIT/INIT-ACK processing
When processing INIT and INIT-ACK information, also during
COOKIE processing, delete the current association, when it
would end up in an inconsistent state.

MFC after:	3 days
2021-05-02 22:41:35 +02:00
Michael Tuexen
e010d20032 sctp: update the vtag for INIT and INIT-ACK chunks
This is needed in case of responding with an ABORT to an INIT-ACK.
2021-04-30 13:33:16 +02:00
Michael Tuexen
eb79855920 sctp: fix SCTP_PEER_ADDR_PARAMS socket option
Ignore spp_pathmtu if it is 0, when setting the IPPROTO_SCTP level
socket option SCTP_PEER_ADDR_PARAMS as required by RFC 6458.

MFC after:	1 week
2021-04-30 12:31:09 +02:00
Michael Tuexen
eecdf5220b sctp: use RTO.Initial of 1 second as specified in RFC 4960bis 2021-04-30 00:45:56 +02:00
Michael Tuexen
9de7354bb8 sctp: improve consistency in handling chunks with wrong size
Just skip the chunk, if no other handling is required by the
specification.
2021-04-28 18:11:06 +02:00
Richard Scheffenegger
48be5b976e tcp: stop spurious rescue retransmissions and potential asserts
Reported by: pho@
MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29970
2021-04-28 15:01:10 +02:00
Michael Tuexen
059ec2225c sctp: cleanup verification of INIT and INIT-ACK chunks 2021-04-27 12:45:43 +02:00
Michael Tuexen
c70d1ef15d sctp: improve handling of illegal packets containing INIT chunks
Stop further processing of a packet when detecting that it
contains an INIT chunk, which is too small or is not the only
chunk in the packet. Still allow to finish the processing
of chunks before the INIT chunk.

Thanks to Antoly Korniltsev and Taylor Brandstetter for reporting
an issue with the userland stack, which made me aware of this
issue.

MFC after:	3 days
2021-04-26 10:43:58 +02:00
Michael Tuexen
163153c2a0 sctp: small cleanup, no functional change
MFC:		3 days
2021-04-26 02:56:48 +02:00
Hans Petter Selasky
a9b66dbd91 Allow the tcp_lro_flush_all() function to be called when the control
structure is zeroed, by setting the VNET after checking the mbuf count
for zero. It appears there are some cases with early interrupts on some
network devices which still trigger page-faults on accessing a NULL "ifp"
pointer before the TCP LRO control structure has been initialized.
This basically preserves the old behaviour, prior to
9ca874cf74 .

No functional change.

Reported by:	rscheff@
Differential Revision:	https://reviews.freebsd.org/D29564
MFC after:	2 weeks
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-04-24 12:23:42 +02:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Navdeep Parhar
01d74fe1ff Path MTU discovery hooks for offloaded TCP connections.
Notify the TOE driver when when an ICMP type 3 code 4 (Fragmentation
needed and DF set) message is received for an offloaded connection.
This gives the driver an opportunity to lower the path MTU for the
connection and resume transmission, much like what the kernel does for
the connections that it handles.

Reviewed by:	glebius@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D29755
2021-04-21 13:00:16 -07:00
Mark Johnston
652908599b Add required checks for unmapped mbufs in ipdivert and ipfw
Also add an M_ASSERTMAPPED() macro to verify that all mbufs in the chain
are mapped.  Use it in ipfw_nat, which operates on a chain returned by
m_megapullup().

PR:		255164
Reviewed by:	ae, gallatin
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29838
2021-04-21 15:47:05 -04:00
Gleb Smirnoff
d554522f6e tcp_hostcache: use SMR for lookups, mutex(9) for updates.
In certain cases, e.g. a SYN-flood from a limited set of hosts,
the TCP hostcache becomes the main contention point. To solve
that, this change introduces lockless lookups on the hostcache.

The cache remains a hash, however buckets are now CK_SLIST. For
updates a bucket mutex is obtained, for read an SMR section is
entered.

Reviewed by:	markj, rscheff
Differential revision:	https://reviews.freebsd.org/D29729
2021-04-20 10:02:20 -07:00
Gleb Smirnoff
1db08fbe3f tcp_input: always request read-locking of PCB for any pure SYN segment.
This is further rework of 08d9c92027.  Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.

This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.
2021-04-20 10:02:20 -07:00
Gleb Smirnoff
7b5053ce22 tcp_input: remove comments and assertions about tcpbinfo locking
They aren't valid since d40c0d47cd.
2021-04-20 10:02:20 -07:00
Richard Scheffenegger
a649f1f6fd tcp: Deal with DSACKs, and adjust rescue hole on success.
When a rescue retransmission is successful, rather than
inserting new holes to the left of it, adjust the old
rescue entry to cover the missed sequence space.

Also, as snd_fack may be stale by that point, pull it forward
in order to never create a hole left of snd_una/th_ack.

Finally, with DSACKs, tcp_sack_doack() may be called
with new full ACKs but a DSACK block. Account for this
eventuality properly to keep sacked_bytes >= 0.

MFC after: 3 days
Reviewed By: kbowling, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29835
2021-04-20 14:54:28 +02:00
Hans Petter Selasky
9ca874cf74 Add TCP LRO support for VLAN and VxLAN.
This change makes the TCP LRO code more generic and flexible with regards
to supporting multiple different TCP encapsulation protocols and in general
lays the ground for broader TCP LRO support. The main job of the TCP LRO code is
to merge TCP packets for the same flow, to reduce the number of calls to upper
layers. This reduces CPU and increases performance, due to being able to send
larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it
possible to avoid per-packet interaction by the host CPU.

Because the current TCP LRO code was tightly bound and optimized for TCP/IP
over ethernet only, several larger changes were needed. Also a minor bug was
fixed in the flushing mechanism for inactive entries, where the expire time,
"le->mtime" was not always properly set.

To avoid having to re-run time consuming regression tests for every change,
it was chosen to squash the following list of changes into a single commit:
- Refactor parsing of all address information into the "lro_parser" structure.
  This easily allows to reuse parsing code for inner headers.
- Speedup header data comparison. Don't compare field by field, but
  instead use an unsigned long array, where the fields get packed.
- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
  recursivly, only applying deltas as the result of updating payload data.
- Make smaller inline functions doing one operation at a time instead of
  big functions having repeated code.
- Refactor the TCP ACK compression code to only execute once
  per TCP LRO flush. This gives a minor performance improvement and
  keeps the code simple.
- Use sbintime() for all time-keeping. This change also fixes flushing
  of inactive entries.
- Try to shrink the size of the LRO entry, because it is frequently zeroed.
- Removed unused TCP LRO macros.
- Cleanup unused TCP LRO statistics counters while at it.
- Try to use __predict_true() and predict_false() to optimise CPU branch
  predictions.

Bump the __FreeBSD_version due to changing the "lro_ctrl" structure.

Tested by:	Netflix
Reviewed by:	rrs (transport)
Differential Revision:	https://reviews.freebsd.org/D29564
MFC after:	2 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-04-20 13:36:22 +02:00
Gleb Smirnoff
faa9ad8a90 Fix off-by-one error in KASSERT from 02f26e98c7. 2021-04-19 17:20:19 -07:00
Richard Scheffenegger
b87cf2bc84 tcp: keep SACK scoreboard sorted when doing rescue retransmission
Reviewed By: tuexen, kbowling, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29825
2021-04-18 23:11:10 +02:00
Michael Tuexen
9e644c2300 tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D29469
2021-04-18 16:16:42 +02:00
Richard Scheffenegger
2e97826052 rack: Fix ECN on finalizing session.
Maintain code similarity between RACK and base stack
for ECN. This may not strictly be necessary, depending
when a state transition to FIN_WAIT_1 is done in RACK
after a shutdown() or close() syscall.

MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29658
2021-04-17 20:16:42 +02:00
Richard Scheffenegger
d1de2b05a0 tcp: Rename rfc6675_pipe to sack.revised, and enable by default
As full support of RFC6675 is in place, deprecating
net.inet.tcp.rfc6675_pipe and enabling by default
net.inet.tcp.sack.revised.

Reviewed By: #transport, kbowling, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28702
2021-04-17 14:59:45 +02:00
Gleb Smirnoff
86046cf55f tcp_respond(): fix assertion, should have been done in 08d9c92027. 2021-04-16 15:39:51 -07:00
Gleb Smirnoff
cb8d7c44d6 tcp_syncache: add net.inet.tcp.syncache.see_other sysctl
A security feature from c06f087ccb appeared to be a huge bottleneck
under SYN flood. To mitigate that add a sysctl that would make
syncache(4) globally visible, ignoring UID/GID, jail(2) and mac(4)
checks. When turned on, we won't need to call crhold() on the listening
socket credential for every incoming SYN packet.

Reviewed by:	bz
2021-04-15 15:26:48 -07:00
John Baldwin
774c4c82ff TOE: Use a read lock on the PCB for syncache_add().
Reviewed by:	np, glebius
Fixes:		08d9c92027
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D29739
2021-04-13 16:31:04 -07:00
Gleb Smirnoff
8d5719aa74 syncache: simplify syncache_add() KPI to return struct socket pointer
directly, not overwriting the listen socket pointer argument.
Not a functional change.
2021-04-12 08:27:40 -07:00
Gleb Smirnoff
08d9c92027 tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D29576
2021-04-12 08:25:31 -07:00
Alexander V. Chernikov
c3a456defa Always use inp fib in the inp_lookup_mcast_ifp().
inp_lookup_mcast_ifp() is static and is only used in the inp_join_group().
The latter function is also static, and is only used in the inp_setmoptions(),
 which relies on inp being non-NULL.

As a result, in the current code, inp_lookup_mcast_ifp() is always called
 with non-NULL inp. Eliminate unused RT_DEFAULT_FIB condition and always
 use inp fib instead.

Differential Revision:	https://reviews.freebsd.org/D29594
Reviewed by:		kp
MFC after:		2 weeks
2021-04-10 13:47:49 +00:00
Gleb Smirnoff
1a7fe55ab8 tcp_hostcache: make THC_LOCK/UNLOCK macros to work with hash head pointer.
Not a functional change.
2021-04-09 14:07:35 -07:00
Gleb Smirnoff
4f49e3382f tcp_hostcache: style(9)
Reviewed by:	rscheff
2021-04-09 14:07:27 -07:00
Gleb Smirnoff
7c71f3bd6a tcp_hostcache: remove extraneous check.
All paths leading here already checked this setting.

Reviewed by:	rscheff
2021-04-09 14:07:19 -07:00
Gleb Smirnoff
0c25bf7e7c tcp_hostcache: implement tcp_hc_updatemtu() via tcp_hc_update.
Locking changes are planned here, and without this change too
much copy-and-paste would be between these two functions.

Reviewed by:	rscheff
2021-04-09 14:06:44 -07:00
Richard Scheffenegger
b878ec024b tcp: Use jenkins_hash32() in hostcache
As other parts of the base tcp stack (eg.
tcp fastopen) already use jenkins_hash32,
and the properties appear reasonably good,
switching to use that.

Reviewed By: tuexen, #transport, ae
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29515
2021-04-08 20:29:19 +02:00
Gleb Smirnoff
373ffc62c1 tcp_hostcache.c: remove unneeded includes.
Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
29acb54393 tcp_hostcache: add bool argument for tcp_hc_lookup() to tell are we
looking to only read from the result, or to update it as well.
For now doesn't affect locking, but allows to push stats and expire
update into single place.

Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
489bde5753 tcp_hostcache: hide rmx_hits/rmx_updates under ifdef.
They have little value unless you do some profiling investigations,
but they are performance bottleneck.

Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
2cca4c0ee0 Remove tcp_hostcache.h. Everything is private.
Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Richard Scheffenegger
90cca08e91 tcp: Prepare PRR to work with NewReno LossRecovery
Add proper PRR vnet declarations for consistency.
Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation
for it to deal with non-SACK window reduction (after loss).

No functional change.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29440
2021-04-08 19:16:31 +02:00
Richard Scheffenegger
9f2eeb0262 [tcp] Fix ECN on finalizing sessions.
A subtle oversight would subtly change new data packets
sent after a shutdown() or close() call, while the send
buffer is still draining.

MFC after: 3 days
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29616
2021-04-08 15:26:09 +02:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Richard Scheffenegger
a04906f027 fix typo in 38ea2bd069 2021-04-02 20:34:33 +02:00
Richard Scheffenegger
38ea2bd069 Use sbuf_drain unconditionally
After making sbuf_drain safe for external use,
there is no need to protect the call.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29545
2021-04-02 20:27:46 +02:00
Richard Scheffenegger
9aef4e7c2b tcp: Shouldn't drain empty sbuf
MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29524
2021-04-01 17:18:38 +02:00
Richard Scheffenegger
02f26e98c7 tcp: Add hash histogram output and validate bucket length accounting
Provide a histogram output to check, if the hashsize or
bucketlimit could be optimized. Also add some basic sanity
checks around the accounting of the hash utilization.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29506
2021-04-01 14:44:14 +02:00
Richard Scheffenegger
529a2a0f27 tcp: For hostcache performance, use atomics instead of counters
As accessing the tcp hostcache happens frequently on some
classes of servers, it was recommended to use atomic_add/subtract
rather than (per-CPU distributed) counters, which have to be
summed up at high cost to cache efficiency.

PR: 254333
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Reviewed By: #transport, tuexen, jtl
Differential Revision: https://reviews.freebsd.org/D29522
2021-04-01 10:03:30 +02:00
Richard Scheffenegger
95e56d31e3 tcp: Make hostcache.cache_count MPSAFE by using a counter_u64_t
Addressing the underlying root cause for cache_count to
show unexpectedly high  values, by protecting all arithmetic on
that global variable by using counter(9).

PR:		254333
Reviewed By: tuexen, #transport
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29510
2021-03-31 20:24:13 +02:00
Richard Scheffenegger
869880463c tcp: drain tcp_hostcache_list in between per-bucket locks
Explicitly drain the sbuf after completing each hash bucket
to minimize the work performed while holding the hash
bucket lock.

PR:		254333
MFC after:	2 weeks
Reviewed By:	tuexen, jhb, #transport
Sponsored by: 	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29483
2021-03-31 19:24:21 +02:00
Andrey V. Elsukov
c80a4b76ce ipdivert: check that PCB is still valid after taking INPCB_RLOCK.
We are inspecting PCBs of divert sockets under NET_EPOCH section,
but PCB could be already detached and we should check INP_FREED flag
when we took INP_RLOCK.

PR:		254478
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29420
2021-03-30 12:31:09 +03:00