Commit Graph

6998 Commits

Author SHA1 Message Date
Mark Johnston
6c34dde83e igmp: Avoid an out-of-bounds access when zeroing counters
When verifying, byte-by-byte, that the user-supplied counters are
zero-filled, sysctl_igmp_stat() would check for zero before checking the
loop bound.  Perform the checks in the correct order.

Reported by:	KASAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-05-05 17:12:51 -04:00
Marko Zec
2aca58e16f Introduce DXR as an IPv4 longest prefix matching / FIB module
DXR maintains compressed lookup structures with a trivial search
procedure.  A two-stage trie is indexed by the more significant bits of
the search key (IPv4 address), while the remaining bits are used for
finding the next hop in a sorted array.  The tradeoff between memory
footprint and search speed depends on the split between the trie and
the remaining binary search.  The default of 20 bits of the key being
used for trie indexing yields good performance (see below) with
footprints of around 2.5 Bytes per prefix with current BGP snapshots.

Rebuilding lookup structures takes some time, which is compensated for by
batching several RIB change requests into a single FIB update, i.e. FIB
synchronization with the RIB may be delayed for a fraction of a second.
RIB to FIB synchronization, next-hop table housekeeping, and lockless
lookup capability is provided by the FIB_ALGO infrastructure.

DXR works well on modern CPUs with several MBytes of caches, especially
in VMs, where is outperforms other currently available IPv4 FIB
algorithms by a large margin.

Synthetic single-thread LPM throughput test method:

kldload test_lookup; kldload dpdk_lpm4; kldload fib_dxr
sysctl net.route.test.run_lps_rnd=N
sysctl net.route.test.run_lps_seq=N

where N is the number of randomly generated keys (IPv4 addresses) which
should be chosen so that each test iteration runs for several seconds.

Each reported score represents the best of three runs, in million
lookups per second (MLPS), for two bechmarks (RND & SEQ) with two FIBs:

host: single interface address, local subnet route + default route
BGP: snapshot from linx.routeviews.org, 887957 prefixes, 496 next hops

Bhyve VM on an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             40.6         20.2         N/A         N/A
radix4                7.8          3.8         1.2         0.6
radix4_lockless      18.0          9.0         1.6         0.8
dpdk_lpm4            14.4          5.0        14.6         5.0
dxr                  70.3         34.7        43.0        19.5

Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             47.0         23.1         N/A         N/A
radix4                8.5          4.2         1.9         1.0
radix4_lockless      19.2          9.5         2.5         1.2
dpdk_lpm4            31.2          9.4        31.6         9.3
dxr                  84.9         41.4        51.7        23.6

Intel(R) Core(TM) i7-4771 CPU @ 3.50 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             59.5         29.4         N/A         N/A
radix4               10.8          5.5         2.5         1.3
radix4_lockless      24.7         12.0         3.1         1.6
dpdk_lpm4            29.1          9.0        30.2         9.1
dxr                 101.3         49.9        69.8        32.5

AMD Ryzen 7 3700X 8-Core Processor @ 3.60 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             70.8         35.4         N/A         N/A
radix4               14.4          7.2         2.8         1.4
radix4_lockless      30.2         15.1         3.7         1.8
dpdk_lpm4            29.9          9.0        30.0         8.9
dxr                 163.3         81.5        99.5        44.4

AMD Ryzen 5 5600X 6-Core Processor @ 3.70 GHz:
inet.algo         host, RND    host, SEQ    BGP, RND    BGP, SEQ
bsearch4             93.6         46.7         N/A         N/A
radix4               18.9          9.3         4.3         2.1
radix4_lockless      37.2         18.6         5.3         2.7
dpdk_lpm4            51.8         15.1        51.6        14.9
dxr                 218.2        103.3       114.0        49.0

Reviewed by:	melifaro
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D29821
2021-05-05 13:45:52 +02:00
Michael Tuexen
b621fbb1bf sctp: drop packet with SHUTDOWN-ACK chunks with wrong vtags
MFC after:	3 days
2021-05-04 18:43:31 +02:00
Mark Johnston
f161d294b9 Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input.  In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur.  Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.

Reported by:	syzkaller+KASAN
Reviewed by:	tuexen, melifaro
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29519
2021-05-03 13:35:19 -04:00
Michael Tuexen
8b3d0f6439 sctp: improve address list scanning
If the alternate address has to be removed, force the stack to
find a new one, if it is still needed.

MFC after:	3 days
2021-05-03 02:50:05 +02:00
Michael Tuexen
a89481d328 sctp: improve restart handling
This fixes in particular a possible use after free bug reported
Anatoly Korniltsev and Taylor Brandstetter for the userland stack.

MFC after:	3 days
2021-05-03 02:20:24 +02:00
Alexander Motin
655c200cc8 Fix build after 5f2e183505. 2021-05-02 20:07:38 -04:00
Michael Tuexen
5f2e183505 sctp: improve error handling in INIT/INIT-ACK processing
When processing INIT and INIT-ACK information, also during
COOKIE processing, delete the current association, when it
would end up in an inconsistent state.

MFC after:	3 days
2021-05-02 22:41:35 +02:00
Michael Tuexen
e010d20032 sctp: update the vtag for INIT and INIT-ACK chunks
This is needed in case of responding with an ABORT to an INIT-ACK.
2021-04-30 13:33:16 +02:00
Michael Tuexen
eb79855920 sctp: fix SCTP_PEER_ADDR_PARAMS socket option
Ignore spp_pathmtu if it is 0, when setting the IPPROTO_SCTP level
socket option SCTP_PEER_ADDR_PARAMS as required by RFC 6458.

MFC after:	1 week
2021-04-30 12:31:09 +02:00
Michael Tuexen
eecdf5220b sctp: use RTO.Initial of 1 second as specified in RFC 4960bis 2021-04-30 00:45:56 +02:00
Michael Tuexen
9de7354bb8 sctp: improve consistency in handling chunks with wrong size
Just skip the chunk, if no other handling is required by the
specification.
2021-04-28 18:11:06 +02:00
Richard Scheffenegger
48be5b976e tcp: stop spurious rescue retransmissions and potential asserts
Reported by: pho@
MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29970
2021-04-28 15:01:10 +02:00
Michael Tuexen
059ec2225c sctp: cleanup verification of INIT and INIT-ACK chunks 2021-04-27 12:45:43 +02:00
Michael Tuexen
c70d1ef15d sctp: improve handling of illegal packets containing INIT chunks
Stop further processing of a packet when detecting that it
contains an INIT chunk, which is too small or is not the only
chunk in the packet. Still allow to finish the processing
of chunks before the INIT chunk.

Thanks to Antoly Korniltsev and Taylor Brandstetter for reporting
an issue with the userland stack, which made me aware of this
issue.

MFC after:	3 days
2021-04-26 10:43:58 +02:00
Michael Tuexen
163153c2a0 sctp: small cleanup, no functional change
MFC:		3 days
2021-04-26 02:56:48 +02:00
Hans Petter Selasky
a9b66dbd91 Allow the tcp_lro_flush_all() function to be called when the control
structure is zeroed, by setting the VNET after checking the mbuf count
for zero. It appears there are some cases with early interrupts on some
network devices which still trigger page-faults on accessing a NULL "ifp"
pointer before the TCP LRO control structure has been initialized.
This basically preserves the old behaviour, prior to
9ca874cf74 .

No functional change.

Reported by:	rscheff@
Differential Revision:	https://reviews.freebsd.org/D29564
MFC after:	2 weeks
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-04-24 12:23:42 +02:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Navdeep Parhar
01d74fe1ff Path MTU discovery hooks for offloaded TCP connections.
Notify the TOE driver when when an ICMP type 3 code 4 (Fragmentation
needed and DF set) message is received for an offloaded connection.
This gives the driver an opportunity to lower the path MTU for the
connection and resume transmission, much like what the kernel does for
the connections that it handles.

Reviewed by:	glebius@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D29755
2021-04-21 13:00:16 -07:00
Mark Johnston
652908599b Add required checks for unmapped mbufs in ipdivert and ipfw
Also add an M_ASSERTMAPPED() macro to verify that all mbufs in the chain
are mapped.  Use it in ipfw_nat, which operates on a chain returned by
m_megapullup().

PR:		255164
Reviewed by:	ae, gallatin
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29838
2021-04-21 15:47:05 -04:00
Gleb Smirnoff
d554522f6e tcp_hostcache: use SMR for lookups, mutex(9) for updates.
In certain cases, e.g. a SYN-flood from a limited set of hosts,
the TCP hostcache becomes the main contention point. To solve
that, this change introduces lockless lookups on the hostcache.

The cache remains a hash, however buckets are now CK_SLIST. For
updates a bucket mutex is obtained, for read an SMR section is
entered.

Reviewed by:	markj, rscheff
Differential revision:	https://reviews.freebsd.org/D29729
2021-04-20 10:02:20 -07:00
Gleb Smirnoff
1db08fbe3f tcp_input: always request read-locking of PCB for any pure SYN segment.
This is further rework of 08d9c92027.  Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.

This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.
2021-04-20 10:02:20 -07:00
Gleb Smirnoff
7b5053ce22 tcp_input: remove comments and assertions about tcpbinfo locking
They aren't valid since d40c0d47cd.
2021-04-20 10:02:20 -07:00
Richard Scheffenegger
a649f1f6fd tcp: Deal with DSACKs, and adjust rescue hole on success.
When a rescue retransmission is successful, rather than
inserting new holes to the left of it, adjust the old
rescue entry to cover the missed sequence space.

Also, as snd_fack may be stale by that point, pull it forward
in order to never create a hole left of snd_una/th_ack.

Finally, with DSACKs, tcp_sack_doack() may be called
with new full ACKs but a DSACK block. Account for this
eventuality properly to keep sacked_bytes >= 0.

MFC after: 3 days
Reviewed By: kbowling, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29835
2021-04-20 14:54:28 +02:00
Hans Petter Selasky
9ca874cf74 Add TCP LRO support for VLAN and VxLAN.
This change makes the TCP LRO code more generic and flexible with regards
to supporting multiple different TCP encapsulation protocols and in general
lays the ground for broader TCP LRO support. The main job of the TCP LRO code is
to merge TCP packets for the same flow, to reduce the number of calls to upper
layers. This reduces CPU and increases performance, due to being able to send
larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it
possible to avoid per-packet interaction by the host CPU.

Because the current TCP LRO code was tightly bound and optimized for TCP/IP
over ethernet only, several larger changes were needed. Also a minor bug was
fixed in the flushing mechanism for inactive entries, where the expire time,
"le->mtime" was not always properly set.

To avoid having to re-run time consuming regression tests for every change,
it was chosen to squash the following list of changes into a single commit:
- Refactor parsing of all address information into the "lro_parser" structure.
  This easily allows to reuse parsing code for inner headers.
- Speedup header data comparison. Don't compare field by field, but
  instead use an unsigned long array, where the fields get packed.
- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
  recursivly, only applying deltas as the result of updating payload data.
- Make smaller inline functions doing one operation at a time instead of
  big functions having repeated code.
- Refactor the TCP ACK compression code to only execute once
  per TCP LRO flush. This gives a minor performance improvement and
  keeps the code simple.
- Use sbintime() for all time-keeping. This change also fixes flushing
  of inactive entries.
- Try to shrink the size of the LRO entry, because it is frequently zeroed.
- Removed unused TCP LRO macros.
- Cleanup unused TCP LRO statistics counters while at it.
- Try to use __predict_true() and predict_false() to optimise CPU branch
  predictions.

Bump the __FreeBSD_version due to changing the "lro_ctrl" structure.

Tested by:	Netflix
Reviewed by:	rrs (transport)
Differential Revision:	https://reviews.freebsd.org/D29564
MFC after:	2 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-04-20 13:36:22 +02:00
Gleb Smirnoff
faa9ad8a90 Fix off-by-one error in KASSERT from 02f26e98c7. 2021-04-19 17:20:19 -07:00
Richard Scheffenegger
b87cf2bc84 tcp: keep SACK scoreboard sorted when doing rescue retransmission
Reviewed By: tuexen, kbowling, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29825
2021-04-18 23:11:10 +02:00
Michael Tuexen
9e644c2300 tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D29469
2021-04-18 16:16:42 +02:00
Richard Scheffenegger
2e97826052 rack: Fix ECN on finalizing session.
Maintain code similarity between RACK and base stack
for ECN. This may not strictly be necessary, depending
when a state transition to FIN_WAIT_1 is done in RACK
after a shutdown() or close() syscall.

MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29658
2021-04-17 20:16:42 +02:00
Richard Scheffenegger
d1de2b05a0 tcp: Rename rfc6675_pipe to sack.revised, and enable by default
As full support of RFC6675 is in place, deprecating
net.inet.tcp.rfc6675_pipe and enabling by default
net.inet.tcp.sack.revised.

Reviewed By: #transport, kbowling, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28702
2021-04-17 14:59:45 +02:00
Gleb Smirnoff
86046cf55f tcp_respond(): fix assertion, should have been done in 08d9c92027. 2021-04-16 15:39:51 -07:00
Gleb Smirnoff
cb8d7c44d6 tcp_syncache: add net.inet.tcp.syncache.see_other sysctl
A security feature from c06f087ccb appeared to be a huge bottleneck
under SYN flood. To mitigate that add a sysctl that would make
syncache(4) globally visible, ignoring UID/GID, jail(2) and mac(4)
checks. When turned on, we won't need to call crhold() on the listening
socket credential for every incoming SYN packet.

Reviewed by:	bz
2021-04-15 15:26:48 -07:00
John Baldwin
774c4c82ff TOE: Use a read lock on the PCB for syncache_add().
Reviewed by:	np, glebius
Fixes:		08d9c92027
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D29739
2021-04-13 16:31:04 -07:00
Gleb Smirnoff
8d5719aa74 syncache: simplify syncache_add() KPI to return struct socket pointer
directly, not overwriting the listen socket pointer argument.
Not a functional change.
2021-04-12 08:27:40 -07:00
Gleb Smirnoff
08d9c92027 tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D29576
2021-04-12 08:25:31 -07:00
Alexander V. Chernikov
c3a456defa Always use inp fib in the inp_lookup_mcast_ifp().
inp_lookup_mcast_ifp() is static and is only used in the inp_join_group().
The latter function is also static, and is only used in the inp_setmoptions(),
 which relies on inp being non-NULL.

As a result, in the current code, inp_lookup_mcast_ifp() is always called
 with non-NULL inp. Eliminate unused RT_DEFAULT_FIB condition and always
 use inp fib instead.

Differential Revision:	https://reviews.freebsd.org/D29594
Reviewed by:		kp
MFC after:		2 weeks
2021-04-10 13:47:49 +00:00
Gleb Smirnoff
1a7fe55ab8 tcp_hostcache: make THC_LOCK/UNLOCK macros to work with hash head pointer.
Not a functional change.
2021-04-09 14:07:35 -07:00
Gleb Smirnoff
4f49e3382f tcp_hostcache: style(9)
Reviewed by:	rscheff
2021-04-09 14:07:27 -07:00
Gleb Smirnoff
7c71f3bd6a tcp_hostcache: remove extraneous check.
All paths leading here already checked this setting.

Reviewed by:	rscheff
2021-04-09 14:07:19 -07:00
Gleb Smirnoff
0c25bf7e7c tcp_hostcache: implement tcp_hc_updatemtu() via tcp_hc_update.
Locking changes are planned here, and without this change too
much copy-and-paste would be between these two functions.

Reviewed by:	rscheff
2021-04-09 14:06:44 -07:00
Richard Scheffenegger
b878ec024b tcp: Use jenkins_hash32() in hostcache
As other parts of the base tcp stack (eg.
tcp fastopen) already use jenkins_hash32,
and the properties appear reasonably good,
switching to use that.

Reviewed By: tuexen, #transport, ae
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29515
2021-04-08 20:29:19 +02:00
Gleb Smirnoff
373ffc62c1 tcp_hostcache.c: remove unneeded includes.
Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
29acb54393 tcp_hostcache: add bool argument for tcp_hc_lookup() to tell are we
looking to only read from the result, or to update it as well.
For now doesn't affect locking, but allows to push stats and expire
update into single place.

Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
489bde5753 tcp_hostcache: hide rmx_hits/rmx_updates under ifdef.
They have little value unless you do some profiling investigations,
but they are performance bottleneck.

Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
2cca4c0ee0 Remove tcp_hostcache.h. Everything is private.
Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Richard Scheffenegger
90cca08e91 tcp: Prepare PRR to work with NewReno LossRecovery
Add proper PRR vnet declarations for consistency.
Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation
for it to deal with non-SACK window reduction (after loss).

No functional change.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29440
2021-04-08 19:16:31 +02:00
Richard Scheffenegger
9f2eeb0262 [tcp] Fix ECN on finalizing sessions.
A subtle oversight would subtly change new data packets
sent after a shutdown() or close() call, while the send
buffer is still draining.

MFC after: 3 days
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29616
2021-04-08 15:26:09 +02:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Richard Scheffenegger
a04906f027 fix typo in 38ea2bd069 2021-04-02 20:34:33 +02:00
Richard Scheffenegger
38ea2bd069 Use sbuf_drain unconditionally
After making sbuf_drain safe for external use,
there is no need to protect the call.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29545
2021-04-02 20:27:46 +02:00
Richard Scheffenegger
9aef4e7c2b tcp: Shouldn't drain empty sbuf
MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29524
2021-04-01 17:18:38 +02:00
Richard Scheffenegger
02f26e98c7 tcp: Add hash histogram output and validate bucket length accounting
Provide a histogram output to check, if the hashsize or
bucketlimit could be optimized. Also add some basic sanity
checks around the accounting of the hash utilization.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29506
2021-04-01 14:44:14 +02:00
Richard Scheffenegger
529a2a0f27 tcp: For hostcache performance, use atomics instead of counters
As accessing the tcp hostcache happens frequently on some
classes of servers, it was recommended to use atomic_add/subtract
rather than (per-CPU distributed) counters, which have to be
summed up at high cost to cache efficiency.

PR: 254333
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Reviewed By: #transport, tuexen, jtl
Differential Revision: https://reviews.freebsd.org/D29522
2021-04-01 10:03:30 +02:00
Richard Scheffenegger
95e56d31e3 tcp: Make hostcache.cache_count MPSAFE by using a counter_u64_t
Addressing the underlying root cause for cache_count to
show unexpectedly high  values, by protecting all arithmetic on
that global variable by using counter(9).

PR:		254333
Reviewed By: tuexen, #transport
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29510
2021-03-31 20:24:13 +02:00
Richard Scheffenegger
869880463c tcp: drain tcp_hostcache_list in between per-bucket locks
Explicitly drain the sbuf after completing each hash bucket
to minimize the work performed while holding the hash
bucket lock.

PR:		254333
MFC after:	2 weeks
Reviewed By:	tuexen, jhb, #transport
Sponsored by: 	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29483
2021-03-31 19:24:21 +02:00
Andrey V. Elsukov
c80a4b76ce ipdivert: check that PCB is still valid after taking INPCB_RLOCK.
We are inspecting PCBs of divert sockets under NET_EPOCH section,
but PCB could be already detached and we should check INP_FREED flag
when we took INP_RLOCK.

PR:		254478
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29420
2021-03-30 12:31:09 +03:00
Richard Scheffenegger
cb0dd7e122 tcp: reduce memory footprint when listing tcp hostcache
In tcp_hostcache_list, the sbuf used would need a large (~2MB)
blocking allocation of memory (M_WAITOK), when listing a
full hostcache. This may stall the requestor for an indeterminate
time.

A further optimization is to return the expected userspace
buffersize right away, rather than preparing the output of
each current entry of the hostcase, provided by: @tuexen.

This makes use of the ready-made functions of sbuf to work
with sysctl, and repeatedly drain the much smaller buffer.

PR: 254333
MFC after: 2 weeks
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29471
2021-03-28 23:50:23 +02:00
Richard Scheffenegger
b9f803b7d4 tcp: Use PRR for ECN congestion recovery
MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28972
2021-03-26 02:06:15 +01:00
Richard Scheffenegger
eb3a59a831 tcp: Refactor PRR code
No functional change intended.

MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29411
2021-03-26 00:01:34 +01:00
Richard Scheffenegger
0533fab89e tcp: Perform simple fast retransmit when SACK Blocks are missing on SACK session
MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28634
2021-03-25 23:23:48 +01:00
Michael Tuexen
d995cc7e54 sctp: fix handling of RTO.initial of 1 ms
MFC after:	3 days
Reported by:	syzbot+5eb0e009147050056ce9@syzkaller.appspotmail.com
2021-03-22 16:44:18 +01:00
Michael Tuexen
40f41ece76 tcp: improve handling of SYN segments in SYN-SENT state
Ensure that the stack does not generate a DSACK block for user
data received on a SYN segment in SYN-SENT state.

Reviewed by:		rscheff
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D29376
Sponsored by:		Netflix, Inc.
2021-03-22 15:58:49 +01:00
Richard Scheffenegger
e9f029831f fix panic when rescue retransmission and FIN overlap
PR:           254244
PR:           254309
Reviewed By:  #transport, hselasky, tuexen
MFC after:    3 days
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29315
2021-03-17 17:12:04 +01:00
Gordon Bergling
5666643a95 Fix some common typos in comments
- occured -> occurred
- normaly -> normally
- controling -> controlling
- fileds -> fields
- insterted -> inserted
- outputing -> outputting

MFC after:	1 week
2021-03-13 18:26:15 +01:00
Gordon Bergling
183502d162 Fix a few typos in comments
- trough -> through

MFC after:	1 week
2021-03-13 16:37:28 +01:00
John Baldwin
5a50eb6585 Don't pass RFPROC to kproc_create(), it is redundant.
Reviewed by:	tuexen, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D29206
2021-03-12 09:48:10 -08:00
Alexander V. Chernikov
b1d63265ac Flush remaining routes from the routing table during VNET shutdown.
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
 VNET.

PR:	253998
Reported by:	rashey at superbox.pl
MFC after:	3 days

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo  ->  passed  [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

Reviewers: #network, kp, bz

Reviewed By: kp

Subscribers: imp, ae

Differential Revision: https://reviews.freebsd.org/D29116
2021-03-10 21:10:14 +00:00
Richard Scheffenegger
e53138694a tcp: Add prr_out in preparation for PRR/nonSACK and LRD
Reviewed By:           #transport, kbowling
MFC after:             3 days
Sponsored By:          Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D29058
2021-03-06 00:38:22 +01:00
Richard Scheffenegger
9a13d9dcee tcp: remove a superfluous local var in tcp_sack_partialack()
No functional change.

Reviewed By: #transport, tuexen
MFC after:   3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29088
2021-03-05 18:20:23 +01:00
Richard Scheffenegger
4a8f3aad37 tcp: remove incorrect reset of SACK variable in PRR
Reviewed By:   #transport, rrs, tuexen
PR:            253848
MFC after:     3 days
Sponsored By:  NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29083
2021-03-05 17:45:54 +01:00
Michael Tuexen
705d06b289 rack: unbreak TCP fast open for the client side
Allow sending user data on the SYN segment.

Reviewed by:		rrs
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D29082
Sponsored by:		Netflix, Inc.
2021-03-05 16:03:03 +01:00
Kristof Provost
bb4a7d94b9 net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.

Reviewed by:	ae (previous version), rscheff, tuexen, rgrimes
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29056
2021-03-04 20:56:48 +01:00
Michael Tuexen
99adf23006 RACK: fix an issue triggered by using the CDG CC module
Obtained from:		rrs@
MFC after:		3 days
PR:			238741
Sponsored by:		Netlix, Inc.
2021-03-02 12:32:16 +01:00
Richard Scheffenegger
0b0f8b359d calculate prr_out correctly when pipe < ssthresh
Reviewed By:	#transport, tuexen
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D28998
2021-03-01 16:26:05 +01:00
Richard Scheffenegger
e9071000c9 Improve PRR initial transmission timing
Reviewed By:	tuexen, #transport
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D28953
2021-02-28 15:46:54 +01:00
Michael Tuexen
70e95f0b69 sctp: avoid integer overflow when starting the HB timer
MFC after:	3 days
Reported by:	syzbot+14b9d7c3c64208fae62f@syzkaller.appspotmail.com
2021-02-27 23:27:30 +01:00
Richard Scheffenegger
9e83a6a556 Include new data sent in PRR calculation
Reviewed By:	#transport, kbowling
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D28941
2021-02-26 22:31:58 +01:00
Richard Scheffenegger
2593f858d7 A TCP server has to take into consideration, if TCP_NOOPT is preventing
the negotiation of TCP features. This affects most TCP options but
adherance to RFC7323 with the timestamp option will prevent a session
from getting established.

PR:	253576
Reviewed By:	tuexen, #transport
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28652
2021-02-25 19:12:20 +01:00
Richard Scheffenegger
31d7a27c6e PRR: Avoid accounting left-edge twice in partial ACK.
Reviewed By:	#transport, kbowling
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D28819
2021-02-25 18:37:47 +01:00
Richard Scheffenegger
48396dc779 Address two incorrect calculations and enhance readability of PRR code
- address second instance of cwnd potentially becoming zero
- fix sublte bug due to implicit int to uint typecase in max()
- fix bug due to typo in hand-coded CEILING() function by using howmany() macro
- use int instead of long, and add a missing long typecast
- replace if conditionals with easier to read imax/imin (as in pseudocode)

Reviewed By: #transport, kbowling
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28813
2021-02-25 18:32:04 +01:00
Kristof Provost
f3245be349 net: remove legacy in_addmulti()
Despite the comment to the contrary neither pf nor carp use
in_addmulti(). Nothing does, so get rid of it.

Carp stopped using it in 08b68b0e4c
(2011). It's unclear when pf stopped using it, but before
d6d3f01e0a (2012).

Reviewed by:	bz@, melifaro@
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D28918
2021-02-25 10:13:52 +01:00
Kristof Provost
c139b3c19b arp/nd: Cope with late calls to iflladdr_event
When tearing down vnet jails we can move an if_bridge out (as
part of the normal vnet_if_return()). This can, when it's clearing out
its list of member interfaces, change its link layer address.
That sends an iflladdr_event, but at that point we've already freed the
AF_INET/AF_INET6 if_afdata pointers.

In other words: when the iflladdr_event callbacks fire we can't assume
that ifp->if_afdata[AF_INET] will be set.

Reviewed by:	donner@, melifaro@
MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D28860
2021-02-23 13:54:07 +01:00
Hans Petter Selasky
9febbc4541 Fix for natd(8) sending wrong sequence number after TCP retransmission,
terminating a TCP connection.

If a TCP packet must be retransmitted and the data length has changed in the
retransmitted packet, due to the internal workings of TCP, typically when ACK
packets are lost, then there is a 30% chance that the logic in GetDeltaSeqOut()
will find the correct length, which is the last length received.

This can be explained as follows:

If a "227 Entering Passive Mode" packet must be retransmittet and the length
changes from 51 to 50 bytes, for example, then we have three cases for the
list scan in GetDeltaSeqOut(), depending on how many prior packets were
received modulus N_LINK_TCP_DATA=3:

  case 1:  index 0:   original packet        51
           index 1:   retransmitted packet   50
           index 2:   not relevant

  case 2:  index 0:   not relevant
           index 1:   original packet        51
           index 2:   retransmitted packet   50

  case 3:  index 0:   retransmitted packet   50
           index 1:   not relevant
           index 2:   original packet        51

This patch simply changes the searching order for TCP packets, always starting
at the last received packet instead of any received packet, in
GetDeltaAckIn() and GetDeltaSeqOut().

Else no functional changes.

Discussed with:	rscheff@
Submitted by:	Andreas Longwitz <longwitz@incore.de>
PR:		230755
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-02-22 17:13:58 +01:00
Michael Tuexen
b963ce4588 sctp: improve computation of an alternate net
Espeially handle the case where the net passed in is about to
be deleted and therefore not in the list of nets anymore.

MFC after:	3 days
Reported by:	syzbot+9756917a7c8381adf5e8@syzkaller.appspotmail.com
2021-02-21 17:13:06 +01:00
Michael Tuexen
5ac839029d sctp: clear a pointer to a net which will be removed
MFC after:	3 days
2021-02-21 13:06:05 +01:00
Richard Scheffenegger
a8e431e153 PRR: use accurate rfc6675_pipe when enabled
Reviewed By: #transport, tuexen
MFC after:   2 weeks
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28816
2021-02-20 20:11:48 +01:00
Richard Scheffenegger
853fd7a2e3 Ensure cwnd doesn't shrink to zero with PRR
Under some circumstances, PRR may end up with a fully
collapsed cwnd when finalizing the loss recovery.

Reviewed By:	#transport, kbowling
Reported by:	Liang Tian
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D28780
2021-02-19 13:55:32 +01:00
Kyle Evans
4c0bef07be kern: net: remove TCP_LINGERTIME
TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in
exactly the same form that it appears here modulo slightly different
context.  It used to be the case that there was a single pr_usrreq
method with requests dispatched to it; these exact two lines appeared in
tcp_usrreq's PRU_ATTACH handling.

The only purpose of this that I can find is to cause surprising behavior
on accepted connections. Newly-created sockets will never hit these
paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is
set on a listening socket and inherited, one would expect the timeout to
be inherited rather than changed arbitrarily like this -- noting that
SO_LINGER is nonsense on a listening socket beyond inheritance, since
they cannot be 'connected' by definition.

Neither Illumos nor Linux reset the timer like this based on testing and
inspection of Illumos, and testing of Linux.

Reviewed by:	rscheff, tuexen
Differential Revision:	https://reviews.freebsd.org/D28265
2021-02-18 22:36:01 -06:00
Randall Stewart
e13e4fa6c4 fix Navdeeps LINT_NOINET error. 2021-02-18 07:29:12 -05:00
Randall Stewart
0a4f851074 Fix another pesky missing #ifdef TCPHPTS 2021-02-18 01:27:30 -05:00
Randall Stewart
ab4fad4be1 Add ifdef TCPHPTS around build_ack_entry and do_bpf_and_csum to avoid
warnings when HPTS is not included

Thanks to Gary Jennejohn for pointing this out.
2021-02-17 12:49:42 -05:00
Randall Stewart
69a34e8d02 Update the LRO processing code so that we can support
a further CPU enhancements for compressed acks. These
are acks that are compressed into an mbuf. The transport
has to be aware of how to process these, and an upcoming
update to rack will do so. You need the rack changes
to actually test and validate these since if the transport
does not support mbuf compression, then the old code paths
stay in place. We do in this commit take out the concept
of logging if you don't have a lock (which was quite
dangerous and was only for some early debugging but has
been left in the code).

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28374
2021-02-17 10:41:01 -05:00
Alexander V. Chernikov
9fdbf7eef5 Make in_localip_more() fib-aware.
It fixes loopback route installation for the interfaces
 in the different fibs using the same prefix.

Reviewed By:	donner
PR:		189088
Differential Revision: https://reviews.freebsd.org/D28673
MFC after:	1 week
2021-02-16 20:00:46 +00:00
Richard Scheffenegger
3c40e1d52c update the SACK loss recovery to RFC6675, with the following new features:
- improved pipe calculation which does not degrade under heavy loss
- engaging in Loss Recovery earlier under adverse conditions
- Rescue Retransmission in case some of the trailing packets of a request got lost

All above changes are toggled with the sysctl "rfc6675_pipe" (disabled by default).

Reviewers:	#transport, tuexen, lstewart, slavash, jtl, hselasky, kib, rgrimes, chengc_netapp.com, thj, #manpages, kbowling, #netapp, rscheff
Reviewed By:	#transport
Subscribers:	imp, melifaro
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D18985
2021-02-16 13:08:37 +01:00
Alexander V. Chernikov
8268d82cff Remove per-packet ifa refcounting from IPv6 fast path.
Currently ip6_input() calls in6ifa_ifwithaddr() for
 every local packet, in order to check if the target ip
 belongs to the local ifa in proper state and increase
 its counters.

in6ifa_ifwithaddr() references found ifa.
With epoch changes, both `ip6_input()` and all other current callers
 of `in6ifa_ifwithaddr()` do not need this reference
 anymore, as epoch provides stability guarantee.

Given that, update `in6ifa_ifwithaddr()` to allow
 it to return ifa without referencing it, while preserving
 option for getting referenced ifa if so desired.

MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D28648
2021-02-15 22:33:12 +00:00
Michael Tuexen
ed782b9f5a tcp: improve behaviour when using TCP_NOOPT
Use ISS for SEG.SEQ when sending a SYN-ACK segment in response to
an SYN segment received in the SYN-SENT state on a socket having
the IPPROTO_TCP level socket option TCP_NOOPT enabled.

Reviewed by:		rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D28656
2021-02-14 12:16:57 +01:00
Andrey V. Elsukov
c6ded47d0b [udp] fix possible mbuf and lock leak in udp_input().
In error case we can leave `inp' locked, also we need to free
mbuf chain `m' in the same case. Release the lock and use `badunlocked'
label to exit with freed mbuf. Also modify UDP error statistic to
match the IPv6 code.

Remove redundant INP_RUNLOCK() from the `if (last == NULL)' block,
there are no ways to reach this point with locked `inp'.

Obtained from:	Yandex LLC
MFC after:	3 days
Sponsored by:	Yandex LLC
2021-02-11 12:08:41 +03:00
Alexander V. Chernikov
924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov
4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Neel Chauhan
a08cdb6cfb Allow setting alias port ranges in libalias and ipfw. This will allow a system
to be a true RFC 6598 NAT444 setup, where each network segment (e.g. user,
subnet) can have their own dedicated port aliasing ranges.

Reviewed by:		donner, kp
Approved by:		0mp (mentor), donner, kp
Differential Revision:	https://reviews.freebsd.org/D23450
2021-02-02 13:24:17 -08:00
Hans Petter Selasky
db46c0d0cb Fix LINT kernel builds after 1a714ff204 .
MFC after:	1 week
Discussed with:	rrs@
Differential Revision:  https://reviews.freebsd.org/D28357
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-02-01 14:24:15 +01:00
Michael Tuexen
bdd4630c9a sctp: small cleanup, no functional change intended.
MFC after:	3 days
2021-02-01 14:04:57 +01:00
Michael Tuexen
af885c57d6 sctp: improve input validation
Improve the handling of INIT chunks in specific szenarios and
report and appropriate error cause.
Thanks to Anatoly Korniltsev for reporting the issue for the
userland stack.

MFC after:	3 days
2021-01-31 23:46:53 +01:00
Michael Tuexen
8dc6a1edca sctp: fix a locking issue for old unordered data
Thanks to Anatoly Korniltsev for reporting the issue for the
userland stack.

MFC after:	3 days
2021-01-31 10:46:23 +01:00
Gleb Smirnoff
3f43ada98c Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2021-01-29 11:46:24 -08:00
Randall Stewart
1a714ff204 This pulls over all the changes that are in the netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.

    Sponsored by: Netflix Inc.
    Differential Revision:  https://reviews.freebsd.org/D28357
2021-01-28 11:53:05 -05:00
Hans Petter Selasky
093e723190 Add missing decrement of active ratelimit connections.
Reviewed by:	rrs@
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-01-26 18:00:21 +01:00
Hans Petter Selasky
85d8d30f9f Don't allow allocating a new send tag on an INP which is being torn down.
This fixes a potential send tag leak.

Reviewed by:	rrs@
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-01-26 18:00:02 +01:00
Richard Scheffenegger
6a376af0cd TCP PRR: Patch div/0 in tcp_prr_partialack
With clearing of recover_fs in bc7ee8e5bc, div/0
was observed while processing partial_acks.

Suspect that rewind of an erraneous RTO may be
causing this - with the above change, recover_fs
would no longer retained at the last calculated
value, and reset. But CC_RTO_ERR can reenable
IN_RECOVERY(), without setting this again.

Adding a safety net prior to the division in that
function, which I missed in D28114.
2021-01-26 16:06:32 +01:00
Richard Scheffenegger
84761f3df5 Adjust line length in tcp_prr_partialack
Summary:
Wrap lines before column 80 in new prr code checked in recently.

No functional changes.

Reviewers: tuexen, rrs, jtl, mm, kbowling, #transport

Reviewed By: tuexen, mm, #transport

Subscribers: imp, melifaro

Differential Revision: https://reviews.freebsd.org/D28329
2021-01-26 14:47:19 +01:00
Michael Tuexen
0f7573ffd6 sctp: fix PR-SCTP stats when adding addtional streams
MFC after:	1 week
2021-01-24 00:50:33 +01:00
Michael Tuexen
7a051c0a78 sctp: improve consistency
No functional change intended.

MFC:	1 week
2021-01-24 00:07:41 +01:00
Alexander V. Chernikov
130aebbab0 Further refactor IPv4 interface route creation.
* Fix bug with /32 aliases introduced in 81728a538d.
* Explicitly document business logic for IPv4 ifa routes.
* Remove remnants of rtinit()
* Deduplicate ifa->route prefix code by moving it into ia_getrtprefix()
* Deduplicate conditional check for ifa_maintain_loopback_route()  by
 moving into ia_need_loopback_route()
* Remove now-unused flags argument from in_addprefix().

Reviewed by:		donner
PR:			252883
Differential Revision:	https://reviews.freebsd.org/D28246
2021-01-21 21:48:49 +00:00
Richard Scheffenegger
bc7ee8e5bc Address panic with PRR due to missed initialization of recover_fs
Summary:
When using the base stack in conjunction with RACK, it appears that
infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh.

This leaves the recover flightsize (sackhint.recover_fs) uninitialized,
leading to a div/0 panic.

Address this by properly initializing the variable just prior to first
use, if it is not properly initialized.

In order to prevent stale information from a prior recovery to
negatively impact the PRR calculations in this event, also clear
recover_fs once loss recovery is finished.

Finally, improve the readability of the initialization of recover_fs
when t_dupacks == tcprexmtthresh by adjusting the indentation and
using the max(1, snd_nxt - snd_una) macro.

Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages

Reviewed By: rrs, kbowling, #transport

Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro

Differential Revision: https://reviews.freebsd.org/D28114
2021-01-20 12:06:34 +01:00
Alex Richardson
a81c165bce Require uint32_t alignment for ipfw_insn
There are many casts of this struct to uint32_t, so we also need to ensure
that it is sufficiently aligned to safely perform this cast on architectures
that don't allow unaligned accesses. This fixes lots of -Wcast-align warnings.

Reviewed By:	ae
Differential Revision: https://reviews.freebsd.org/D27879
2021-01-19 21:23:25 +00:00
Alex Richardson
be5972695f libalias: Fix remaining compiler warnings
This fixes some sign-compare warnings and adds a missing static to a
variable declaration.

Differential Revision: https://reviews.freebsd.org/D27883
2021-01-19 21:23:24 +00:00
Alex Richardson
bc596e5632 libalias: Fix -Wcast-align compiler warnings
This fixes -Wcast-align warnings caused by the underaligned `struct ip`.
This also silences them in the public functions by changing the function
signature from char * to void *. This is source and binary compatible and
avoids the -Wcast-align warning.

Reviewed By:	ae, gbe (manpages)
Differential Revision: https://reviews.freebsd.org/D27882
2021-01-19 21:23:24 +00:00
Alexander V. Chernikov
f879876721 Fix IPv4 fib bsearch4() lookup array construction.
Current code didn't properly handle the case with nested prefixes
 like 10.0.0.0/24 && 10.0.0.0/25.
2021-01-17 20:32:26 +00:00
Alexander V. Chernikov
81728a538d Split rtinit() into multiple functions.
rtinit[1]() is a function used to add or remove interface address prefix routes,
  similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
 in reality.

1) IPv6 code does not use it for the ifa routes. There is a separate layer,
  nd6_prelist_(), providing interface for maintaining interface routes. Its part,
  responsible for the actual route table interaction, mimics rtenty() code.

2) rtinit tries to combine multiple actions in the same function: constructing
  proper route attributes and handling iterations over multiple fibs, for the
  non-zero net.add_addr_allfibs use case. It notably increases the code complexity.

3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
 for p2p connections, host routes and p2p routes are handled in the same way.
 Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
 It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
 aliases.

4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
 complicating rib_action() implementation.

5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
 ifa messages in certain corner cases.

To address all these points, the following has been done:

* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
 dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
 is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
 responsible for the actual route notifications.

Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses

Differential revision:	https://reviews.freebsd.org/D28186
2021-01-16 22:42:41 +00:00
Michael Tuexen
d2b3ceddcc tcp: add sysctl to tolerate TCP segments missing timestamps
When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.

Reviewed by:		jtl@, rscheff@
Differential Revision:	https://reviews.freebsd.org/D28142
Sponsored by:		Netflix, Inc.
PR:			252449
MFC after:		1 week
2021-01-14 19:28:25 +01:00
Michael Tuexen
cc3c34859e tcp: fix handling of TCP RST segments missing timestamps
A TCP RST segment should be processed even it is missing TCP
timestamps.

Reported by:		dmgk@, kevans@
Reviewed by:		rscheff@, dmgk@
Sponsored by:		Netflix, Inc.
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D28143
2021-01-14 14:39:35 +01:00
Mateusz Guzik
6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Alexander V. Chernikov
2defbe9f0e Use rn_match instead of doing indirect calls in fib_algo.
Relevant inet/inet6 code has the control over deciding what
 the RIB lookup function currently is. With that in mind,
 explicitly set it to the current value (rn_match) in the
 datapath lookups. This avoids cost on indirect call.

Differential Revision: https://reviews.freebsd.org/D28066
2021-01-11 23:30:35 +00:00
Alexander V. Chernikov
0da3f8c98d Bump amount of queued packets in for unresolved ARP/NDP entries to 16.
Currently default behaviour is to keep only 1 packet per unresolved entry.
Ability to queue more than one packet was added 10 years ago, in r215207,
 though the default value was kep intact.

Things have changed since that time. Systems tend to initiate multiple
 connections at once for a variety of reasons.
For example, recent kern/252278 bug report describe happy-eyeball DNS
 behaviour sending multiple requests to the DNS server.

The primary driver for upper value for the queue length determination is
 memory consumption. Remote actors should not be able to easily exhaust
 local memory by sending packets to unresolved arp/ND entries.

For now, bump value to 16 packets, to match Darwin implementation.

The proper approach would be to switch the limit to calculate memory
 consumption instead of packet count and limit based on memory.

We should MFC this with a variation of D22447.

Reviewers: #manpages, #network, bz, emaste

Reviewed By: emaste, gbe(doc), jilles(doc)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28068
2021-01-11 19:51:11 +00:00
Mark Johnston
501159696c igmp: Avoid leaking mbuf when source validation fails
PR:		252504
Submitted by:	Panagiotis Tsolakos <panagiotis.tsolakos@gmail.com>
MFC after:	3 days
2021-01-08 13:32:04 -05:00
Alexander V. Chernikov
d68cf57b7f Refactor rt_addrmsg() and rt_routemsg().
Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate  function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011
2021-01-07 19:38:19 +00:00
Michael Tuexen
a7aa5eea4f sctp: improve handling of aborted associations
Don't clear a flag, when the structure already has been freed.
Reported by:	syzbot+07667d16c96779c737b4@syzkaller.appspotmail.com
2021-01-01 15:59:10 +01:00
Alexander V. Chernikov
f733d9701b Fix default route handling in radix4_lockless algo.
Improve nexthop debugging.

Reported by:	Florian Smeets <flo at smeets.xyz>
2020-12-26 22:51:02 +00:00
Alexander V. Chernikov
f5baf8bb12 Add modular fib lookup framework.
This change introduces framework that allows to dynamically
 attach or detach longest prefix match (lpm) lookup algorithms
 to speed up datapath route tables lookups.

Framework takes care of handling initial synchronisation,
 route subscription, nhop/nhop groups reference and indexing,
 dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
 picking the best matching algorithm on-the-fly based on the
 amount of routes in the routing table.

Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.

The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
  small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
  tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)

Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
  radix4: ~20mpps
  radix4_lockless: ~24.8mpps
  bsearch4: ~69mpps
  dpdk_lpm4: ~67 mpps
700k routes:
  radix4_lockless: 3.3mpps
  dpdk_lpm4: 46mpps

IPv6:
8 routes:
  radix6_lockless: ~20mpps
  dpdk_lpm6: ~70mpps
100k routes:
  radix6_lockless: 13.9mpps
  dpdk_lpm6: 57mpps

Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)

Control:
Framwork adds the following runtime sysctls:

List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless

Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.

Differential Revision: https://reviews.freebsd.org/D27401
2020-12-25 11:33:17 +00:00
Michael Tuexen
0ec2ce0d32 Improve input validation for parameters in ASCONF and ASCONF-ACK chunks
Thanks to Tolya Korniltsev for drawing my attention to this part of the
code by reporting an issue for the userland stack.
2020-12-23 18:03:47 +01:00
Andrew Gallatin
a034518ac8 Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.

Reviewed by:	jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by:	Netfix
Differential Revision:	https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
Michael Tuexen
0066de1c4b Harden the handling of outgoing streams in case of an restart or INIT
collision. This avouds an out-of-bounce access in case the peer can
break the cookie signature. Thanks to Felix Wilhelm from Google for
reporting the issue.

MFC after:		1 week
2020-12-13 23:51:51 +00:00
Michael Tuexen
aa6db9a045 Clean up more resouces of an existing SCTP association in case of
a restart.

This fixes a use-after-free scenario, which was reported by Felix
Wilhelm from Google in case a peer is able to modify the cookie.
However, this can also be triggered by an assciation restart under
some specific conditions.

MFC after:		1 week
2020-12-12 22:23:45 +00:00
Richard Scheffenegger
0e1d7c25c5 Add TCP feature Proportional Rate Reduction (PRR) - RFC6937
PRR improves loss recovery and avoids RTOs in a wide range
of scenarios (ACK thinning) over regular SACK loss recovery.

PRR is disabled by default, enable by net.inet.tcp.do_prr = 1.
Performance may be impeded by token bucket rate policers at
the bottleneck, where net.inet.tcp.do_prr_conservate = 1
should be enabled in addition.

Submitted by:	Aris Angelogiannopoulos
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D18892
2020-12-04 11:29:27 +00:00
Alexander V. Chernikov
d1d941c5b9 Remove RADIX_MPATH config option.
ROUTE_MPATH is the new config option controlling new multipath routing
 implementation. Remove the last pieces of RADIX_MPATH-related code and
 the config option.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D27244
2020-11-29 19:43:33 +00:00
Alexander V. Chernikov
b712e3e343 Refactor fib4/fib6 functions.
No functional changes.

* Make lookup path of fib<4|6>_lookup_debugnet() separate functions
 (fib<46>_lookup_rt()). These will be used in the control plane code
 requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
 This change simplifies the switch to alternative lookup implementations,
 which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys

Differential Revision:	https://reviews.freebsd.org/D27405
2020-11-29 13:41:49 +00:00
Michael Tuexen
75fcd27ac2 Fix two occurences of a typo in a comment introduced in r367530.
Reported by:		lstewart@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D27148
2020-11-23 10:13:56 +00:00
Alexander V. Chernikov
7511a63825 Refactor rib iterator functions.
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_forach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
 consistent and prepare for upcoming iterator optimizations.

Differential Revision:	https://reviews.freebsd.org/D27219
2020-11-22 20:21:10 +00:00
Michael Tuexen
47384244f9 Fix an issue I introuced in r367530: tcp_twcheck() can be called
with to == NULL for SYN segments. So don't assume tp != NULL.
Thanks to jhb@ for reporting and suggesting a fix.

PR:			250499
MFC after:		1 week
XMFC-with:		r367530
Sponsored by:		Netflix, Inc.
2020-11-20 13:00:28 +00:00
Ed Maste
360d1232ab ip_fastfwd: style(9) tidy for r367628
Discussed with:	gnn
MFC with:	r367628
2020-11-13 18:25:07 +00:00
George V. Neville-Neil
d65d6d5aa9 Followup pointed out by ae@ 2020-11-13 13:07:44 +00:00
George V. Neville-Neil
8ad114c082 An earlier commit effectively turned out the fast forwading path
due to its lack of support for ICMP redirects. The following commit
adds redirects to the fastforward path, again allowing for decent
forwarding performance in the kernel.

Reviewed by: ae, melifaro
Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")
2020-11-12 21:58:47 +00:00
Michael Tuexen
283c76c7c3 RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
  the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
  for the timestamp option has not been negotiated.
This patch enforces the above.

PR:			250499
Reviewed by:		gnn, rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc
Differential Revision:	https://reviews.freebsd.org/D27148
2020-11-09 21:49:40 +00:00
Michael Tuexen
e597bae4ee Fix a potential use-after-free bug introduced in
https://svnweb.freebsd.org/changeset/base/363046

Thanks to Taylor Brandstetter for finding this issue using fuzz testing
and reporting it in https://github.com/sctplab/usrsctp/issues/547
2020-11-09 13:12:07 +00:00
Mitchell Horne
b02c4e5c78 igmp: convert igmpstat to use PCPU counters
Currently there is no locking done to protect this structure. It is
likely okay due to the low-volume nature of IGMP, but allows for
the possibility of underflow. This appears to be one of the only
holdouts of the conversion to counter(9) which was done for most
protocol stat structures around 2013.

This also updates the visibility of this stats structure so that it can
be consumed from elsewhere in the kernel, consistent with the vast
majority of VNET_PCPUSTAT structures.

Reviewed by:	kp
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27023
2020-11-08 18:49:23 +00:00
Richard Scheffenegger
4d0770f172 Prevent premature SACK block transmission during loss recovery
Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.

Reported by:	chengc_netapp.com
Reviewed by:	rrs, jtl
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24237
2020-11-08 18:47:05 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
John Baldwin
98d7a8d9cd Call m_snd_tag_rele() to free send tags.
Send tags are refcounted and if_snd_tag_free() is called by
m_snd_tag_rele() when the last reference is dropped on a send tag.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26995
2020-10-29 22:18:56 +00:00
John Baldwin
7552deb2a0 Remove an extra if_ref().
In r348254, if_snd_tag_alloc() routines were changed to bump the ifp
refcount via m_snd_tag_init().  This function wasn't in the tree at
the time and wasn't updated for the new semantics, so was still doing
a separate bump after if_snd_tag_alloc() returned.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26999
2020-10-29 22:16:59 +00:00
John Baldwin
aebfdc1fec Store the new send tag in the right place.
r350501 added the 'st' parameter, but did not pass it down to
if_snd_tag_alloc().

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26997
2020-10-29 22:14:34 +00:00