6885 Commits

Author SHA1 Message Date
Michael Tuexen
059ec2225c sctp: cleanup verification of INIT and INIT-ACK chunks 2021-04-27 12:45:43 +02:00
Michael Tuexen
c70d1ef15d sctp: improve handling of illegal packets containing INIT chunks
Stop further processing of a packet when detecting that it
contains an INIT chunk, which is too small or is not the only
chunk in the packet. Still allow the processing of chunks
before the INIT chunk to finish.

Thanks to Antoly Korniltsev and Taylor Brandstetter for reporting
an issue with the userland stack, which made me aware of this
issue.

MFC after:	3 days
2021-04-26 10:43:58 +02:00
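The check described in this commit can be illustrated with a minimal userland sketch. It is not the code from the commit: `check_init()` and the verdict enum are hypothetical names, and the buffer is assumed to hold only the chunks of one packet with the SCTP common header already stripped. The chunk layout (type, flags, 16-bit length, padding to 4 bytes) and the 20-byte minimum INIT size follow RFC 4960.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_TYPE_INIT 1   /* INIT chunk type from RFC 4960 */
#define CHUNK_HDR_LEN   4   /* type (1), flags (1), length (2) */
#define INIT_MIN_LEN    20  /* chunk header plus 16 bytes of fixed INIT fields */

/* Hypothetical verdicts for this sketch. */
enum init_verdict { NO_INIT, INIT_OK, INIT_ILLEGAL };

/*
 * Walk the chunks in buf and report whether an INIT chunk is present
 * and, if so, whether it is large enough and the only chunk.
 */
static enum init_verdict
check_init(const uint8_t *buf, size_t len)
{
	size_t off = 0;

	assert(buf != NULL);
	while (off < len && len - off >= CHUNK_HDR_LEN) {
		uint8_t type = buf[off];
		size_t clen = ((size_t)buf[off + 2] << 8) | buf[off + 3];
		size_t padded = (clen + 3) & ~(size_t)3;

		if (type == CHUNK_TYPE_INIT) {
			if (clen < INIT_MIN_LEN || clen > len - off)
				return (INIT_ILLEGAL);	/* too small or truncated */
			if (off != 0 || off + padded < len)
				return (INIT_ILLEGAL);	/* not the only chunk */
			return (INIT_OK);
		}
		if (clen < CHUNK_HDR_LEN || clen > len - off)
			break;			/* malformed chunk, stop walking */
		off += padded;
	}
	return (NO_INIT);
}
```

A caller that processes chunks in order would handle everything before the offending INIT chunk and stop once `INIT_ILLEGAL` is reported, which mirrors the behaviour the commit message describes.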
Michael Tuexen
163153c2a0 sctp: small cleanup, no functional change
MFC:		3 days
2021-04-26 02:56:48 +02:00
Hans Petter Selasky
a9b66dbd91 Allow the tcp_lro_flush_all() function to be called when the control
structure is zeroed, by setting the VNET after checking the mbuf count
for zero. It appears there are some cases with early interrupts on some
network devices which still trigger page-faults on accessing a NULL "ifp"
pointer before the TCP LRO control structure has been initialized.
This basically preserves the old behaviour, prior to
9ca874cf74 .

No functional change.

Reported by:	rscheff@
Differential Revision:	https://reviews.freebsd.org/D29564
MFC after:	2 weeks
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-04-24 12:23:42 +02:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Navdeep Parhar
01d74fe1ff Path MTU discovery hooks for offloaded TCP connections.
Notify the TOE driver when an ICMP type 3 code 4 (Fragmentation
needed and DF set) message is received for an offloaded connection.
This gives the driver an opportunity to lower the path MTU for the
connection and resume transmission, much like what the kernel does for
the connections that it handles.

Reviewed by:	glebius@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D29755
2021-04-21 13:00:16 -07:00
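As a rough illustration of the hook described above, the following sketch dispatches an ICMP "fragmentation needed" message either to a driver callback or to the host path. All names here (`struct conn`, `toe_pmtu_update`, `icmp_needfrag_input()`) are hypothetical and do not reflect the actual TOE KPI; only the ICMP type and code values come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICMP_UNREACH          3   /* ICMP type: destination unreachable */
#define ICMP_UNREACH_NEEDFRAG 4   /* ICMP code: fragmentation needed and DF set */
#define IPV4_MIN_MTU          68  /* smallest MTU IPv4 permits (RFC 791) */

/* Hypothetical connection state for this sketch. */
struct conn {
	uint32_t path_mtu;
	/* Driver hook; NULL when the connection is not offloaded. */
	void (*toe_pmtu_update)(struct conn *, uint32_t);
};

/* Example driver hook: a real driver would also reprogram the hardware. */
static void
toe_lower_mtu(struct conn *c, uint32_t mtu)
{
	c->path_mtu = mtu;
}

/* Returns 1 if the message lowered the path MTU, 0 if it was ignored. */
static int
icmp_needfrag_input(struct conn *c, uint8_t type, uint8_t code,
    uint32_t next_hop_mtu)
{
	assert(c != NULL);
	if (type != ICMP_UNREACH || code != ICMP_UNREACH_NEEDFRAG)
		return (0);
	/* Ignore values that are implausible or would raise the MTU. */
	if (next_hop_mtu < IPV4_MIN_MTU || next_hop_mtu >= c->path_mtu)
		return (0);
	if (c->toe_pmtu_update != NULL)
		c->toe_pmtu_update(c, next_hop_mtu);	/* offloaded path */
	else
		c->path_mtu = next_hop_mtu;		/* host stack path */
	return (1);
}
```

Using a function pointer keeps the ICMP input path unaware of any particular driver, which is the general shape the commit message suggests.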
Mark Johnston
652908599b Add required checks for unmapped mbufs in ipdivert and ipfw
Also add an M_ASSERTMAPPED() macro to verify that all mbufs in the chain
are mapped.  Use it in ipfw_nat, which operates on a chain returned by
m_megapullup().

PR:		255164
Reviewed by:	ae, gallatin
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29838
2021-04-21 15:47:05 -04:00
Gleb Smirnoff
d554522f6e tcp_hostcache: use SMR for lookups, mutex(9) for updates.
In certain cases, e.g. a SYN-flood from a limited set of hosts,
the TCP hostcache becomes the main contention point. To solve
that, this change introduces lockless lookups on the hostcache.

The cache remains a hash, however buckets are now CK_SLIST. For
updates a bucket mutex is obtained, for read an SMR section is
entered.

Reviewed by:	markj, rscheff
Differential revision:	https://reviews.freebsd.org/D29729
2021-04-20 10:02:20 -07:00
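The split between locked updates and lockless lookups can be approximated in userland with C11 atomics and a pthread mutex. This is only a sketch under strong assumptions: the names (`hc_entry`, `hc_lookup_mtu()`, `hc_update_mtu()`) are hypothetical, entries are never removed, and the safe memory reclamation that SMR provides in the kernel is not modelled at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define HC_BUCKETS 16

/* Hypothetical cache entry; only the MTU is cached in this sketch. */
struct hc_entry {
	uint32_t addr;
	_Atomic uint32_t mtu;
	struct hc_entry *_Atomic next;
};

struct hc_bucket {
	pthread_mutex_t lock;		/* taken by writers only */
	struct hc_entry *_Atomic head;
};

static struct hc_bucket hc_table[HC_BUCKETS];

static void
hc_init(void)
{
	for (int i = 0; i < HC_BUCKETS; i++) {
		pthread_mutex_init(&hc_table[i].lock, NULL);
		atomic_init(&hc_table[i].head, NULL);
	}
}

/* Reader: walks the bucket without taking the mutex. Returns 0 on a miss. */
static uint32_t
hc_lookup_mtu(uint32_t addr)
{
	struct hc_bucket *b = &hc_table[addr % HC_BUCKETS];
	struct hc_entry *e;

	for (e = atomic_load_explicit(&b->head, memory_order_acquire);
	    e != NULL;
	    e = atomic_load_explicit(&e->next, memory_order_acquire))
		if (e->addr == addr)
			return (atomic_load_explicit(&e->mtu,
			    memory_order_relaxed));
	return (0);
}

/* Writer: serialised per bucket; new entries are published at the head. */
static int
hc_update_mtu(uint32_t addr, uint32_t mtu)
{
	struct hc_bucket *b = &hc_table[addr % HC_BUCKETS];
	struct hc_entry *e;

	assert(mtu != 0);
	pthread_mutex_lock(&b->lock);
	for (e = atomic_load(&b->head); e != NULL; e = atomic_load(&e->next))
		if (e->addr == addr) {
			atomic_store_explicit(&e->mtu, mtu,
			    memory_order_relaxed);
			pthread_mutex_unlock(&b->lock);
			return (0);
		}
	if ((e = malloc(sizeof(*e))) == NULL) {
		pthread_mutex_unlock(&b->lock);
		return (-1);
	}
	e->addr = addr;
	atomic_init(&e->mtu, mtu);
	atomic_init(&e->next, atomic_load(&b->head));
	/* Release store: readers see a fully initialised entry. */
	atomic_store_explicit(&b->head, e, memory_order_release);
	pthread_mutex_unlock(&b->lock);
	return (0);
}
```

Because readers never block, a flood of lookups for a handful of hosts no longer serialises on a bucket lock; only the comparatively rare updates contend on the mutex.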
Gleb Smirnoff
1db08fbe3f tcp_input: always request read-locking of PCB for any pure SYN segment.
This is further rework of 08d9c92027.  Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.

This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.
2021-04-20 10:02:20 -07:00
Gleb Smirnoff
7b5053ce22 tcp_input: remove comments and assertions about tcpbinfo locking
They aren't valid since d40c0d47cd.
2021-04-20 10:02:20 -07:00
Richard Scheffenegger
a649f1f6fd tcp: Deal with DSACKs, and adjust rescue hole on success.
When a rescue retransmission is successful, rather than
inserting new holes to the left of it, adjust the old
rescue entry to cover the missed sequence space.

Also, as snd_fack may be stale by that point, pull it forward
in order to never create a hole left of snd_una/th_ack.

Finally, with DSACKs, tcp_sack_doack() may be called
with new full ACKs but a DSACK block. Account for this
eventuality properly to keep sacked_bytes >= 0.

MFC after: 3 days
Reviewed By: kbowling, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29835
2021-04-20 14:54:28 +02:00
Hans Petter Selasky
9ca874cf74 Add TCP LRO support for VLAN and VxLAN.
This change makes the TCP LRO code more generic and flexible with regards
to supporting multiple different TCP encapsulation protocols and in general
lays the ground for broader TCP LRO support. The main job of the TCP LRO code is
to merge TCP packets for the same flow, to reduce the number of calls to upper
layers. This reduces CPU usage and increases performance, due to being able to send
larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it
possible to avoid per-packet interaction by the host CPU.

Because the current TCP LRO code was tightly bound and optimized for TCP/IP
over ethernet only, several larger changes were needed. Also a minor bug was
fixed in the flushing mechanism for inactive entries, where the expire time,
"le->mtime" was not always properly set.

To avoid having to re-run time consuming regression tests for every change,
it was chosen to squash the following list of changes into a single commit:
- Refactor parsing of all address information into the "lro_parser" structure.
  This makes it easy to reuse the parsing code for inner headers.
- Speedup header data comparison. Don't compare field by field, but
  instead use an unsigned long array, where the fields get packed.
- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
  recursively, only applying deltas as the result of updating payload data.
- Make smaller inline functions doing one operation at a time instead of
  big functions having repeated code.
- Refactor the TCP ACK compression code to only execute once
  per TCP LRO flush. This gives a minor performance improvement and
  keeps the code simple.
- Use sbintime() for all time-keeping. This change also fixes flushing
  of inactive entries.
- Try to shrink the size of the LRO entry, because it is frequently zeroed.
- Removed unused TCP LRO macros.
- Cleanup unused TCP LRO statistics counters while at it.
- Try to use __predict_true() and __predict_false() to optimise CPU branch
  predictions.

Bump the __FreeBSD_version due to changing the "lro_ctrl" structure.

Tested by:	Netflix
Reviewed by:	rrs (transport)
Differential Revision:	https://reviews.freebsd.org/D29564
MFC after:	2 weeks
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-04-20 13:36:22 +02:00
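The "speedup header data comparison" item in the list above can be sketched as follows. The key layout and the names (`struct flow_key`, `flow_key_pack()`, `flow_key_equal()`) are assumptions made for illustration and are not the "lro_parser" structure from the commit; the point is only that fields packed into an `unsigned long` array can be compared word by word instead of field by field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KEY_WORDS 4	/* at least 16 bytes, enough for the 14 bytes packed below */

/* Hypothetical flow key: all header fields packed into machine words. */
struct flow_key {
	unsigned long w[KEY_WORDS];
};

/* Pack the fields back to back; unused trailing bytes stay zero. */
static void
flow_key_pack(struct flow_key *k, uint32_t src, uint32_t dst,
    uint16_t sport, uint16_t dport, uint16_t vlan_id)
{
	unsigned char *p = (unsigned char *)k->w;

	assert(sizeof(k->w) >= 14);
	memset(k, 0, sizeof(*k));
	memcpy(p, &src, sizeof(src));		p += sizeof(src);
	memcpy(p, &dst, sizeof(dst));		p += sizeof(dst);
	memcpy(p, &sport, sizeof(sport));	p += sizeof(sport);
	memcpy(p, &dport, sizeof(dport));	p += sizeof(dport);
	memcpy(p, &vlan_id, sizeof(vlan_id));
}

/* Compare whole words and accumulate differences without branching. */
static int
flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
	unsigned long diff = 0;

	for (int i = 0; i < KEY_WORDS; i++)
		diff |= a->w[i] ^ b->w[i];
	return (diff == 0);
}
```

Zeroing the key before packing matters: any padding left uninitialised would make two keys for the same flow compare as different.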
Gleb Smirnoff
faa9ad8a90 Fix off-by-one error in KASSERT from 02f26e98c7. 2021-04-19 17:20:19 -07:00
Richard Scheffenegger
b87cf2bc84 tcp: keep SACK scoreboard sorted when doing rescue retransmission
Reviewed By: tuexen, kbowling, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29825
2021-04-18 23:11:10 +02:00
Michael Tuexen
9e644c2300 tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special privileges or specific support by the OS.
This is joint work with rrs.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D29469
2021-04-18 16:16:42 +02:00
Richard Scheffenegger
2e97826052 rack: Fix ECN on finalizing session.
Maintain code similarity between RACK and base stack
for ECN. This may not strictly be necessary, depending
when a state transition to FIN_WAIT_1 is done in RACK
after a shutdown() or close() syscall.

MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29658
2021-04-17 20:16:42 +02:00
Richard Scheffenegger
d1de2b05a0 tcp: Rename rfc6675_pipe to sack.revised, and enable by default
As full support of RFC6675 is in place, deprecating
net.inet.tcp.rfc6675_pipe and enabling by default
net.inet.tcp.sack.revised.

Reviewed By: #transport, kbowling, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28702
2021-04-17 14:59:45 +02:00
Gleb Smirnoff
86046cf55f tcp_respond(): fix assertion, should have been done in 08d9c92027. 2021-04-16 15:39:51 -07:00
Gleb Smirnoff
cb8d7c44d6 tcp_syncache: add net.inet.tcp.syncache.see_other sysctl
A security feature from c06f087ccb appeared to be a huge bottleneck
under SYN flood. To mitigate that add a sysctl that would make
syncache(4) globally visible, ignoring UID/GID, jail(2) and mac(4)
checks. When turned on, we won't need to call crhold() on the listening
socket credential for every incoming SYN packet.

Reviewed by:	bz
2021-04-15 15:26:48 -07:00
John Baldwin
774c4c82ff TOE: Use a read lock on the PCB for syncache_add().
Reviewed by:	np, glebius
Fixes:		08d9c92027
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D29739
2021-04-13 16:31:04 -07:00
Gleb Smirnoff
8d5719aa74 syncache: simplify syncache_add() KPI to return struct socket pointer
directly, not overwriting the listen socket pointer argument.
Not a functional change.
2021-04-12 08:27:40 -07:00
Gleb Smirnoff
08d9c92027 tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
the existing entanglement of tcp_input + stacks doesn't allow keeping
this change small. Consider this patch a first approach to the problem.

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D29576
2021-04-12 08:25:31 -07:00
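The lock selection this commit describes boils down to a test on the TCP flags. The sketch below uses the standard TCP flag bit values, while the enum and the function name `pcb_lock_for_segment()` are hypothetical stand-ins for the kernel's PCB lookup modes.

```c
#include <assert.h>
#include <stdint.h>

/* Standard TCP header flag bits. */
#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_ACK 0x10

/* Hypothetical lookup modes for this sketch. */
enum pcb_lock {
	PCB_RLOCK_IF_LISTENING,	/* read lock when the socket is listening */
	PCB_WLOCK		/* write lock, the default */
};

/*
 * A segment with SYN set and ACK clear does not modify the listening
 * PCB, so a read lock is sufficient for it.
 */
static enum pcb_lock
pcb_lock_for_segment(uint8_t thflags)
{
	enum pcb_lock mode;

	mode = ((thflags & (TH_SYN | TH_ACK)) == TH_SYN) ?
	    PCB_RLOCK_IF_LISTENING : PCB_WLOCK;
	assert(mode == PCB_WLOCK || (thflags & TH_ACK) == 0);
	return (mode);
}
```

Since many SYNs can hold the read lock at once, the listening PCB stops being a serialisation point under a SYN flood.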
Alexander V. Chernikov
c3a456defa Always use inp fib in the inp_lookup_mcast_ifp().
inp_lookup_mcast_ifp() is static and is only used in the inp_join_group().
The latter function is also static, and is only used in the inp_setmoptions(),
 which relies on inp being non-NULL.

As a result, in the current code, inp_lookup_mcast_ifp() is always called
 with non-NULL inp. Eliminate unused RT_DEFAULT_FIB condition and always
 use inp fib instead.

Differential Revision:	https://reviews.freebsd.org/D29594
Reviewed by:		kp
MFC after:		2 weeks
2021-04-10 13:47:49 +00:00
Gleb Smirnoff
1a7fe55ab8 tcp_hostcache: make THC_LOCK/UNLOCK macros to work with hash head pointer.
Not a functional change.
2021-04-09 14:07:35 -07:00
Gleb Smirnoff
4f49e3382f tcp_hostcache: style(9)
Reviewed by:	rscheff
2021-04-09 14:07:27 -07:00
Gleb Smirnoff
7c71f3bd6a tcp_hostcache: remove extraneous check.
All paths leading here already checked this setting.

Reviewed by:	rscheff
2021-04-09 14:07:19 -07:00
Gleb Smirnoff
0c25bf7e7c tcp_hostcache: implement tcp_hc_updatemtu() via tcp_hc_update.
Locking changes are planned here, and without this change there
would be too much copy-and-paste between these two functions.

Reviewed by:	rscheff
2021-04-09 14:06:44 -07:00
Richard Scheffenegger
b878ec024b tcp: Use jenkins_hash32() in hostcache
As other parts of the base tcp stack (e.g.
tcp fastopen) already use jenkins_hash32,
and the properties appear reasonably good,
switching to use that.

Reviewed By: tuexen, #transport, ae
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29515
2021-04-08 20:29:19 +02:00
Gleb Smirnoff
373ffc62c1 tcp_hostcache.c: remove unneeded includes.
Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
29acb54393 tcp_hostcache: add bool argument for tcp_hc_lookup() to tell whether we
are looking to only read from the result, or to update it as well.
For now this doesn't affect locking, but allows pushing the stats and expire
updates into a single place.

Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
489bde5753 tcp_hostcache: hide rmx_hits/rmx_updates under ifdef.
They have little value unless you do some profiling investigations,
but they are a performance bottleneck.

Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Gleb Smirnoff
2cca4c0ee0 Remove tcp_hostcache.h. Everything is private.
Reviewed by:	rscheff
2021-04-08 10:58:44 -07:00
Richard Scheffenegger
90cca08e91 tcp: Prepare PRR to work with NewReno LossRecovery
Add proper PRR vnet declarations for consistency.
Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation
for it to deal with non-SACK window reduction (after loss).

No functional change.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29440
2021-04-08 19:16:31 +02:00
Richard Scheffenegger
9f2eeb0262 [tcp] Fix ECN on finalizing sessions.
A subtle oversight would change new data packets
sent after a shutdown() or close() call, while the send
buffer is still draining.

MFC after: 3 days
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29616
2021-04-08 15:26:09 +02:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Richard Scheffenegger
a04906f027 fix typo in 38ea2bd069 2021-04-02 20:34:33 +02:00
Richard Scheffenegger
38ea2bd069 Use sbuf_drain unconditionally
After making sbuf_drain safe for external use,
there is no need to protect the call.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29545
2021-04-02 20:27:46 +02:00
Richard Scheffenegger
9aef4e7c2b tcp: Shouldn't drain empty sbuf
MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29524
2021-04-01 17:18:38 +02:00
Richard Scheffenegger
02f26e98c7 tcp: Add hash histogram output and validate bucket length accounting
Provide a histogram output to check whether the hashsize or
bucketlimit could be optimized. Also add some basic sanity
checks around the accounting of the hash utilization.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29506
2021-04-01 14:44:14 +02:00
Richard Scheffenegger
529a2a0f27 tcp: For hostcache performance, use atomics instead of counters
As accessing the tcp hostcache happens frequently on some
classes of servers, it was recommended to use atomic_add/subtract
rather than (per-CPU distributed) counters, which have to be
summed up at high cost to cache efficiency.

PR: 254333
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Reviewed By: #transport, tuexen, jtl
Differential Revision: https://reviews.freebsd.org/D29522
2021-04-01 10:03:30 +02:00
Richard Scheffenegger
95e56d31e3 tcp: Make hostcache.cache_count MPSAFE by using a counter_u64_t
Addressing the underlying root cause for cache_count to
show unexpectedly high values, by protecting all arithmetic on
that global variable by using counter(9).

PR:		254333
Reviewed By: tuexen, #transport
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29510
2021-03-31 20:24:13 +02:00
Richard Scheffenegger
869880463c tcp: drain tcp_hostcache_list in between per-bucket locks
Explicitly drain the sbuf after completing each hash bucket
to minimize the work performed while holding the hash
bucket lock.

PR:		254333
MFC after:	2 weeks
Reviewed By:	tuexen, jhb, #transport
Sponsored by: 	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29483
2021-03-31 19:24:21 +02:00
Andrey V. Elsukov
c80a4b76ce ipdivert: check that PCB is still valid after taking INPCB_RLOCK.
We are inspecting PCBs of divert sockets under a NET_EPOCH section,
but the PCB could already be detached, so we should check the INP_FREED flag
after taking INP_RLOCK.

PR:		254478
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29420
2021-03-30 12:31:09 +03:00
Richard Scheffenegger
cb0dd7e122 tcp: reduce memory footprint when listing tcp hostcache
In tcp_hostcache_list, the sbuf used would need a large (~2MB)
blocking allocation of memory (M_WAITOK), when listing a
full hostcache. This may stall the requestor for an indeterminate
time.

A further optimization is to return the expected userspace
buffersize right away, rather than preparing the output of
each current entry of the hostcache, provided by: @tuexen.

This makes use of the ready-made functions of sbuf to work
with sysctl, and repeatedly drain the much smaller buffer.

PR: 254333
MFC after: 2 weeks
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29471
2021-03-28 23:50:23 +02:00
Richard Scheffenegger
b9f803b7d4 tcp: Use PRR for ECN congestion recovery
MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28972
2021-03-26 02:06:15 +01:00
Richard Scheffenegger
eb3a59a831 tcp: Refactor PRR code
No functional change intended.

MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29411
2021-03-26 00:01:34 +01:00
Richard Scheffenegger
0533fab89e tcp: Perform simple fast retransmit when SACK Blocks are missing on SACK session
MFC after: 2 weeks
Reviewed By: #transport, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28634
2021-03-25 23:23:48 +01:00
Michael Tuexen
d995cc7e54 sctp: fix handling of RTO.initial of 1 ms
MFC after:	3 days
Reported by:	syzbot+5eb0e009147050056ce9@syzkaller.appspotmail.com
2021-03-22 16:44:18 +01:00
Michael Tuexen
40f41ece76 tcp: improve handling of SYN segments in SYN-SENT state
Ensure that the stack does not generate a DSACK block for user
data received on a SYN segment in SYN-SENT state.

Reviewed by:		rscheff
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D29376
Sponsored by:		Netflix, Inc.
2021-03-22 15:58:49 +01:00
Richard Scheffenegger
e9f029831f fix panic when rescue retransmission and FIN overlap
PR:           254244
PR:           254309
Reviewed By:  #transport, hselasky, tuexen
MFC after:    3 days
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29315
2021-03-17 17:12:04 +01:00