Commit Graph

549 Commits

Author SHA1 Message Date
Randall Stewart
631449d5d0 tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's
The PUSH bit itself was not actually being moved properly to
the right edge, and the FIN bit was incorrectly placed on the left
edge. We fix these two issues as well as plumb in the mtu_change for
alternate stacks.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30413
2021-05-24 14:42:15 -04:00
Richard Scheffenegger
0471a8c734 tcp: SACK Lost Retransmission Detection (LRD)
Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931
2021-05-10 19:06:20 +02:00
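
A minimal userland sketch of enabling the knob named above via sysctlbyname(3); the sysctl name is taken from the commit message, while the program shape and error handling are purely illustrative:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int on = 1;

        /* Enable SACK Lost Retransmission Detection (off by default). */
        if (sysctlbyname("net.inet.tcp.do_lrd", NULL, NULL,
            &on, sizeof(on)) == -1) {
            perror("sysctlbyname");
            return (1);
        }
        return (0);
    }
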
Randall Stewart
9867224bab tcp: Host cache and rack ending up with incorrect values.
The hostcache has, up to now, been updated in the discard callback
but without checking whether we are all done (the race where there is
more than one call and the counter has not yet reached zero). This
means that when the race occurs, we end up calling hc_update
more than once. Also, alternate stacks can keep their srtt/rttvar
in different formats (for example, rack keeps its values in microseconds).
Since we call hc_update *before* the stack fini(), the
values will be in the wrong format.

Rack, on the other hand, needs to convert items pulled from the
hostcache into its internal format, else it may end up with
wildly incorrect values from the hostcache. In the process,
let's commonize the update mechanism for srtt/rttvar, since we
now have more than one place that needs to call it.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30172
2021-05-10 11:25:51 -04:00
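
The format mismatch described above is essentially a unit-conversion problem; a hedged, self-contained sketch of the kind of helper the commit centralizes (names and scaling are illustrative, not the committed code):

    #include <stdint.h>

    /*
     * Illustrative only: rack keeps srtt/rttvar in microseconds, while the
     * hostcache expects the base stack's representation.  A stack-aware
     * conversion like this has to run before the values are stored or used.
     */
    static inline uint32_t
    usecs_to_hostcache_units(uint64_t srtt_usec, uint32_t hc_units_per_sec)
    {
        return ((uint32_t)(srtt_usec * hc_units_per_sec / 1000000));
    }
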
Randall Stewart
5d8fd932e4 This brings FreeBSD into sync with the Netflix versions of rack and bbr.
This fixes several reported breakages (panics) introduced since the
tcp_lro code was committed. Quite a few new features are
now in rack (perfecting of DGP -- Dynamic Goodput Pacing -- among the
largest). There is also support for ack-war prevention. Documentation
coming soon on rack.

Sponsored by:           Netflix
Reviewed by:		rscheff, mtuexen
Differential Revision:	https://reviews.freebsd.org/D30036
2021-05-06 11:22:26 -04:00
Navdeep Parhar
01d74fe1ff Path MTU discovery hooks for offloaded TCP connections.
Notify the TOE driver when an ICMP type 3 code 4 (Fragmentation
needed and DF set) message is received for an offloaded connection.
This gives the driver an opportunity to lower the path MTU for the
connection and resume transmission, much like what the kernel does for
the connections that it handles.

Reviewed by:	glebius@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D29755
2021-04-21 13:00:16 -07:00
Hans Petter Selasky
9ca874cf74 Add TCP LRO support for VLAN and VxLAN.
This change makes the TCP LRO code more generic and flexible with regard
to supporting multiple different TCP encapsulation protocols, and in general
lays the groundwork for broader TCP LRO support. The main job of the TCP LRO code is
to merge TCP packets for the same flow, to reduce the number of calls to upper
layers. This reduces CPU usage and increases performance, due to being able to send
larger TSO offloaded data chunks at a time. Basically, TCP LRO makes it
possible to avoid per-packet interaction by the host CPU.

Because the current TCP LRO code was tightly bound to and optimized for TCP/IP
over Ethernet only, several larger changes were needed. Also, a minor bug was
fixed in the flushing mechanism for inactive entries, where the expire time,
"le->mtime", was not always properly set.

To avoid having to re-run time-consuming regression tests for every change,
it was chosen to squash the following list of changes into a single commit:
- Refactor parsing of all address information into the "lro_parser" structure.
  This makes it easy to reuse the parsing code for inner headers.
- Speed up header data comparison. Don't compare field by field, but
  instead use an unsigned long array, where the fields get packed.
- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
  recursively, only applying deltas as the result of updating payload data.
- Make smaller inline functions doing one operation at a time instead of
  big functions having repeated code.
- Refactor the TCP ACK compression code to only execute once
  per TCP LRO flush. This gives a minor performance improvement and
  keeps the code simple.
- Use sbintime() for all time-keeping. This change also fixes flushing
  of inactive entries.
- Try to shrink the size of the LRO entry, because it is frequently zeroed.
- Remove unused TCP LRO macros.
- Clean up unused TCP LRO statistics counters while at it.
- Try to use __predict_true() and __predict_false() to optimize CPU branch
  predictions.

Bump the __FreeBSD_version due to changing the "lro_ctrl" structure.

Tested by:	Netflix
Reviewed by:	rrs (transport)
Differential Revision:	https://reviews.freebsd.org/D29564
MFC after:	2 weeks
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-04-20 13:36:22 +02:00
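
The "applying deltas" bullet above refers to incremental Internet-checksum updates in the spirit of RFC 1624; a standalone sketch of the idea (not the committed helper):

    #include <stdint.h>

    /* Fold a single 16-bit field change into an existing ones-complement checksum. */
    static uint16_t
    cksum_update16(uint16_t cksum, uint16_t old_val, uint16_t new_val)
    {
        uint32_t sum;

        sum = (uint16_t)~cksum + (uint16_t)~old_val + new_val;  /* RFC 1624, eqn. 3 */
        sum = (sum & 0xffff) + (sum >> 16);                     /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);
        return ((uint16_t)~sum);
    }
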
Michael Tuexen
9e644c2300 tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special privileges or specific support by the OS.
This is joint work with rrs.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D29469
2021-04-18 16:16:42 +02:00
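
A hedged userland sketch of how an application might opt a TCP socket into UDP encapsulation; the socket-option name below is an assumption for illustration, not something stated in the commit message:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Assumed option name; guarded so the sketch still compiles on older headers. */
    static int
    enable_tcp_over_udp(int s, int udp_port)
    {
    #ifdef TCP_REMOTE_UDP_ENCAPS_PORT
        return (setsockopt(s, IPPROTO_TCP, TCP_REMOTE_UDP_ENCAPS_PORT,
            &udp_port, sizeof(udp_port)));
    #else
        (void)s; (void)udp_port;
        return (-1);            /* headers predate TCP-over-UDP support */
    #endif
    }
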
Gleb Smirnoff
86046cf55f tcp_respond(): fix assertion, should have been done in 08d9c92027. 2021-04-16 15:39:51 -07:00
Gleb Smirnoff
08d9c92027 tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When a packet is a SYN packet, we don't need to modify any existing PCB.
Normally a SYN arrives on a listening socket; we either create a syncache
entry or generate a syncookie, but we don't modify anything in the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when a SYN arrives on a synchronized connection, we still
don't need write access to the PCB to send a challenge ACK or just to
drop. There is only one exception - tcptw recycling. However, the
existing entanglement of tcp_input + stacks doesn't allow this change
to be made small. Consider this patch a first approach to the problem.

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D29576
2021-04-12 08:25:31 -07:00
Randall Stewart
69a34e8d02 Update the LRO processing code so that we can support
further CPU enhancements for compressed acks. These
are acks that are compressed into an mbuf. The transport
has to be aware of how to process these, and an upcoming
update to rack will do so. You need the rack changes
to actually test and validate these, since if the transport
does not support mbuf compression, the old code paths
stay in place. This commit also takes out the concept
of logging without holding a lock (which was quite
dangerous, was only for some early debugging, and had
been left in the code).

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28374
2021-02-17 10:41:01 -05:00
Richard Scheffenegger
bc7ee8e5bc Address panic with PRR due to missed initialization of recover_fs
Summary:
When using the base stack in conjunction with RACK, it appears that
infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh.

This leaves the recover flightsize (sackhint.recover_fs) uninitialized,
leading to a div/0 panic.

Address this by initializing the variable just prior to first
use, if it has not been initialized already.

In order to prevent stale information from a prior recovery from
negatively impacting the PRR calculations in this event, also clear
recover_fs once loss recovery is finished.

Finally, improve the readability of the initialization of recover_fs
when t_dupacks == tcprexmtthresh by adjusting the indentation and
using the max(1, snd_nxt - snd_una) macro.

Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages

Reviewed By: rrs, kbowling, #transport

Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro

Differential Revision: https://reviews.freebsd.org/D28114
2021-01-20 12:06:34 +01:00
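
In code terms, the fix described above amounts to a guarded initialization along these lines (a self-contained sketch with minimal stand-in types, not the exact diff):

    #include <stdint.h>

    /* Minimal stand-in types; the real fields live in FreeBSD's struct tcpcb. */
    struct sackhint { uint32_t recover_fs; };
    struct tcpcb    { struct sackhint sackhint; uint32_t snd_nxt, snd_una; };

    #define max(a, b) ((a) > (b) ? (a) : (b))

    static void
    prr_init_recover_fs(struct tcpcb *tp)
    {
        /* Initialize the recovery flightsize just before first use. */
        if (tp->sackhint.recover_fs == 0)
            tp->sackhint.recover_fs = max(1, tp->snd_nxt - tp->snd_una);
    }
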
Michael Tuexen
d2b3ceddcc tcp: add sysctl to tolerate TCP segments missing timestamps
When timestamp support has been negotiated, TCP segments received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models) which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows interoperation with broken stacks.

Reviewed by:		jtl@, rscheff@
Differential Revision:	https://reviews.freebsd.org/D28142
Sponsored by:		Netflix, Inc.
PR:			252449
MFC after:		1 week
2021-01-14 19:28:25 +01:00
John Baldwin
ce39811544 Save the current TCP pacing rate in t_pacing_rate.
Reviewed by:	gallatin, gnn
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26875
2020-10-29 00:03:19 +00:00
Richard Scheffenegger
5432120028 Extend netstat to display TCP stack and detailed congestion state (2)
Extend netstat to display TCP stack and detailed congestion state

Adding the "-c" option used to show detailed per-connection
congestion control state for TCP sessions.

This is one summary patch, which adds the relevant variables into
xtcpcb. As previous "spare" space is used, these changes are ABI
compatible.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26518
2020-10-09 10:55:19 +00:00
Richard Scheffenegger
e399566123 TCP: send full initial window when timestamps are in use
The fast path in tcp_output tries to send out
full segments and avoids sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider TCP options like timestamps,
while the initial window calculation uses
the correct dynamic tcp_maxseg() function.

Due to this interaction, the last full-size segment
is considered too short and not sent out immediately.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26478
2020-09-25 10:38:19 +00:00
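
The interaction can be illustrated with a toy check: once options such as timestamps consume space, a segment that fills the option-adjusted MSS never reaches the static t_maxseg threshold. A hedged, simplified sketch (not the committed logic):

    #include <stdbool.h>
    #include <stdint.h>

    /* With a 12-byte timestamp option, usable payload is t_maxseg - 12. */
    static bool
    segment_looks_full(uint32_t len, uint32_t t_maxseg, uint32_t optlen)
    {
        /* Old idea: len >= t_maxseg (never true once options eat space).  */
        /* Fixed idea: compare against the dynamic, option-adjusted value. */
        return (len >= t_maxseg - optlen);
    }
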
Michael Tuexen
42d7560796 Export the name of the congestion control. This will be used by sockstat
and netstat.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26412
2020-09-13 09:06:50 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Randall Stewart
8315f1ea26 The recent changes to move the ref count increment
back from the end of the function created an issue.
If one of the routines returns NULL during setup,
we have inp's with extra references (which is why
the increment was at the end).

Also, the stack switch return code was being ignored,
and it actually has meaning: if the stack cannot take over,
it should return NULL.

Fix both of these situations by being sure to test the
return code and, of course, in any case that returns NULL (there
are 3), make sure we properly reduce the ref count.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D25903
2020-07-31 10:03:32 +00:00
Richard Scheffenegger
c201ce0b4a Fix KASSERT during tcp_newtcpcb when low on memory
While testing with the system default cc set to cubic and
running a memory exhaustion validation, FreeBSD panics due to a
missing inpcb reference / lock.

Reviewed by:	rgrimes (mentor), tuexen (mentor)
Approved by:	rgrimes (mentor), tuexen (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25583
2020-07-07 12:10:59 +00:00
Alexander V. Chernikov
a37a5246ca Use fib[46]_lookup() in mtu calculations.
fib[46]_lookup_nh_ represents the pre-epoch generation of the fib API,
providing fewer guarantees about pointer validity and requiring
on-stack data copying.

The conversion is straightforward, as the only two differences are the
requirement of running in the network epoch and the need to handle the
RTF_GATEWAY case in the caller code.

Differential Revision:	https://reviews.freebsd.org/D24974
2020-05-28 08:00:08 +00:00
Randall Stewart
e570d231f4 This change does a small preparatory step in getting the
latest rack and bbr in from the NF repo. When those come
in, the OOB data handling where syzkaller crashes will be fixed.

Differential Revision:	https://reviews.freebsd.org/D24575
2020-04-27 16:30:29 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is built on top of the nexthop objects introduced in r359823.

Nexthops are separate data structures, containing all necessary information
to perform packet forwarding, such as the gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with other parts of the stack, and allows
faster lookup algorithms to be introduced transparently.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring an rtentry reference, which is protected by a per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or the link goes down, the cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces the rtentry with the actual forwarding nexthop as
the cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remain the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
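
For reference, the annotation being added looks roughly like this in a SYSCTL_PROC declaration (a generic kernel-side sketch, not a node taken from this diff):

    /* Generic sketch of an MPSAFE-annotated sysctl handler declaration. */
    SYSCTL_PROC(_net_inet_tcp, OID_AUTO, example_knob,
        CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, 0,
        example_knob_sysctl, "I",
        "Example knob whose handler does not require Giant");
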
Randall Stewart
481be5de9d White space cleanup -- remove trailing tabs or spaces
from any line.

Sponsored by:	Netflix Inc.
2020-02-12 13:31:36 +00:00
Gleb Smirnoff
0452a1f3ef Add documenting NET_EPOCH_ASSERT() to tcp_drop(). 2020-01-22 02:38:46 +00:00
Gleb Smirnoff
bab98355f9 Add some documenting NET_EPOCH_ASSERTs. 2020-01-22 02:37:47 +00:00
Gleb Smirnoff
4c69f60a8e Fix yet another regression from r354484. The error code from cr_cansee()
aliases with hard errors from other operations.

Reported by:	flo
2020-01-13 21:12:10 +00:00
Bjoern A. Zeeb
334fc5822b vnet: virtualise more network stack sysctls.
Virtualise tcp_always_keepalive, TCP and UDP log_in_vain.  All three are
set in the netoptions startup script, which we would love to run for VNETs
as well [1].

While virtualising the log_in_vain sysctls seems pointless at first for as
long as the kernel message buffer is not virtualised, it at least allows
an administrator to debug the base system or an individual jail if needed
without turning the logging on for all jails running on a system.

PR:		243193 [1]
MFC after:	2 weeks
2020-01-08 23:30:26 +00:00
Randall Stewart
1cf55767b8 This commit is a bit of a rearranging of deck chairs. It
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now, if you don't have both NF_stats and
stats on, it disables them. As soon as the rest of the stats framework
lands, we can remove that restriction and then just use stats when
defined.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D22479
2019-12-17 16:08:07 +00:00
Gleb Smirnoff
e5a084d020 Fix regression from r354484. Don't leak pcb lock if cr_canseeinpcb()
returns non-zero.

PR:		242415
2019-12-04 22:41:52 +00:00
Edward Tomasz Napierala
adc56f5a38 Make use of the stats(3) framework in the TCP stack.
This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time.  stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by:	thj
Tested by:	thj
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20655
2019-12-02 20:58:04 +00:00
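
A hedged userland sketch of pulling the raw blob with the new socket option; a real consumer would then decode it with the stats(3) fetch APIs mentioned above, and the buffer sizing here is purely illustrative:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdlib.h>

    /* Fetch the per-connection stats blob from a connected TCP socket. */
    static void *
    fetch_tcp_stats(int s, socklen_t *lenp)
    {
    #ifdef TCP_STATS
        socklen_t len = 64 * 1024;      /* illustrative upper bound */
        void *buf = malloc(len);

        if (buf != NULL &&
            getsockopt(s, IPPROTO_TCP, TCP_STATS, buf, &len) == 0) {
            *lenp = len;
            return (buf);               /* decode via the stats(3) API */
        }
        free(buf);
    #endif
        return (NULL);
    }
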
Gleb Smirnoff
032677ceb5 Now that there is no R/W lock on the PCB list, the pcblist sysctl
handlers can be greatly simplified.  All the previous double
cycling and complex locking was added to avoid having these functions
hold global PCB locks for extended periods of time, preventing
the addition of new entries.
2019-11-07 21:27:32 +00:00
Gleb Smirnoff
d797164a86 Since r353292, on the input path we are always in the network epoch
when we look up PCBs.  Thus, do not enter the epoch recursively in
in_pcblookup_hash() and in6_pcblookup_hash().  The same applies to
tcp_ctlinput() and tcp6_ctlinput().

This leaves several sysctl(9) handlers that return PCB credentials
unprotected.  Add epoch enter/exit to all of them.

Differential Revision:	https://reviews.freebsd.org/D22197
2019-11-07 20:49:56 +00:00
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove a few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
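
The conversion pattern is mechanical: the epoch tracker lives on the caller's stack and brackets the read-only section. A schematic kernel-side sketch (the body is illustrative):

    /* Schematic replacement pattern; the traversal body is illustrative. */
    struct epoch_tracker et;

    NET_EPOCH_ENTER(et);        /* was: INP_INFO_RLOCK(&V_tcbinfo)   */
    /* ... read-only PCB lookup / list traversal ... */
    NET_EPOCH_EXIT(et);         /* was: INP_INFO_RUNLOCK(&V_tcbinfo) */
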
Michael Tuexen
71e85612a9 Replacing MD5 with SipHash improves the performance of the TCP time stamp
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.

Reviewed by:		gallatin@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21616
2019-09-28 13:13:23 +00:00
Randall Stewart
af9b9e0d9f This adds in the missing counter initialization which
I had forgotten to bring over. Oops.

Differential Revision: https://reviews.freebsd.org/D21127
2019-09-06 18:29:48 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotiation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of pru_ready(),
leaving the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY flag is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid disabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
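
A heavily hedged userland sketch of handing transmit keys to the kernel through the TCP_TXTLS_ENABLE option named above; the structure, field, and constant names are best-effort assumptions about the KTLS API rather than details given in the commit message:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdint.h>
    #ifdef __FreeBSD__
    #include <sys/ktls.h>               /* assumed home of struct tls_enable */
    #endif

    static int
    enable_tx_ktls(int s, const uint8_t *key, int keylen,
        const uint8_t *iv, int ivlen)
    {
    #if defined(TCP_TXTLS_ENABLE) && defined(CRYPTO_AES_NIST_GCM_16)
        /* Field names below are assumptions, not taken from the commit. */
        struct tls_enable en = {
            .cipher_algorithm = CRYPTO_AES_NIST_GCM_16,
            .cipher_key = key,
            .cipher_key_len = keylen,
            .iv = iv,
            .iv_len = ivlen,
            .tls_vmajor = 3,            /* TLS 1.2 is record version 3.3 */
            .tls_vminor = 3,
        };

        return (setsockopt(s, IPPROTO_TCP, TCP_TXTLS_ENABLE, &en, sizeof(en)));
    #else
        (void)s; (void)key; (void)keylen; (void)iv; (void)ivlen;
        return (-1);
    #endif
    }
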
Michael Tuexen
d21036e033 Add a sysctl variable ts_offset_per_conn to change the computation
of the TCP TS offset from taking the IP addresses and the TCP port
numbers into account to a version taking only the IP addresses
into account. This works around broken middleboxes or endpoints.
The default is to keep the behaviour, which is also the behaviour
recommended in RFC 7323.

Reported by:		devgs@ukr.net
Reviewed by:		rrs@
MFC after:		2 weeks
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D20980
2019-07-23 21:28:20 +00:00
John Baldwin
6b69072acc Reject attempts to register a TCP stack being unloaded.
Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20617
2019-06-27 22:34:05 +00:00
Michael Tuexen
0999766ddf Add sysctl variable net.inet.tcp.rexmit_initial for setting RTO.Initial
used by TCP.

Reviewed by:		rrs@, 0mp@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19355
2019-03-23 21:36:59 +00:00
John Baldwin
dbcc200058 Various cleanups to the management of multiple TCP stacks.
- Use strlcpy() with sizeof() instead of strncpy().

- Simplify initialization of TCP functions structures.

  init_tcp_functions() was already called before the first call to
  register a stack.  Just inline the work in the SYSINIT and remove
  the racy helper variable.  Instead, KASSERT that the rw lock is
  initialized when registering a stack.

- Protect the default stack via a direct pointer comparison.

  The default stack uses the name "freebsd" instead of "default" so
  this protection wasn't working for the default stack anyway.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19152
2019-02-27 20:24:23 +00:00
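
The strncpy()-to-strlcpy() change is the usual idiom for guaranteed NUL termination when copying into a fixed-size name field; a generic sketch (the struct is illustrative):

    #include <string.h>

    struct stack_info {
        char name[16];                  /* illustrative fixed-size field */
    };

    static void
    set_stack_name(struct stack_info *si, const char *name)
    {
        /* strncpy(si->name, name, sizeof(si->name)) may leave no NUL. */
        strlcpy(si->name, name, sizeof(si->name));  /* always NUL-terminated */
    }
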
Mark Johnston
79db6fe7aa Plug some networking sysctl leaks.
Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland.  Fix a number of such handlers.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	tuexen
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18301
2018-11-22 20:49:41 +00:00
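
The bug class being fixed: a handler copies out a structure whose padding or unused fields were never written. The defensive pattern is to zero the export structure before filling it (a schematic kernel-side fragment; "struct xexample" is illustrative):

    /* Schematic fragment; the struct and handler names are illustrative. */
    struct xexample xe;

    bzero(&xe, sizeof(xe));     /* previously missing: padding leaked stack memory */
    xe.xe_len = sizeof(xe);
    /* ... fill only the fields that are meant to be exported ... */
    error = SYSCTL_OUT(req, &xe, sizeof(xe));
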
Michael Tuexen
90ab3571d8 Use arc4rand() instead of read_random() in the SCTP and TCP code.
This was suggested by jmg@.

Reviewed by:		delphij@, jmg@, jtl@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16860
2018-08-23 19:10:45 +00:00
Michael Tuexen
4ba1513d1a Don't use the explicit number 32 for the length of the secrets,
use sizeof() or explicit #defines instead. No functional change.
This was suggested by jmg@.

MFC after:		1 month
XMFC with:		r338053
Sponsored by:		Netflix, Inc.
2018-08-23 06:03:59 +00:00
Michael Tuexen
5dff1c3845 Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPv6 packets.

This is fixed by reducing the MSS to the appropriate value. In addition,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not strictly required, but is done since TCP
is conservative.

PR:			173444
Reviewed by:		bz@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16796
2018-08-21 14:12:30 +00:00
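
For context, the option involved (from RFC 3542) pins the socket to the IPv6 minimum MTU of 1280 bytes; with the fix above, TCP shrinks its MSS to match instead of emitting fragmented packets. A small userland sketch:

    #include <sys/socket.h>
    #include <netinet/in.h>

    static int
    use_ipv6_min_mtu(int s)
    {
        int on = 1;             /* 1: always use the minimum MTU (1280) */

        return (setsockopt(s, IPPROTO_IPV6, IPV6_USE_MIN_MTU, &on, sizeof(on)));
    }
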
Randall Stewart
c28440db29 This change represents a substantial restructure of the way we
reassemble inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100
segments), which in a high BDP situation will make us a lot
more inefficient, as we drop segments received beyond the 100
entries. What this restructure does is make the reassembly
buffer coalesce segments, putting an emphasis on the two
common cases (which avoid walking the list of segments), i.e.
where we add to the back of the queue of segments and where we
add to the front. The reassembly buffer also supports
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be enabled by uncommenting the defines.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D16626
2018-08-20 12:43:18 +00:00
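
The two fast paths called out above (extend the tail entry for in-order arrivals, extend the head entry when the front gap fills) can be sketched roughly as follows; the real code works on mbuf chains and enforces the segment limit, so this is only an illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal illustrative reassembly node; the real code uses its own queue entries. */
    struct rseg {
        uint32_t start, end;    /* sequence range covered by this entry */
    };

    static bool
    try_fast_coalesce(struct rseg *head, struct rseg *tail,
        uint32_t seq, uint32_t len)
    {
        if (tail != NULL && seq == tail->end) {         /* append to the back */
            tail->end += len;
            return (true);
        }
        if (head != NULL && seq + len == head->start) { /* prepend to the front */
            head->start = seq;
            return (true);
        }
        return (false);         /* fall back to walking the list */
    }
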
Michael Tuexen
8e02b4e00c Don't expose the uptime via the TCP timestamps.
The TCP client side, or the TCP server side when not using SYN cookies,
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same as the one used for the initial TSN.

Reviewed by:		rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16636
2018-08-19 14:56:10 +00:00
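
The scheme described above boils down to adding a per-connection offset, derived from a keyed hash over the four-tuple, to the uptime-based tick value. A hedged sketch; keyed_hash32() is a hypothetical stand-in for the kernel's keyed hash:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the kernel's keyed hash function. */
    uint32_t keyed_hash32(const uint8_t key[16], const void *data, size_t len);

    struct four_tuple {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
    };

    static uint32_t
    tcp_ts_value(uint32_t uptime_ticks, const uint8_t secret[16],
        const struct four_tuple *ft)
    {
        /* The per-connection offset hides the host uptime from the peer. */
        return (uptime_ticks + keyed_hash32(secret, ft, sizeof(*ft)));
    }
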
Andrew Turner
5f901c92a8 Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:35:52 +00:00
Matt Macy
2269988749 NULL out cc_data in pluggable TCP {cc}_cb_destroy
When ABE was added to NewReno (rS331214) and a leak was fixed (rS333699), it
gained a destructor (newreno_cb_destroy) for per-connection state. Other
congestion controls may allocate and free cc_data on entry and exit, but the
field is never explicitly NULLed when moving back to NewReno, which only
internally allocates stateful data (no entry constructor), resulting in a
situation where newreno_cb_destroy might be called on a junk pointer.

- NULL out cc_data in the framework after calling {cc}_cb_destroy.
- free(9) checks for NULL, so there is no need to perform not-NULL checks
  before calling free.
- Improve a comment about NewReno in tcp_ccalgounload.

This is the result of a debugging session from Jason Wolfe, Jason Eggleston,
and mmacy@ and very helpful insight from lstewart@.

Submitted by: Kevin Bowling
Reviewed by: lstewart
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16282
2018-07-22 05:37:58 +00:00
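
The framework-side change listed above corresponds to a pattern like this (a schematic kernel-side fragment using FreeBSD's CC_ALGO() accessor, not the exact diff):

    /* Schematic fragment of the teardown path described above. */
    if (CC_ALGO(tp)->cb_destroy != NULL)
        CC_ALGO(tp)->cb_destroy(tp->ccv);
    tp->ccv->cc_data = NULL;    /* don't hand the next algorithm a stale pointer */
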
Matt Macy
6573d7580b epoch(9): allow preemptible epochs to compose
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument;
  there's no longer any benefit to dropping the pcbinfo lock,
  and trying to do so just adds an error-prone branchfest to
  these functions
- Remove cases of same-function recursion on the epoch, as
  recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct
  thread, as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
2018-07-04 02:47:16 +00:00