273 Commits

Author SHA1 Message Date
Randall Stewart
d6fbbe5d42 Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.
The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831
2023-04-03 04:11:43 -04:00
Andrew Gallatin
c0e4090e3d ktls: Accurately track if ifnet ktls is enabled
This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS.  This flag meant that
now, or in the past, ifnet ktls was active on a socket.  Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection.  Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to  ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380
2023-02-09 12:44:44 -05:00
Justin Hibbits
3d0d5b21c9 IfAPI: Explicitly include <net/if_private.h> in netstack
Summary:
In preparation of making if_t completely opaque outside of the netstack,
explicitly include the header.  <net/if_var.h> will stop including the
header in the future.

Sponsored by:	Juniper Networks, Inc.
Reviewed by:	glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38200
2023-01-31 15:02:16 -05:00
Richard Scheffenegger
18b83b626a tcp: reduce the size of t_rttupdated in tcpcb
During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.

Reviewed By:		#transport, tuexen, guest-ccui
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D21117
2023-01-26 18:08:00 +01:00
Gordon Bergling
73e994a998 extra_tcp_stacks: Fix a common typo in source code comments
- s/orginal/original/

MFC after:	3 days
2023-01-19 14:11:00 +01:00
Gordon Bergling
432a398d86 tcp_rack(4): Fix a typo in a source code comment
- s/postion/position/

MFC after:	3 days
2023-01-11 12:02:25 +01:00
Andrew Gallatin
8ea4182995 tcp: Build RACK and BBR stacks as a part of LINT
When RACK and BBR were added to the kernel, they were put
behind 'WITH_EXTRA_TCP_STACKS=1'.   Unfortunately that was
never added to any NOTES file, so RACK & BBR were not compiled
with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels.
This lead to the stacks sometimes being broken.

This change:

- Fixes RACK so that it compiles with the various LINT-NO* kernels
- Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that
   RACK and BBR are compile tested regularly

Sponsored by: Netflix
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D37903
2023-01-10 16:16:43 -05:00
Randall Stewart
2e2a1c3139 Opps take out a stray left behind printf that was
for debugging.. Sorry.
2022-12-14 16:11:39 -05:00
Randall Stewart
e2e088ae86 Rack cannot be loaded without cc_newreno compiled into the kernel.
Right now rack will fail to load due to its hack in accessing symbol names
in cc_newreno. This was fine when newreno was always compiled into the
kernel but now ... not so much. Instead lets fix up rack to use the socket
option queries to get the information it wants and set the parameters. We
also fix the CC parameter so they are always settable.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37622
2022-12-14 15:37:48 -05:00
Gleb Smirnoff
eaabc93764 tcp: retire TCPDEBUG
This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.

We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket.  Also the tcp::debug DTrace probes look at
this flag on a socket.

Reviewed by:		gnn, tuexen
Discussed with:		rscheff, rrs, jtl
Differential revision:	https://reviews.freebsd.org/D37694
2022-12-14 09:54:06 -08:00
Mateusz Guzik
e6fc01f6be tcp: whack the stale declaration of rack_timer_stop
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-12-14 08:48:52 +00:00
Gleb Smirnoff
446ccdd08e tcp: use single locked callout per tcpcb for the TCP timers
Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts.  Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held.  Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

With single callout per tcpcb on connection teardown we should be able
to fully stop the callout and immediately free it, avoiding use of
callout_async_drain().  There is one gotcha here: callout_stop() can
actually touch our memory when a rare race condition happens.  See
comment above tcp_timer_stop().  Synchronous stop of the callout makes
tcp_discardcb() the single entry point for tcpcb destructor, merging the
tcp_freecb() to the end of the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions.  With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers.  The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Discussed with:		rrs, tuexen, rscheff	(the TCP part of the diff)
Reviewed by:		hselasky, kib, mav	(the callout part)
Differential revision:	https://reviews.freebsd.org/D37321
2022-12-07 09:00:48 -08:00
Gleb Smirnoff
918fa4227d tcp: remove tcp_timer_suspend()
It was a temporary code added together with RACK to fight against
TCP timer races.
2022-12-07 09:00:48 -08:00
Gleb Smirnoff
e68b379244 tcp: embed inpcb into tcpcb
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.

The most import one is the inpcb itself.  With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa.  Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that.  However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort.  The new intotcpcb() macro is ready for such move.

The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.

One interesting side effect is that now TCP data is allocated from
SMR-protected zone.  Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.

Large part of the change was done with sed script:

s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g

Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.

Differential revision:	https://reviews.freebsd.org/D37127
2022-12-07 09:00:48 -08:00
Michael Tuexen
bd4f986644 tcp: remove unused t_rttbest
No functional change intended.

Reviewed by:		rscheff@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D37401
2022-11-16 11:22:13 +01:00
Gleb Smirnoff
9eb0e8326d tcp: provide macros to access inpcb and socket from a tcpcb
There should be no functional changes with this commit.

Reviewed by:		rscheff
Differential revision:	https://reviews.freebsd.org/D37123
2022-11-08 10:24:40 -08:00
Gordon Bergling
bcf8fb7f03 tcp_bbr(4): Fix a typo in a source code comment
- s/retranmitted/retransmitted/

MFC after:	3 days
2022-11-08 14:59:56 +01:00
Richard Scheffenegger
b1258b7643 tcp: add conservative d.cep accounting algorithm
Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37281
2022-11-06 12:05:22 +01:00
Gleb Smirnoff
f504685a7a rack/bbr: put back assertion that connection is not in TIME-WAIT
The assertion was incorrectly removed in 0d7445193ab.  The leak of
a TIME-WAIT state into tfb_do_segment_nounlock method was fixed in
31bc602ff81.  The TIME-WAIT connections are processed by the main
tcp_input() always.
2022-10-31 08:30:59 -07:00
Richard Scheffenegger
83c1ec92e4 tcp: ECN preparations for ECN++, AccECN (tcp_respond)
tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.

Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36973
2022-10-20 21:48:27 +02:00
Gleb Smirnoff
53af690381 tcp: remove INP_TIMEWAIT flag
Mechanically cleanup INP_TIMEWAIT from the kernel sources.  After
0d7445193ab, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions.  Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision:	https://reviews.freebsd.org/D36400
2022-10-06 19:24:37 -07:00
Gleb Smirnoff
0d7445193a tcp: remove tcptw, the compressed timewait state structure
The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no
longer justify the complexity required to maintain it.  For longer
explanation please check out the email [1].

Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot.  The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].

This change just removes tcptw and leaves INP_TIMEWAIT.  The flag will
be removed in a separate commit.  This makes it easier to review and
possibly debug the changes.

[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite

Differential revision:	https://reviews.freebsd.org/D36398
2022-10-06 19:22:23 -07:00
Konstantin Belousov
2220b66fe0 Add mbuf_tstmp2timeval()
Reviewed by:	hselasky, jkim, rscheff
Sponsored by:	NVIDIA networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36870
2022-10-06 00:38:13 +03:00
Randall Stewart
e5049a1733 TCP rack does not work properly with cubic.
Right now if you use rack with cubic (the new default cc) you will have
improper results. This is because rack uses different variables than
the base stack (or bbr) and thus tcp_compute_pipe() always returns
so that cubic will choose a 30% backoff not the 50% backoff it should
when it is newreno compatibility mode. The fix is to allow a stack (rack)
to override its own compute_pipe.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36711
2022-09-26 15:12:03 -04:00
Gleb Smirnoff
493105c2a8 tcp: fix simultaneous open and refine e80062a2d43
- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
  is also necessary for a half-synchronized connection.  Fix that
  just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
  soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet.  For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete.  The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup().  Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case.  However, this doesn't yet work.

Reviewed by:		rscheff, tuexen, rrs
Differential revision:	https://reviews.freebsd.org/D36641
2022-09-21 14:02:49 -07:00
Randall Stewart
81560c5582 TCP: Rack ends up sending all that is outstanding every timeout.
In doing some testing for a different problem, I have found rack retransmitting
all outstanding data every time a timeout occurs. The outstanding is sent 1ms
apart between each packet, and then the timeout runs off again. This causes
extra retransmissions when we should be waiting for an ack after sending the
very first segment.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36494
2022-09-09 08:59:21 -04:00
Gordon Bergling
c3679af313 tcp_rack: Correct some typos in source code comments
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:58:13 +02:00
Gordon Bergling
fa52f9dc9a tcp_rack: Fix two typos in source code comments
- s/overriden/overridden/

MFC after:	3 days
2022-09-03 15:05:42 +02:00
Richard Scheffenegger
4012ef7754 tcp: Functional implementation of Accurate ECN
The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.

Reviewed By:		#transport, #manpages, rrs, pauamma
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D21011
2022-08-31 15:05:53 +02:00
Randall Stewart
62ce18fc9a tcp: Rack rwnd collapse.
Currently when the peer collapses its rwnd, we mark packets to be retransmitted
and use the must_retran flags like we do when a PMTU collapses to retransmit the
collapsed packets. However this causes a problem with some middle boxes that
play with the rwnd to control flow. As soon as the rwnd increases we start resending
which may be not even a rtt.. and in fact the peer may have gotten the packets. Which
means we gratuitously retransmit packets we should not.

The fix here is to make sure that a rack time has passed before retransmitting the packets.
This makes sure that the rwnd collapse was real and the packets do need retransmission.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35166
2022-08-23 09:17:05 -04:00
Dimitry Andric
57cdd13d07 Suppress unused variable warning in tcp_stacks's rack.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/tcp_stacks/rack.c:17405:12: error: variable 'outstanding' set but not used [-Werror,-Wunused-but-set-variable]
                    uint32_t outstanding;
                             ^

The 'outstanding' variable was used later in the rack_output() function,
but refactoring in 35c7bb340788f removed the usage. To avoid too much
code churn, mark the variable unused to supress the warning.

MFC after:	3 days
2022-08-14 21:27:35 +02:00
Dimitry Andric
e967183cb0 Fix unused variable warning in tcp_stacks's rack.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/tcp_stacks/rack.c:16148:6: error: variable 'cnt_thru' set but not used [-Werror,-Wunused-but-set-variable]
            int cnt_thru = 1;
                ^

The 'cnt_thru' variable is only used when TCP_ACCOUNTING is defined.
Ensure it is only declared and set in that case.

MFC after:	3 days
2022-08-14 21:27:34 +02:00
Dimitry Andric
7624896571 Fix unused variable warning in tcp_stacks's bbr.c
With clang 15, the following -Werror warning is produced:

sys/netinet/tcp_stacks/bbr.c:11925:11: error: variable 'rtr_cnt' set but not used [-Werror,-Wunused-but-set-variable]
        uint32_t rtr_cnt = 0;
                 ^

The 'rtr_cnt' variable was in bbr.c when it was first added, but it
appears to have been a debugging aid that has never been used, so remove
it.

MFC after:	3 days
2022-08-14 21:27:34 +02:00
Michael Tuexen
1abc27dd52 tcp rack: simplify computation of rsm start and end
While there, also fix the setting of the SYN related flag.

Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D35862
2022-08-02 12:45:56 +02:00
Michael Tuexen
5b741298b1 tcp rack: fix switching to RACK when FIN has been sent
Fix the rack sendmap entry in case a FIN has been sent when the
stack is switched over to RACK.

Reported by:		syzbot+dd55e316428419e9354b@syzkaller.appspotmail.com
Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D35731
2022-07-19 20:28:25 +02:00
Gleb Smirnoff
aeb6948d43 bbr: check proper flag for connection had been closed
An older version of D35663 slipped through final reviews.

Submitted by:	Peter Lei
Fixes:		74703901d8bbc3bc7a29df648bc3c131c87393c2
2022-07-08 22:04:44 -07:00
Gleb Smirnoff
74703901d8 tcp: use a TCP flag to check if connection has been close(2)d
The flag SS_NOFDREF is a private flag of the socket layer.  It also
is supposed to be read with SOCK_LOCK(), which we don't own here.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D35663
2022-07-04 12:40:51 -07:00
Gordon Bergling
32a01b2b86 rack: Fix a common typo in comments and a sysctl description
- s/multipler/multiplier/

MFC after:	3 days
2022-06-04 17:56:56 +02:00
Gordon Bergling
c93db89231 rack: Fix a typo in a source code comment
- s/enought/enough/

MFC after:	3 days
2022-06-04 15:32:59 +02:00
Gordon Bergling
bd9e23c0a9 rack: Fix a typo in a source code comment
- s/continous/continuous/

MFC after:	3 days
2022-06-04 13:27:29 +02:00
Gordon Bergling
21b923c330 tcp_rack: Fix two typos in sysctl descriptions
- s/higest/highest/

MFC after:	3 days
2022-06-04 11:24:18 +02:00
Gleb Smirnoff
4581cffb3d sockets: fix build, convert missed sbreserve_locked() calls
Fixes:	4328318445ae
2022-05-12 14:29:19 -07:00
Randall Stewart
04831efd9f tcp: Rack idle reduce not working.
Rack converted to micro-seconds quite some time ago, but in testing
we have found a miss in that work. The idle reduce time is still based
in ticks, so it must be converted to microseconds before any comparisons
else you will likely not do idle reduce.

Reviewed by: tuexen, thj
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35066
2022-05-10 09:46:05 -04:00
Mateusz Guzik
0fd5c29944 tcp/rack: plug a set-but-not-used var
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 09:33:35 +00:00
Randall Stewart
6edfc10ca5 tcp: adding a functionality to define "trace points" so that BB logging can be enabled at specific events.
This commit will add a new concept to rack, tracepoints. A tracepoint
is a defined point inserted into the code (3 are included in this initial patch) that
allows a developer to insert a point that might be of interest. The developer numbers
the point in the tcp_rack.h file and then can use sysctl to enable that (or all) trace
points. A limit is also given to how many BB logged connections will turn on
so that a box is not overrun by BB logging.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34898
2022-04-14 16:07:34 -04:00
Richard Scheffenegger
033718abc8 tcp: Whitespace cleanup in brr and rack
Whitespace cleanup (leading spaces to tabs)
Nicefy function definitions with indentations

No functional change

Reviewed By: #transport, thj
Sponsored by:   NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30043
2022-04-13 12:49:57 +02:00
John Baldwin
86fa80f320 rack: Remove unused variable. 2022-04-12 14:59:00 -07:00
Gordon Bergling
2dd0c2bc7f tcp_bbr(4): Fix a typo in a source code comment
- s/possiblity/possibility/

MFC after:	3 days
2022-04-09 13:26:20 +02:00
Gordon Bergling
addb2c6585 tcp_rack: Fix a typo in a source code comment
- s/possiblity/possibility/

MFC after:	3 days
2022-04-09 13:25:50 +02:00
Gordon Bergling
36814092d4 tcp_rack: Fix a few typos in sysctl descriptions and comments
- s/postion/position/
- s/postions/positions/
- s/repostion/reposition/

MFC after:	5 days
2022-04-09 09:13:10 +02:00