Commit Graph

7545 Commits

Author SHA1 Message Date
Gleb Smirnoff
f567d55f51 inpcb: don't return INP_DROPPED entries from pcb lookups
The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a
pcb from hash list and mark it as dropped.  The comment suggests that
such pcb won't be returned by lookups.  Indeed, every call to
in_pcblookup*() is accompanied by a check for INP_DROPPED.  Do what
comment suggests: never return such pcbs and remove unnecessary checks.

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D37061
2022-11-08 10:24:39 -08:00
Richard Scheffenegger
dc9daa04fb tcp: allow packets to be marked as ECT1 instead of ECT0
This adds the capability for a modular congestion control
to select which variant of ECN-capable-transport it wants to use
when sending out elegible segments. As an initial CC to utilize
this, DCTCP was selected.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24869
2022-11-08 18:36:38 +01:00
Gordon Bergling
bcf8fb7f03 tcp_bbr(4): Fix a typo in a source code comment
- s/retranmitted/retransmitted/

MFC after:	3 days
2022-11-08 14:59:56 +01:00
Michael Tuexen
126f8248cc Unbreak builds having SCTP support compiled in
Including sctp_var.h requires INET to be defined if IPv4 support
is needed.
2022-11-07 08:50:51 +01:00
Michael Tuexen
f83db6441a sctp: minor changes due to upstreaming of Glebs recent changes 2022-11-06 23:06:40 +01:00
Richard Scheffenegger
37bf391d3c tcp: make tcp_packets_this_ack() only visible in kernel scope 2022-11-06 13:51:57 +01:00
Richard Scheffenegger
004bb636ca tcp: Move sysctl OIDs related to ECN to tcp_ecn.c
Keep all ECN related code in (mostly) one place.

No functional change.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37285
2022-11-06 12:38:42 +01:00
Richard Scheffenegger
b1258b7643 tcp: add conservative d.cep accounting algorithm
Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37281
2022-11-06 12:05:22 +01:00
Richard Scheffenegger
22c81cc516 tcp: add AccECN CE packet counters to tcpinfo
Provide diagnostics information around AccECN into
the tcpinfo struct.

Event:			IETF 115 Hackathon
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37280
2022-11-06 11:56:02 +01:00
Richard Scheffenegger
3708c3d370 tcp: reserve tcp_info counters for AccECN
Marking all new fields unused (__xxx).

No functional change.

Reviewed By:		tuexen, rrs, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D37016
2022-11-04 10:18:53 +01:00
Mark Johnston
d93ec8cb13 inpcb: Allow SO_REUSEPORT_LB to be used in jails
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process.  It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.

This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups

This is a straightforward extension of the semantics used for individual
listening sockets.  One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.

Discussed with:	glebius
MFC after:	1 month
Sponsored by:	Modirum MDPay
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37029
2022-11-02 13:46:24 -04:00
Mark Johnston
a152dd8634 inpcb: Remove a PCB from its LB group upon a subsequent error
If a memory allocation failure causes bind to fail, we should take the
inpcb back out of its LB group since it's not prepared to handle
connections.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37027
2022-11-02 13:46:24 -04:00
Mark Johnston
ac1750dd14 inpcb: Remove NULL checks of credential references
Some auditing of the code shows that "cred" is never non-NULL in these
functions, either because all callers pass a non-NULL reference or
because they unconditionally dereference "cred".  So, let's simplify the
code a bit and remove NULL checks.  No functional change intended.

Reviewed by:	glebius
MFC after:	1 week
Sponsored by:	Modirum MDPay
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37025
2022-11-02 13:46:24 -04:00
Gleb Smirnoff
c348e88053 tcp: make tcp_handle_wakeup() static and robust
It is called only from tcp_input() and always has valid parameter.

Reviewed by:		rscheff, tuexen
Differential revision:	https://reviews.freebsd.org/D37115
2022-10-31 08:57:15 -07:00
Gleb Smirnoff
19acc50667 inpcb: retire suppresion of randomization of ephemeral ports
The suppresion was added in 5f311da2cc with no explanation in the
commit message of the exact problem that was fixed. In the BSDCan
2006 talk [1], slides 12 to 14, we can find that it seems that there
was some problem with the TIME_WAIT state not properly being handled
on the remote side (also FreeBSD!), and this switching off the
suppression had hidden the problem.  The rationale of the change was
that other stacks may also be buggy wrt the TIME_WAIT.

I did not find the actual problem in TIME_WAIT that the suppression
has hidden, neither a commit that would fix it.  However, since that
time we started to handle SYNs with RFC5961 instead of RFC793, see
3220a2121c.  We also now have the tcp-testsuite [2], that has full
coverage of all possible scenarios of receiving SYN in TIME_WAIT.

This effectively reverts 5f311da2cc
and 6ee79c59d2.

[1] https://www.bsdcan.org/2006/papers/ImprovingTCPIP.pdf
[2] https://github.com/freebsd-net/tcp-testsuite

Reviewed by:		rscheff
Discussed with:		rscheff, rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D37042
2022-10-31 08:57:11 -07:00
Gleb Smirnoff
65a58d6390 icmp: doesn't need tcp_var.h 2022-10-31 08:44:55 -07:00
Gleb Smirnoff
f504685a7a rack/bbr: put back assertion that connection is not in TIME-WAIT
The assertion was incorrectly removed in 0d7445193a.  The leak of
a TIME-WAIT state into tfb_do_segment_nounlock method was fixed in
31bc602ff8.  The TIME-WAIT connections are processed by the main
tcp_input() always.
2022-10-31 08:30:59 -07:00
Gleb Smirnoff
77fe40cf2f netinet*: add back necessary headers
The LINT successful build was provided by the includes that SCTP
pulled in.

Fixes:	92e190f11f
2022-10-26 08:16:44 -07:00
Gleb Smirnoff
92e190f11f netinet*: remove unneeded headers from files that just declare domains 2022-10-25 11:09:23 -07:00
Gleb Smirnoff
eda633455a tcp: remove useless today lock assertion in a middle of function
It was added back in 7cfc690440, when there was a jump label
above and tcp_input() hadn't been locked all through.
2022-10-25 11:09:22 -07:00
Randall Stewart
31bc602ff8 Rack and BBR broken with the new timewait state purge.
We recently got rid of the explicit INP_TIMEWAIT state, this has caused some
minor breakage to both rack and bbr. Basically the timewait check that was
in tcp_lro.c is now gone. This means that compressed_ack and mbuf_queued
packets will arrive at TCP without going through tcp_input_with_port(). We need
to expand the check that was stripped to look at the tcp_state (t_state) and
not "LRO" packets that are in the TCPS_TIMEWAIT state.

Reviewed by: tuexen, gliebus
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37080
2022-10-24 15:47:29 -04:00
Richard Scheffenegger
83c1ec92e4 tcp: ECN preparations for ECN++, AccECN (tcp_respond)
tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.

Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36973
2022-10-20 21:48:27 +02:00
Gleb Smirnoff
24cf7a8d62 inpcb: provide pcbinfo pointer argument to inp_apply_all()
Allows to clear inpcb layer of TCP knowledge.
2022-10-19 15:15:53 -07:00
Gleb Smirnoff
b6a816f116 inpcb: garbage collect so_sototcpcb()
It had very little use and required inpcb layer to know tcpcb.
2022-10-19 15:15:53 -07:00
Gleb Smirnoff
c37384665f tcp: style the struct tcpcb definition
- Use C99 types uintXX_t instead of u_intXX_t.
- Try to make space/tab usage a little bit more consistent.
- Shorten comments to fit into 80 chars.

Not a functional change, just making future changes easier to read.
2022-10-18 18:00:30 -07:00
Kristof Provost
a974702e27 pf: apply the network stack's ICMP rate limiting to ICMP errors sent by pf
PR:		266477
Event:		Aberdeen Hackathon 2022
Differential Revision:	https://reviews.freebsd.org/D36903
2022-10-14 10:36:16 +02:00
Gleb Smirnoff
3ba34b07a4 inpcb: provide in_pcbremhash() to reduce copy-paste 2022-10-13 09:03:38 -07:00
Michael Tuexen
dd36606b1b sctp: improve sending of ABORT packets in response to INIT-ACKs
Ensure that the initiate tag of the INIT-ACK chunk is used as the
verification tag of the packet containing the ABORT chunk.

Reported by:	Suganya Dharma
MFC after:	1 week
2022-10-13 01:05:44 +02:00
Alexander Motin
1e9482f433 inet: Simplify if_multiaddrs iteration.
Similar to 2cd6ad766e for inet6 drop ifma_restart use, creating more
problems than solving.  It is no longer needed after epoch introduction.

While there, add NULL check for ifma_ifp in igmp_change_state(), that
sometimes caused panics on interface destruction.

MFC after:	2 weeks
2022-10-08 13:10:07 -04:00
Richard Scheffenegger
6bf91573c1 tcp: update repeat <SYN,ACK> with latest IP ECN info
When multiple <SYN> segments are received, update the <SYN,ACK>
sent in response to the latest IP ECN and TCP ECN information.

On retransmitting the <SYN,ACK>, once ECN maxtries are done, not
only disable RFC3168 ECN, but AccECN also.

Reviewed By:    	tuexen, #transport
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36875
2022-10-07 01:51:19 +02:00
Richard Scheffenegger
265d0f767c tcp: honor rfc1323 sysctl on passive sessions
On passive sessions, honor the local settings disabling or
enabling window scaling and timestamp options.

Reviewed By:    	tuexen, #transport
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36874
2022-10-07 01:49:10 +02:00
Richard Scheffenegger
9c65583835 siftr: apply filter early on
Quickly check TCP port filter, before investing into
expensive operations.

No functional change.

Obtained from:  	guest-ccui
Reviewed By:    	#transport, tuexen, guest-ccui
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36842
2022-10-07 01:39:41 +02:00
Gleb Smirnoff
53af690381 tcp: remove INP_TIMEWAIT flag
Mechanically cleanup INP_TIMEWAIT from the kernel sources.  After
0d7445193a, this commit shall not cause any functional changes.

Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions.  Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.

Differential revision:	https://reviews.freebsd.org/D36400
2022-10-06 19:24:37 -07:00
Gleb Smirnoff
9c3507f919 tcp: in tcp_usr_detach() remove special handling of compressed time-wait
Differential revision:	https://reviews.freebsd.org/D36399
2022-10-06 19:24:32 -07:00
Gleb Smirnoff
0d7445193a tcp: remove tcptw, the compressed timewait state structure
The memory savings the tcptw brought back in 2003 (see 340c35de6a) no
longer justify the complexity required to maintain it.  For longer
explanation please check out the email [1].

Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot.  The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].

This change just removes tcptw and leaves INP_TIMEWAIT.  The flag will
be removed in a separate commit.  This makes it easier to review and
possibly debug the changes.

[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite

Differential revision:	https://reviews.freebsd.org/D36398
2022-10-06 19:22:23 -07:00
Konstantin Belousov
2220b66fe0 Add mbuf_tstmp2timeval()
Reviewed by:	hselasky, jkim, rscheff
Sponsored by:	NVIDIA networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36870
2022-10-06 00:38:13 +03:00
Hans Petter Selasky
c2a808b977 Fix kernel build after fcb3f813f3 .
By adding missing ifdefs for INET6 .

Differential Revision:	https://reviews.freebsd.org/D36731
Sponsored by:	NVIDIA Networking
2022-10-04 15:55:36 +02:00
Randall Stewart
cd84e78f09 tcp idle reduce does not work for a server.
TCP has an idle-reduce feature that allows a connection to reduce its
cwnd after it has been idle more than an RTT. This feature only works
for a sending side connection. It does this by at output checking the
idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout.

The problem comes if you are a web server. You get a request and
then send out all the data.. then go idle. The next time you would
send is in response to a request from the peer asking for more data.
But the thing is you updated t_rcvtime when the request came in so
you never reduce.

The fix is to do the idle reduce check also on inbound.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36721
2022-10-04 07:09:01 -04:00
Gleb Smirnoff
77198a945a tcp_timers: provide tcp_timer_drop() and tcp_timer_close()
Two functions to call tcp_drop() and tcp_close() from a callout context.
Garbage collect tcp_inpinfo_lock_del(), it has a single use now.

Differential revision:	https://reviews.freebsd.org/D36397
2022-10-03 22:21:55 -07:00
Gleb Smirnoff
775e20c159 tcp: make tcp_drop_syn_sent() static 2022-10-03 21:11:17 -07:00
Gleb Smirnoff
fcb3f813f3 netinet*: remove PRC_ constants and streamline ICMP processing
In the original design of the network stack from the protocol control
input method pr_ctlinput was used notify the protocols about two very
different kinds of events: internal system events and receival of an
ICMP messages from outside.  These events were coded with PRC_ codes.
Today these methods are removed from the protosw(9) and are isolated
to IPv4 and IPv6 stacks and are called only from icmp*_input().  The
PRC_ codes now just create a shim layer between ICMP codes and errors
or actions taken by protocols.

- Change ipproto_ctlinput_t to pass just pointer to ICMP header.  This
  allows protocols to not deduct it from the internal IP header.
- Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer.
  It has all the information needed to the protocols.  In the structure,
  change ip6c_finaldst fields to sockaddr_in6.  The reason is that
  icmp6_input() already has this address wrapped in sockaddr, and the
  protocols want this address as sockaddr.
- For UDP tunneling control input, as well as for IPSEC control input,
  change the prototypes to accept a transparent union of either ICMP
  header pointer or struct ip6ctlparam pointer.
- In icmp_input() and icmp6_input() do only validation of ICMP header and
  count bad packets.  The translation of ICMP codes to errors/actions is
  done by protocols.
- Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap,
  inet6ctlerrmap arrays.
- In protocol ctlinput methods either trust what icmp_errmap() recommend,
  or do our own logic based on the ICMP header.

Differential revision:	https://reviews.freebsd.org/D36731
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
c0fc81e913 netinet*: remove dead code from TCP, UDP, SCTP control input
Now these functions are called only from icmp*_input().  The pointer
to the ICMP data is never NULL and cmd has a limited set of values.

In the past the functions were demultiplexing control messages from
ICMP layer, as well as internally generated events.  In the latter
case the the pointer to IP would be NULL.

Differential revision:	https://reviews.freebsd.org/D36729
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
7f3b00a87a netinet: filter out invalid ICMP responses in ip_icmp()
instead of doing that in every ipproto_ctlinput_t method.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36728
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
53807a8a27 netinet*: use sparse C99 initializer for inetctlerrmap
and mark those PRC_* codes, that are used.  The rest are dead code.
This is not a functional change, but illustrative to make easier
review of following changes.
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
43d39ca7e5 netinet*: de-void control input IP protocol methods
After decoupling of protosw(9) and IP wire protocols in 78b1fc05b2 for
IPv4 we got vector ip_ctlprotox[] that is executed only and only from
icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed
only and only from icmp6_input().  This allows to use protocol specific
argument types in these methods instead of struct sockaddr and void.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36727
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
46ddeb6be8 netinet6: retire ip6protosw.h
The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d2 in 2001 and the latter survived until
today.  It has been reduced down to only one useful declaration that
moves to ip6_var.h

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36726
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
0ab46f28dc tcp: remove unnecessary include of tcp6_var.h
Reviewed by:		rscheff, melifaro
Differential revision:	https://reviews.freebsd.org/D36725
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
bb77f0c204 udp: typedef udp tunneling functions to functions, not pointers
With this change one can make a forward declaration of a function
that is of UDP tunneling type.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36724
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
24b96f35b9 netinet*: move ipproto_register() and co to ip_var.h and ip6_var.h
This is a FreeBSD KPI and belongs to private header not netinet/in.h.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36723
2022-10-03 20:53:04 -07:00
Richard Scheffenegger
4edff766cb tcp: correct simultaneous SYN ECN reaction in RFC3168 mode.
Ensure that an RFC3168 ECN reaction only occurs on non-SYN
segments.

Reviewed By:    	tuexen, #transport
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36867
2022-10-04 00:24:28 +02:00