Commit Graph

7509 Commits

Author SHA1 Message Date
Hans Petter Selasky
c2a808b977 Fix kernel build after fcb3f813f3 .
By adding missing ifdefs for INET6 .

Differential Revision:	https://reviews.freebsd.org/D36731
Sponsored by:	NVIDIA Networking
2022-10-04 15:55:36 +02:00
Randall Stewart
cd84e78f09 tcp idle reduce does not work for a server.
TCP has an idle-reduce feature that allows a connection to reduce its
cwnd after it has been idle more than an RTT. This feature only works
for a sending side connection. It does this by at output checking the
idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout.

The problem comes if you are a web server. You get a request and
then send out all the data.. then go idle. The next time you would
send is in response to a request from the peer asking for more data.
But the thing is you updated t_rcvtime when the request came in so
you never reduce.

The fix is to do the idle reduce check also on inbound.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36721
2022-10-04 07:09:01 -04:00
Gleb Smirnoff
77198a945a tcp_timers: provide tcp_timer_drop() and tcp_timer_close()
Two functions to call tcp_drop() and tcp_close() from a callout context.
Garbage collect tcp_inpinfo_lock_del(), it has a single use now.

Differential revision:	https://reviews.freebsd.org/D36397
2022-10-03 22:21:55 -07:00
Gleb Smirnoff
775e20c159 tcp: make tcp_drop_syn_sent() static 2022-10-03 21:11:17 -07:00
Gleb Smirnoff
fcb3f813f3 netinet*: remove PRC_ constants and streamline ICMP processing
In the original design of the network stack from the protocol control
input method pr_ctlinput was used notify the protocols about two very
different kinds of events: internal system events and receival of an
ICMP messages from outside.  These events were coded with PRC_ codes.
Today these methods are removed from the protosw(9) and are isolated
to IPv4 and IPv6 stacks and are called only from icmp*_input().  The
PRC_ codes now just create a shim layer between ICMP codes and errors
or actions taken by protocols.

- Change ipproto_ctlinput_t to pass just pointer to ICMP header.  This
  allows protocols to not deduct it from the internal IP header.
- Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer.
  It has all the information needed to the protocols.  In the structure,
  change ip6c_finaldst fields to sockaddr_in6.  The reason is that
  icmp6_input() already has this address wrapped in sockaddr, and the
  protocols want this address as sockaddr.
- For UDP tunneling control input, as well as for IPSEC control input,
  change the prototypes to accept a transparent union of either ICMP
  header pointer or struct ip6ctlparam pointer.
- In icmp_input() and icmp6_input() do only validation of ICMP header and
  count bad packets.  The translation of ICMP codes to errors/actions is
  done by protocols.
- Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap,
  inet6ctlerrmap arrays.
- In protocol ctlinput methods either trust what icmp_errmap() recommend,
  or do our own logic based on the ICMP header.

Differential revision:	https://reviews.freebsd.org/D36731
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
c0fc81e913 netinet*: remove dead code from TCP, UDP, SCTP control input
Now these functions are called only from icmp*_input().  The pointer
to the ICMP data is never NULL and cmd has a limited set of values.

In the past the functions were demultiplexing control messages from
ICMP layer, as well as internally generated events.  In the latter
case the the pointer to IP would be NULL.

Differential revision:	https://reviews.freebsd.org/D36729
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
7f3b00a87a netinet: filter out invalid ICMP responses in ip_icmp()
instead of doing that in every ipproto_ctlinput_t method.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36728
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
53807a8a27 netinet*: use sparse C99 initializer for inetctlerrmap
and mark those PRC_* codes, that are used.  The rest are dead code.
This is not a functional change, but illustrative to make easier
review of following changes.
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
43d39ca7e5 netinet*: de-void control input IP protocol methods
After decoupling of protosw(9) and IP wire protocols in 78b1fc05b2 for
IPv4 we got vector ip_ctlprotox[] that is executed only and only from
icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed
only and only from icmp6_input().  This allows to use protocol specific
argument types in these methods instead of struct sockaddr and void.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36727
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
46ddeb6be8 netinet6: retire ip6protosw.h
The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d2 in 2001 and the latter survived until
today.  It has been reduced down to only one useful declaration that
moves to ip6_var.h

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36726
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
0ab46f28dc tcp: remove unnecessary include of tcp6_var.h
Reviewed by:		rscheff, melifaro
Differential revision:	https://reviews.freebsd.org/D36725
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
bb77f0c204 udp: typedef udp tunneling functions to functions, not pointers
With this change one can make a forward declaration of a function
that is of UDP tunneling type.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36724
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
24b96f35b9 netinet*: move ipproto_register() and co to ip_var.h and ip6_var.h
This is a FreeBSD KPI and belongs to private header not netinet/in.h.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36723
2022-10-03 20:53:04 -07:00
Richard Scheffenegger
4edff766cb tcp: correct simultaneous SYN ECN reaction in RFC3168 mode.
Ensure that an RFC3168 ECN reaction only occurs on non-SYN
segments.

Reviewed By:    	tuexen, #transport
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36867
2022-10-04 00:24:28 +02:00
Richard Scheffenegger
0924ae8f47 tcp: allow window scale and timestamps to be toggled individually
Simple change to allow for the individual toggling of
RFC7323 window scaling and timestamp option.

Reviewed By:    	rrs, tuexen, glebius, guest-ccui, #transport
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36863
2022-10-03 19:21:46 +02:00
Michael Tuexen
2515552e62 tcp: improve handling of SYN-ACK segments in TIMEWAIT state
Only consider segments with the SYN bit set and the ACK bit cleared
as "new connection attempts", which result in re-using a connection
being in TIMEWAIT state. This results in consistent handling of
SYN-ACK segments.

Reviewed by:		rscheff@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36864
2022-10-03 14:46:47 +02:00
Michael Tuexen
f8b5681094 tcp: honor drop_synfin sysctl variable in TIME-WAIT
Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36862
2022-10-03 12:48:30 +02:00
Randall Stewart
08af8aac2a Tcp progress timeout
Rack has had the ability to timeout connections that just sit idle automatically. This
feature of course is off by default and requires the user set it on (though the socket option
has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in
the base stack as well as rack.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36716
2022-09-27 13:38:20 -04:00
Randall Stewart
d1b07f36a2 TCP complete end status work.
The ending of a connection can tell us a lot about what happened i.e. did
it fail to setup, did it timeout, was it a normal close. Often times this is
useful information to help analyze and debug issues. Rack has had
end status for some time but the base stack as not. Lets go a ahead
and add in the missing bits to populate the end status.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36712
2022-09-26 15:20:18 -04:00
Randall Stewart
e5049a1733 TCP rack does not work properly with cubic.
Right now if you use rack with cubic (the new default cc) you will have
improper results. This is because rack uses different variables than
the base stack (or bbr) and thus tcp_compute_pipe() always returns
so that cubic will choose a 30% backoff not the 50% backoff it should
when it is newreno compatibility mode. The fix is to allow a stack (rack)
to override its own compute_pipe.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36711
2022-09-26 15:12:03 -04:00
Alexander V. Chernikov
f375bf0e6f netinet: pass cred instead of the curthread to ifaddr manipulation funcs.
Pass the credentials directly to the functions, so non-ioctl kernel
 users can also performan address manipulations.

MFC after:	2 weeks
2022-09-26 13:46:13 +00:00
Michael Tuexen
0fdc247274 tcp: make RACK loadable again using the default configuration
Without this patch, loading the RACK stack required the newreno
CC module to be compiled into the kernel. This is not the case
anymore since CUBIC is the default now.

Reviewed by:		rscheff@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36707
2022-09-26 12:30:50 +02:00
Richard Scheffenegger
a743fc8826 tcp: fix cwnd restricted SACK retransmission loop
While doing the initial SACK retransmission segment while heavily cwnd
constrained, tcp_ouput can erroneously send out the entire sendbuffer
again. This may happen after an retransmission timeout, which resets
snd_nxt to snd_una while the SACK scoreboard is still populated.

Reviewed By:		tuexen, #transport
PR:			264257
PR:			263445
PR:			260393
MFC after:		3 days
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36637
2022-09-22 13:28:43 +02:00
Michael Tuexen
5ae83e0d87 tcp: send ACKs when requested
When doing Limited Transmit send an ACK when needed by the protocol
processing (like sending ACKs with a DSACK block).

PR:			264257
PR:			263445
PR:			260393
Reviewed by:		rscheff@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36631
2022-09-22 12:12:11 +02:00
Gleb Smirnoff
9453ec6619 tcp: increment tcpstats in tcp_respond()
tcp_respond() crafts a packet and sends it directly to ip[6]output(),
bypassing tcp_output().  Hence it must increment TCP send statistics.

Reviewed by:		rscheff, tuexen, rrs (implicitly)
Differential revision:	https://reviews.freebsd.org/D36641
2022-09-21 14:03:33 -07:00
Gleb Smirnoff
493105c2a8 tcp: fix simultaneous open and refine e80062a2d4
- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
  is also necessary for a half-synchronized connection.  Fix that
  just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
  soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet.  For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete.  The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup().  Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case.  However, this doesn't yet work.

Reviewed by:		rscheff, tuexen, rrs
Differential revision:	https://reviews.freebsd.org/D36641
2022-09-21 14:02:49 -07:00
Gleb Smirnoff
0c7f3ae8c6 tcpcb: fix tabulation count in i4012ef7754c and abbreviate "packets"
This lines up comments to the rest of the file.  Abbreviation
helps to fit in to 80 char terminal.  Not a functional change.
2022-09-19 10:29:53 -07:00
Michael Tuexen
6d9e911fba tcp: fix computation of offset
Only update the offset if actually retransmitting from the
scoreboard. If not done correctly, this may result in
trying to (re)-transmit data not being being in the socket
buffe and therefore resulting in a panic.

PR:			264257
PR:			263445
PR:			260393
Reviewed by:		rscheff@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36626
2022-09-19 12:49:31 +02:00
Gleb Smirnoff
da6715bbb1 ip_output: always increase "cantfrag" stat if ip_fragment() fails
While here, join two unlikely cases into one if clause.

Submitted by:		Ivan Rozhuk <rozhuk.im gmail.com>
PR:			265718
Reviewed by:		mjg, melifaro
Differential revision:	https://reviews.freebsd.org/D36584
2022-09-14 19:22:40 -07:00
Gleb Smirnoff
15b73a2a14 ip_reass: use correct comparison in ipreass_callout()
Reported-by:	syzbot+55415dc73f9b89b87fce@syzkaller.appspotmail.com
2022-09-14 08:32:07 -07:00
Richard Scheffenegger
bb1d472d79 tcp: make CUBIC the default congestion control mechanism.
This changes the default TCP Congestion Control (CC) to CUBIC.
For small, transactional exchanges (e.g. web objects <15kB), this
will not have a material effect. However, for long duration data
transfers, CUBIC allocates a slightly higher fraction of the
available bandwidth, when competing against NewReno CC.

Reviewed By: tuexen, mav, #transport, guest-ccui, emaste
Relnotes: Yes
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36537
2022-09-13 12:09:21 +02:00
Richard Scheffenegger
ea6d0de299 tcp: Make all references to CUBIC uppercase
Consistently refer to the CUBIC congestion control
mechanism in uppercase throughout all comments.

No functional change.

Reviewed By: #transport, tuexen, mav, guest-ccui, emaste
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36547
2022-09-13 12:07:06 +02:00
Dag-Erling Smørgrav
c198adf394 siftr: spell PFIL_PASS correctly.
Sponsored by:	NetApp
Sponsored by:	Klara Inc.
Differential Revision: https://reviews.freebsd.org/D36539
2022-09-12 19:20:10 +02:00
Mateusz Guzik
1760a6950a Fixup build after recent getsock changes 2022-09-10 20:40:43 +00:00
Mateusz Guzik
3212ad15ab Add getsock
All but one consumers of getsock_cap only pass 4 arguments.
Take advantage of it.
2022-09-10 19:47:47 +00:00
Gleb Smirnoff
29b4b63c59 ip_reass: optimize ipreass_drain_vnet()
- Call ipreass_reschedule() only once per slot [1]
- Aggregate stats and update them once

Suggested by:	jtl [1]
2022-09-10 02:17:15 -07:00
Gleb Smirnoff
13018bfae8 ip_reass: make stray callout assertion more verbose
Syzcaller hits this assertion, but can't find reproducer.  I also never
seen it hit in my testing.  Try to get more information via syzcaller.
2022-09-10 02:11:39 -07:00
Gleb Smirnoff
c8bc874172 ip_reass: fixup the just added tunable
- Don't use hardcoded hash mask
- free the memory on VNET destroy

Fixes:	1494f4776a
2022-09-09 09:19:39 -07:00
Randall Stewart
81560c5582 TCP: Rack ends up sending all that is outstanding every timeout.
In doing some testing for a different problem, I have found rack retransmitting
all outstanding data every time a timeout occurs. The outstanding is sent 1ms
apart between each packet, and then the timeout runs off again. This causes
extra retransmissions when we should be waiting for an ack after sending the
very first segment.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36494
2022-09-09 08:59:21 -04:00
Gleb Smirnoff
1494f4776a ip_reass: add loader tunable to tune the reassembly hash size 2022-09-08 13:49:58 -07:00
Gleb Smirnoff
a30cb31589 ip_reass: retire ipreass_slowtimo() in favor of per-slot callout
o Retire global always running ipreass_slowtimo().
o Instead use one callout entry per hash slot.  The per-slot callout
  would be scheduled only if a slot has entries, and would be driven
  by TTL of the very last entry.
o Make net.inet.ip.fragttl read/write and document it.
o Retire IPFRAGTTL, which used to be meaningful only with PR_SLOWTIMO.

Differential revision:	https://reviews.freebsd.org/D36275
2022-09-08 13:49:58 -07:00
Mateusz Guzik
dda6376b04 net: employ newly added pfil_mbuf_{in,out} where approriate
Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36454
2022-09-08 16:21:08 +00:00
Gleb Smirnoff
e80062a2d4 tcp: avoid call to soisconnected() on transition to ESTABLISHED
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place.  After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36488
2022-09-08 09:16:04 -07:00
Mateusz Guzik
14c9a2dbfb net: retire PFIL_FWD
It is now unused and not having it allows further clean ups.

Reviewed by:	cy, glebius, kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36452
2022-09-07 10:04:31 +00:00
Mateusz Guzik
223a73a1c4 net: remove stale altq_input reference
Code setting it was removed in:
commit 325fab802e
Author: Eric van Gyzen <vangyzen@FreeBSD.org>
Date:   Tue Dec 4 23:46:43 2018 +0000

    altq: remove ALTQ3_COMPAT code

Reviewed by:	glebius, kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36471
2022-09-07 10:03:12 +00:00
Gleb Smirnoff
aa74cc6d6f divert(4): do not depend on ipfw(4)
Although originally socket was intended to use with ipfw(4) only, now
it also can be used with pf(4).  On a kernel without packet filters,
it still can be used to inject traffic.
2022-09-06 20:54:57 -07:00
Gleb Smirnoff
999c9fd733 divert(4): don't check for CSUM_SCTP without INET
This compiles, but actually is a dead code.

Noticed by:	bz
Fixes:		e72c522858
2022-09-06 20:54:57 -07:00
Gleb Smirnoff
0773b44e82 tcp: tcp6_connect() requires net epoch
PR:			262663
Reported & tested by:	dch
MFC after:		2 weeks
2022-09-05 10:19:11 -07:00
Gordon Bergling
347b1991b0 netdump(4): Correct a typo in source code comment
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:59:29 +02:00
Gordon Bergling
c3679af313 tcp_rack: Correct some typos in source code comments
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:58:13 +02:00