Commit Graph

7703 Commits

Author SHA1 Message Date
Gleb Smirnoff
7f3b00a87a netinet: filter out invalid ICMP responses in ip_icmp()
instead of doing that in every ipproto_ctlinput_t method.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36728
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
53807a8a27 netinet*: use sparse C99 initializer for inetctlerrmap
and mark the PRC_* codes that are actually used.  The rest are dead code.
This is not a functional change, but an illustrative one that makes
the following changes easier to review.
2022-10-03 20:53:04 -07:00
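
The sparse-initializer idea in 53807a8a27 can be illustrated with a minimal, stand-alone sketch. The names and numeric values below (EX_PRC_*, ex_errmap, ex_map_error) are hypothetical and are not the kernel's PRC_* definitions or inetctlerrmap itself; only the C99 designated-initializer syntax is the point.

```c
#include <errno.h>

/* Hypothetical command codes; the real PRC_* values live in kernel headers. */
enum {
	EX_PRC_MSGSIZE = 5,
	EX_PRC_UNREACH_NET = 8,
	EX_PRC_UNREACH_HOST = 9,
	EX_PRC_NCMDS = 12
};

/* Only the used slots are named; every other slot is implicitly zero. */
static const int ex_errmap[EX_PRC_NCMDS] = {
	[EX_PRC_MSGSIZE] = EMSGSIZE,
	[EX_PRC_UNREACH_NET] = EHOSTUNREACH,
	[EX_PRC_UNREACH_HOST] = EHOSTUNREACH,
};

/* Map a command code to an errno value; 0 means "no error mapped". */
int
ex_map_error(int cmd)
{
	if (cmd < 0 || cmd >= EX_PRC_NCMDS)
		return (0);
	return (ex_errmap[cmd]);
}
```

Compared with a dense positional initializer, this form makes it visible at a glance which codes carry a meaning and which slots are unused.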
Gleb Smirnoff
43d39ca7e5 netinet*: de-void control input IP protocol methods
After the decoupling of protosw(9) and IP wire protocols in 78b1fc05b2,
IPv4 got the vector ip_ctlprotox[], which is executed only from
icmp_input(), and IPv6 got ip6_ctlprotox[], which is executed only from
icmp6_input().  This allows using protocol-specific argument types in
these methods instead of struct sockaddr and void.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36727
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
46ddeb6be8 netinet6: retire ip6protosw.h
The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d2 in 2001 and the latter surviving until
today.  It has been reduced down to only one useful declaration, which
moves to ip6_var.h.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36726
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
0ab46f28dc tcp: remove unnecessary include of tcp6_var.h
Reviewed by:		rscheff, melifaro
Differential revision:	https://reviews.freebsd.org/D36725
2022-10-03 20:53:04 -07:00
Gleb Smirnoff
bb77f0c204 udp: typedef udp tunneling functions to functions, not pointers
With this change one can make a forward declaration of a function
that is of UDP tunneling type.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36724
2022-10-03 20:53:04 -07:00
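
A small sketch of why bb77f0c204's choice of typedef matters. The type and function names here (ex_tun_func_t, ex_count_bytes, ex_deliver) are invented for illustration and do not reproduce the UDP tunneling KPI; the sketch only shows that a typedef naming a function type, rather than a pointer to one, can be used for a forward declaration.

```c
#include <stddef.h>

/* Hypothetical callback type: the typedef names the function type itself. */
typedef int ex_tun_func_t(const char *payload, size_t len);

/*
 * Forward declaration through the typedef.  This would not be possible
 * if ex_tun_func_t were a pointer-to-function type.
 */
static ex_tun_func_t ex_count_bytes;

/* Users store a pointer by writing the '*' explicitly. */
struct ex_tunnel {
	ex_tun_func_t *handler;
};

static struct ex_tunnel ex_default_tunnel = { .handler = ex_count_bytes };

/* The definition must match the typedef'ed signature. */
static int
ex_count_bytes(const char *payload, size_t len)
{
	(void)payload;
	return ((int)len);
}

int
ex_deliver(const char *payload, size_t len)
{
	return (ex_default_tunnel.handler(payload, len));
}
```

If the definition ever drifts from the typedef, the compiler reports a conflicting declaration, which is a useful side effect of declaring through the type.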
Gleb Smirnoff
24b96f35b9 netinet*: move ipproto_register() and co to ip_var.h and ip6_var.h
This is a FreeBSD KPI and belongs to private header not netinet/in.h.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36723
2022-10-03 20:53:04 -07:00
Richard Scheffenegger
4edff766cb tcp: correct simultaneous SYN ECN reaction in RFC3168 mode.
Ensure that an RFC3168 ECN reaction only occurs on non-SYN
segments.

Reviewed By:    	tuexen, #transport
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36867
2022-10-04 00:24:28 +02:00
Richard Scheffenegger
0924ae8f47 tcp: allow window scale and timestamps to be toggled individually
Simple change to allow for the individual toggling of
RFC7323 window scaling and timestamp option.

Reviewed By:    	rrs, tuexen, glebius, guest-ccui, #transport
Sponsored by:   	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36863
2022-10-03 19:21:46 +02:00
Michael Tuexen
2515552e62 tcp: improve handling of SYN-ACK segments in TIMEWAIT state
Only consider segments with the SYN bit set and the ACK bit cleared
as "new connection attempts", which result in re-using a connection
being in TIMEWAIT state. This results in consistent handling of
SYN-ACK segments.

Reviewed by:		rscheff@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36864
2022-10-03 14:46:47 +02:00
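
The rule described in 2515552e62 (SYN set and ACK cleared) reduces to a single mask comparison. The sketch below is an assumption-laden illustration: EX_TH_SYN and EX_TH_ACK mirror the TCP header flag bit positions, and ex_is_new_connection_attempt() is a hypothetical helper, not a function from the FreeBSD stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (SYN is 0x02, ACK is 0x10 on the wire). */
#define	EX_TH_SYN	0x02
#define	EX_TH_ACK	0x10

/*
 * Hypothetical helper: a segment counts as a new connection attempt
 * only if SYN is set and ACK is clear, so a SYN-ACK does not qualify.
 */
bool
ex_is_new_connection_attempt(uint8_t flags)
{
	return ((flags & (EX_TH_SYN | EX_TH_ACK)) == EX_TH_SYN);
}
```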
Michael Tuexen
f8b5681094 tcp: honor drop_synfin sysctl variable in TIME-WAIT
Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36862
2022-10-03 12:48:30 +02:00
Randall Stewart
08af8aac2a Tcp progress timeout
Rack has had the ability to automatically time out connections that just sit idle. This
feature is of course off by default and requires the user to turn it on (though the socket option
has been missing in tcp_usrreq.c). Let's get the progress timeout fully supported in
the base stack as well as rack.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36716
2022-09-27 13:38:20 -04:00
Randall Stewart
d1b07f36a2 TCP complete end status work.
The ending of a connection can tell us a lot about what happened, i.e. did
it fail to set up, did it time out, was it a normal close. Oftentimes this is
useful information to help analyze and debug issues. Rack has had
end status for some time but the base stack has not. Let's go ahead
and add in the missing bits to populate the end status.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36712
2022-09-26 15:20:18 -04:00
Randall Stewart
e5049a1733 TCP rack does not work properly with cubic.
Right now if you use rack with cubic (the new default cc) you will get
improper results. This is because rack uses different variables than
the base stack (or bbr), and thus tcp_compute_pipe() always returns a value
that makes cubic choose a 30% backoff, not the 50% backoff it should
when it is in newreno compatibility mode. The fix is to allow a stack (rack)
to override its own compute_pipe.

Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36711
2022-09-26 15:12:03 -04:00
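
The "let a stack override its own compute_pipe" fix in e5049a1733 follows a common optional-method pattern. The sketch below is not the actual change: the structures, fields and the simplified default formula (ex_conn, ex_stack, ex_default_pipe) are hypothetical stand-ins chosen only to show a per-stack function pointer with a fallback.

```c
#include <stddef.h>
#include <stdint.h>

struct ex_conn;

/* Hypothetical per-stack method table; NULL means "use the default". */
struct ex_stack {
	uint32_t (*compute_pipe)(const struct ex_conn *);
};

struct ex_conn {
	uint32_t snd_max;		/* highest sequence sent */
	uint32_t snd_una;		/* oldest unacknowledged sequence */
	uint32_t sacked_bytes;		/* bytes covered by SACK blocks */
	uint32_t stack_inflight;	/* private bookkeeping of an alternate stack */
	const struct ex_stack *stack;
};

/* Simplified default estimate based on the shared variables. */
static uint32_t
ex_default_pipe(const struct ex_conn *c)
{
	return (c->snd_max - c->snd_una - c->sacked_bytes);
}

/* An alternate stack answers from its own bookkeeping instead. */
static uint32_t
ex_private_pipe(const struct ex_conn *c)
{
	return (c->stack_inflight);
}

const struct ex_stack ex_base_stack = { .compute_pipe = NULL };
const struct ex_stack ex_alt_stack = { .compute_pipe = ex_private_pipe };

/* Callers such as a congestion control module go through this wrapper. */
uint32_t
ex_compute_pipe(const struct ex_conn *c)
{
	if (c->stack != NULL && c->stack->compute_pipe != NULL)
		return (c->stack->compute_pipe(c));
	return (ex_default_pipe(c));
}
```

With this shape, a congestion control module never needs to know which stack is active; it asks the wrapper and gets a value computed from whichever variables that stack maintains.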
Alexander V. Chernikov
f375bf0e6f netinet: pass cred instead of the curthread to ifaddr manipulation funcs.
Pass the credentials directly to the functions, so non-ioctl kernel
 users can also perform address manipulations.

MFC after:	2 weeks
2022-09-26 13:46:13 +00:00
Michael Tuexen
0fdc247274 tcp: make RACK loadable again using the default configuration
Without this patch, loading the RACK stack required the newreno
CC module to be compiled into the kernel. This is not the case
anymore since CUBIC is the default now.

Reviewed by:		rscheff@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36707
2022-09-26 12:30:50 +02:00
Richard Scheffenegger
a743fc8826 tcp: fix cwnd restricted SACK retransmission loop
While sending the initial SACK retransmission segment while heavily cwnd
constrained, tcp_output() can erroneously send out the entire send buffer
again. This may happen after a retransmission timeout, which resets
snd_nxt to snd_una while the SACK scoreboard is still populated.

Reviewed By:		tuexen, #transport
PR:			264257
PR:			263445
PR:			260393
MFC after:		3 days
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36637
2022-09-22 13:28:43 +02:00
Michael Tuexen
5ae83e0d87 tcp: send ACKs when requested
When doing Limited Transmit send an ACK when needed by the protocol
processing (like sending ACKs with a DSACK block).

PR:			264257
PR:			263445
PR:			260393
Reviewed by:		rscheff@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36631
2022-09-22 12:12:11 +02:00
Gleb Smirnoff
9453ec6619 tcp: increment tcpstats in tcp_respond()
tcp_respond() crafts a packet and sends it directly to ip[6]output(),
bypassing tcp_output().  Hence it must increment TCP send statistics.

Reviewed by:		rscheff, tuexen, rrs (implicitly)
Differential revision:	https://reviews.freebsd.org/D36641
2022-09-21 14:03:33 -07:00
Gleb Smirnoff
493105c2a8 tcp: fix simultaneous open and refine e80062a2d4
- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
  is also necessary for a half-synchronized connection.  Fix that
  by just setting the flag when we transition SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains under what conditions the call to
  soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.

Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet.  For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete.  The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup().  Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case.  However, this doesn't yet work.

Reviewed by:		rscheff, tuexen, rrs
Differential revision:	https://reviews.freebsd.org/D36641
2022-09-21 14:02:49 -07:00
Gleb Smirnoff
0c7f3ae8c6 tcpcb: fix tabulation count in 4012ef7754c and abbreviate "packets"
This lines up comments with the rest of the file.  The abbreviation
helps to fit into an 80-character terminal.  Not a functional change.
2022-09-19 10:29:53 -07:00
Michael Tuexen
6d9e911fba tcp: fix computation of offset
Only update the offset if actually retransmitting from the
scoreboard. If not done correctly, this may result in
trying to (re)transmit data that is not in the socket
buffer and therefore result in a panic.

PR:			264257
PR:			263445
PR:			260393
Reviewed by:		rscheff@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36626
2022-09-19 12:49:31 +02:00
Gleb Smirnoff
da6715bbb1 ip_output: always increase "cantfrag" stat if ip_fragment() fails
While here, join two unlikely cases into one if clause.

Submitted by:		Ivan Rozhuk <rozhuk.im gmail.com>
PR:			265718
Reviewed by:		mjg, melifaro
Differential revision:	https://reviews.freebsd.org/D36584
2022-09-14 19:22:40 -07:00
Gleb Smirnoff
15b73a2a14 ip_reass: use correct comparison in ipreass_callout()
Reported-by:	syzbot+55415dc73f9b89b87fce@syzkaller.appspotmail.com
2022-09-14 08:32:07 -07:00
Richard Scheffenegger
bb1d472d79 tcp: make CUBIC the default congestion control mechanism.
This changes the default TCP Congestion Control (CC) to CUBIC.
For small, transactional exchanges (e.g. web objects <15kB), this
will not have a material effect. However, for long duration data
transfers, CUBIC allocates a slightly higher fraction of the
available bandwidth, when competing against NewReno CC.

Reviewed By: tuexen, mav, #transport, guest-ccui, emaste
Relnotes: Yes
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36537
2022-09-13 12:09:21 +02:00
Richard Scheffenegger
ea6d0de299 tcp: Make all references to CUBIC uppercase
Consistently refer to the CUBIC congestion control
mechanism in uppercase throughout all comments.

No functional change.

Reviewed By: #transport, tuexen, mav, guest-ccui, emaste
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36547
2022-09-13 12:07:06 +02:00
Dag-Erling Smørgrav
c198adf394 siftr: spell PFIL_PASS correctly.
Sponsored by:	NetApp
Sponsored by:	Klara Inc.
Differential Revision: https://reviews.freebsd.org/D36539
2022-09-12 19:20:10 +02:00
Mateusz Guzik
1760a6950a Fixup build after recent getsock changes
2022-09-10 20:40:43 +00:00
Mateusz Guzik
3212ad15ab Add getsock
All but one of the consumers of getsock_cap pass only 4 arguments.
Take advantage of it.
2022-09-10 19:47:47 +00:00
Gleb Smirnoff
29b4b63c59 ip_reass: optimize ipreass_drain_vnet()
- Call ipreass_reschedule() only once per slot [1]
- Aggregate stats and update them once

Suggested by:	jtl [1]
2022-09-10 02:17:15 -07:00
Gleb Smirnoff
13018bfae8 ip_reass: make stray callout assertion more verbose
Syzkaller hits this assertion, but can't find a reproducer.  I have also never
seen it hit in my testing.  Try to get more information via syzkaller.
2022-09-10 02:11:39 -07:00
Gleb Smirnoff
c8bc874172 ip_reass: fixup the just added tunable
- Don't use hardcoded hash mask
- free the memory on VNET destroy

Fixes:	1494f4776a
2022-09-09 09:19:39 -07:00
Randall Stewart
81560c5582 TCP: Rack ends up sending all that is outstanding every timeout.
In doing some testing for a different problem, I have found rack retransmitting
all outstanding data every time a timeout occurs. The outstanding is sent 1ms
apart between each packet, and then the timeout runs off again. This causes
extra retransmissions when we should be waiting for an ack after sending the
very first segment.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36494
2022-09-09 08:59:21 -04:00
Gleb Smirnoff
1494f4776a ip_reass: add loader tunable to tune the reassembly hash size
2022-09-08 13:49:58 -07:00
Gleb Smirnoff
a30cb31589 ip_reass: retire ipreass_slowtimo() in favor of per-slot callout
o Retire global always running ipreass_slowtimo().
o Instead use one callout entry per hash slot.  The per-slot callout
  would be scheduled only if a slot has entries, and would be driven
  by TTL of the very last entry.
o Make net.inet.ip.fragttl read/write and document it.
o Retire IPFRAGTTL, which used to be meaningful only with PR_SLOWTIMO.

Differential revision:	https://reviews.freebsd.org/D36275
2022-09-08 13:49:58 -07:00
Mateusz Guzik
dda6376b04 net: employ newly added pfil_mbuf_{in,out} where appropriate
Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36454
2022-09-08 16:21:08 +00:00
Gleb Smirnoff
e80062a2d4 tcp: avoid call to soisconnected() on transition to ESTABLISHED
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place.  After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36488
2022-09-08 09:16:04 -07:00
Mateusz Guzik
14c9a2dbfb net: retire PFIL_FWD
It is now unused and not having it allows further clean ups.

Reviewed by:	cy, glebius, kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36452
2022-09-07 10:04:31 +00:00
Mateusz Guzik
223a73a1c4 net: remove stale altq_input reference
Code setting it was removed in:
commit 325fab802e
Author: Eric van Gyzen <vangyzen@FreeBSD.org>
Date:   Tue Dec 4 23:46:43 2018 +0000

    altq: remove ALTQ3_COMPAT code

Reviewed by:	glebius, kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D36471
2022-09-07 10:03:12 +00:00
Gleb Smirnoff
aa74cc6d6f divert(4): do not depend on ipfw(4)
Although the socket was originally intended for use with ipfw(4) only, it
can now also be used with pf(4).  On a kernel without packet filters,
it can still be used to inject traffic.
2022-09-06 20:54:57 -07:00
Gleb Smirnoff
999c9fd733 divert(4): don't check for CSUM_SCTP without INET
This compiles, but is actually dead code.

Noticed by:	bz
Fixes:		e72c522858
2022-09-06 20:54:57 -07:00
Gleb Smirnoff
0773b44e82 tcp: tcp6_connect() requires net epoch
PR:			262663
Reported & tested by:	dch
MFC after:		2 weeks
2022-09-05 10:19:11 -07:00
Gordon Bergling
347b1991b0 netdump(4): Correct a typo in source code comment
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:59:29 +02:00
Gordon Bergling
c3679af313 tcp_rack: Correct some typos in source code comments
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:58:13 +02:00
Gordon Bergling
893f36b7f1 netinet: Correct a typo in source code comment
- s/occured/occurred/

MFC after:	3 days
2022-09-04 12:57:12 +02:00
Gordon Bergling
d07a501876 tcp_hpts: Correct some typos in source code comments
- s/occured/occurred/
- s/the the/the/

MFC after:	3 days
2022-09-04 12:47:49 +02:00
Gordon Bergling
fa52f9dc9a tcp_rack: Fix two typos in source code comments
- s/overriden/overridden/

MFC after:	3 days
2022-09-03 15:05:42 +02:00
Gleb Smirnoff
74ed2e8ab2 raw ip: fix regression with multicast and RSVP
With 61f7427f02 raw sockets protosw has wildcard pr_protocol.  Protocol
of a specific pcb is stored in inp_ip_p.

Reviewed by:		karels
Reported by:		karels
Differential revision:	https://reviews.freebsd.org/D36429
Fixes:			61f7427f02
2022-09-02 12:17:09 -07:00
Richard Scheffenegger
4012ef7754 tcp: Functional implementation of Accurate ECN
The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.

Reviewed By:		#transport, #manpages, rrs, pauamma
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D21011
2022-08-31 15:05:53 +02:00
Richard Scheffenegger
c21b7b55be tcp: finish SACK loss recovery on sudden lack of SACK blocks
While a receiver should continue sending SACK blocks for the
duration of a SACK loss recovery, if for some reason the
TCP options no longer contain these SACK blocks, but we
already started maintaining the Scoreboard, keep on handling
incoming ACKs (without SACK) as belonging to the SACK recovery.

Reported by:		thj
Reviewed by:		tuexen, #transport
MFC after:		2 weeks
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D36046
2022-08-31 14:49:47 +02:00
Gleb Smirnoff
e72c522858 divert(4): make it compilable and working without INET
Differential revision:	https://reviews.freebsd.org/D36383
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
f1fb051716 divert(4): maintain own cb database and stop using inpcb KPI
Here are the cons of using inpcb for divert:
- divert(4) uses only 16 bits (local port) out of struct inpcb,
  which is 424 bytes today.
- The inpcb KPI isn't able to provide hashing for divert(4),
  thus it uses global inpcb list for lookups.
- divert(4) uses INET-specific part of the KPI, making INET
  a requirement for IPDIVERT.

Maintain our own very simple hash lookup database instead.  It
has mutex protection for write and epoch protection for lookups.
Since now so->so_pcb no longer points to struct inpcb, don't
initialize protosw methods to methods that belong to PF_INET.
Also, drop support for setting options on a divert socket.  My
review of software in base and ports confirms that this has no
use and is unlikely to have worked before.

Differential revision:	https://reviews.freebsd.org/D36382
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
2b1c72171e divert(4): provide statistics
Instead of incrementing pretty random counters in the IP statistics,
create divert socket statistics structure.  Export via netstat(1).

Differential revision:	https://reviews.freebsd.org/D36381
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
61f7427f02 protosw: cleanup protocols that existed merely to provide pr_input
Since 4.4BSD the protosw was used to implement socket types created
by the socket(2) syscall and at the same time to demultiplex incoming IPv4
datagrams (later copied to IPv6).  This story ended with 78b1fc05b2.

The entries in inetsw (e.g. IPPROTO_ICMP) that were added to catch
packets in ip_input() would also be returned by pffindproto()
if the user called socket(AF_INET, SOCK_RAW, IPPROTO_ICMP).  Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq,
differing only in the value of pr_protocol.

With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw.  Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw.  And this protosw has its pr_protocol
set to 0, allowing the socket to be marked with any protocol.

For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
  of protosw value, which is now always 0.

Differential revision:	https://reviews.freebsd.org/D36380
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
8624f4347e divert: declare PF_DIVERT domain and stop abusing PF_INET
The divert(4) is not a protocol of IPv4.  It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back.  It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols.  The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char.  I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.

Moving divert(4) out of inetsw accomplished two important things:

1) IPDIVERT is getting much closer to being independent of INET.
   This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
   Domain/proto selection code won't need a hack for SOCK_RAW and
   multiple entries in inetsw implementing different flavors of
   raw socket can merge into one without requirement of raw IPv4
   being the last member of dom_protosw.

Differential revision:	https://reviews.freebsd.org/D36379
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
c00605751e tcp: remove dead code left over from T/TCP
that doesn't have any value today.
2022-08-29 19:30:12 -07:00
Gleb Smirnoff
8fc8063849 divert: merge div_output() into div_send()
No functional change intended.
2022-08-29 19:15:01 -07:00
Gleb Smirnoff
c414347bc5 mbufs: isolate max_linkhdr and max_protohdr handling in the mbuf code
o Statically initialize max_linkhdr to default value without relying
  on domain(9) code doing that.
o Statically initialize max_protohdr to a sane value, without relying
  on TCP being always compiled in.
o Retire max_datalen. Set, but not used.
o Don't make the domain(9) system responsible for validating these
  values and updating max_hdr.  Instead provide KPI max_linkhdr_grow()
  and max_protohdr_grow().
o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from
  TCP.  Those are the only protocols today that may want to grow.

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D36376
2022-08-29 19:14:25 -07:00
Alexander V. Chernikov
7b3440fc30 Revert "routing: install prefix and loopback routes using new nhop-based KPI."
Temporarily revert the commit to unblock testing.

This reverts commit a1b59379db.
2022-08-29 16:20:42 +00:00
Alexander V. Chernikov
a1b59379db routing: install prefix and loopback routes using new nhop-based KPI.
Construct the desired nexthops directly instead of using the
 "translation" layer in form of filling rt_addrinfo data.
Simplify V_rt_add_addr_allfibs handling by using recently-added
 rib_copy_route() to propagate the routes to the non-primary address
 fibs.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D36166
2022-08-29 10:07:58 +00:00
Michael Tuexen
c624b9a549 tcp: fix stats counter for SYN_RCVD state when TCP-FO is used
Reviewed by:		glebius
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36384
2022-08-28 18:45:59 +02:00
Randall Stewart
62ce18fc9a tcp: Rack rwnd collapse.
Currently when the peer collapses its rwnd, we mark packets to be retransmitted
and use the must_retran flags, like we do when a PMTU collapses, to retransmit the
collapsed packets. However, this causes a problem with some middleboxes that
play with the rwnd to control flow. As soon as the rwnd increases we start resending,
which may happen after less than an RTT, and in fact the peer may have gotten the packets. This
means we gratuitously retransmit packets we should not.

The fix here is to make sure that a rack time has passed before retransmitting the packets.
This makes sure that the rwnd collapse was real and the packets do need retransmission.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35166
2022-08-23 09:17:05 -04:00
Randall Stewart
4e0ce82b53 TCP Lro has a loss of timestamp precision and reorders packets.
A while back Hans optimized the LRO code. This is great but one
optimization he did degrades the timestamp precision so that
all flushed LRO entries end up with the same LRO timestamp
if there is not a hardware timestamp. The intent of the LRO timestamp
is to get as close to the time that the packet arrived as possible. Without
the LRO queuing this works out fine since a binuptime is taken and then
the rx_common code is called. But when you go through the queue path
you end up *not* updating the M_LRO_TSTMP fields.

Another issue in the LRO code is that several places cause packet reordering. In
general TCP can handle reordering, but it can cause extra unneeded retransmissions
as well as other oddities. We will fix all of the reordering problems.

Let's fix this so that we restore the precision of the timestamp.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36043
2022-08-23 09:12:31 -04:00
Gleb Smirnoff
6498153665 ip_reass: don't drain all vnets on a vnet destroy
2022-08-21 07:44:58 -07:00
Gleb Smirnoff
8338690a0a ip_reass: provide sysctl MIB returning IP fragment TTL
For now it is read-only, but eventually the cycle that goes over
all fragments should be refactored and this MIB should also become
read/write.

This MIB will allow SNMP daemons to implement the MIB-II ipReasmTimeout MIB
in a straightforward way.  Right now net-snmp compilation is broken by 1922eb3e9c.
The base system bsnmpd is not broken just because it ignored PR_SLOWTIMO,
and thus always returned an incorrectly doubled value for ipReasmTimeout.
2022-08-20 13:39:12 -07:00
Gleb Smirnoff
e7d02be19d protosw: refactor protosw and domain static declaration and load
o Assert that every protosw has pr_attach.  Now this structure is
  only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw.  This was suggested
  in 1996 by wollman@ (see 7b187005d1), and later reiterated
  in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
  For most protocols these pointers are initialized statically.
  Those domains that may have loadable protocols have spacers. IPv4
  and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
  entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36232
2022-08-17 11:50:32 -07:00
Gleb Smirnoff
d9f6ac882a protosw: retire PRU_ flags and their char names
For many years only TCP debugging used them, but relatively recently
TCP DTrace probes also started to use them.  Move their declarations
into tcp_debug.h, but start including tcp_debug.h unconditionally,
so that compilation with DTrace and without TCPDEBUG is possible.
2022-08-17 11:50:32 -07:00
Gleb Smirnoff
aea0cd0432 ip_reass: separate ipreass_init() into global and VIMAGE parts
Should have been done in 89128ff3e4.
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
a6b982e265 tcp: move tcp_drain() verbatim before tcp_init()
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
81a34d374e protosw: retire pr_drain and use EVENTHANDLER(9) directly
The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of the UMA zones of the mbuf allocator is exhausted.
This change moves 2) into a new event handler, but all affected network
subsystems are modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.

There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the sole purpose of registering its
pr_drain method.

Reviewed by:		tuexen, melifaro
Differential revision:	https://reviews.freebsd.org/D36164
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
b730de8bad mld6: use callout(9) directly instead of pr_slowtimo, pr_fasttimo
While here remove recursive network epoch entry in mld_fasttimo_vnet(),
as this function is already in epoch.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36161
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
0ce4d7ec96 igmp: use callout(9) directly instead of pr_slowtimo, pr_fasttimo
Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36160
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
6c452841ef tcp: use callout(9) directly instead of pr_slowtimo
Modern TCP stacks use multiple callouts per tcpcb, and a global
callout is an ancient artifact.  However, it is still used to garbage
collect compressed timewait entries.

Reviewed by:		melifaro, tuexen
Differential revision:	https://reviews.freebsd.org/D36159
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
160f01f09f ip_reass: use callout(9) directly instead of pr_slowtimo
Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36236
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
78b1fc05b2 protosw: separate pr_input and pr_ctlinput out of protosw
The protosw KPI historically has implemented two quite orthogonal
things: protocols that implement a certain kind of socket, and
protocols that are IPv4/IPv6 protocols.  These two things do not
have a one-to-one correspondence. The pr_input and pr_ctlinput methods
were utilized only in IP protocols.  This strange duality required
IP protocols that don't have a socket to declare a protosw, e.g.
carp(4).  On the other hand, developers of socket protocols thought
that they always needed to define pr_input/pr_ctlinput, which led to
strange dead code, e.g. div_input() or sdp_ctlinput().

With this change pr_input and pr_ctlinput as part of protosw disappear
and IPv4/IPv6 get their private single level protocol switch table
ip_protox[] and ip6_protox[] respectively, pointing at array of
ipproto_input_t functions.  The pr_ctlinput that was used for
control input coming from the network (ICMP, ICMPv6) is now represented
by ip_ctlprotox[] and ip6_ctlprotox[].

ipproto_register() becomes the only official way to register in the
table.  Those protocols that were always static, and that nobody is likely
to be interested in making loadable, are now registered by ip_init() and
ip6_init().  An IP protocol that considers itself unloadable shall
register itself within its own private SYSINIT().

Reviewed by:		tuexen, melifaro
Differential revision:	https://reviews.freebsd.org/D36157
2022-08-17 11:50:31 -07:00
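
The single-level protocol switch table described in 78b1fc05b2 can be pictured with the following user-space sketch. Everything in it is an assumption made for illustration: ex_input_t, ex_protox[], ex_proto_register() and the chosen return values are hypothetical and do not reproduce the signatures of ipproto_input_t or ipproto_register().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input method type, declared as a function type. */
typedef int ex_input_t(const void *pkt, size_t len);

/* One slot per 8-bit protocol number; NULL means "not registered". */
static ex_input_t *ex_protox[256];

/* Register a handler for a protocol number; refuse to overwrite. */
int
ex_proto_register(uint8_t proto, ex_input_t *input)
{
	if (input == NULL)
		return (EINVAL);
	if (ex_protox[proto] != NULL)
		return (EEXIST);
	ex_protox[proto] = input;
	return (0);
}

/* Demultiplex a packet by protocol number; -1 if nobody claimed it. */
int
ex_proto_input(uint8_t proto, const void *pkt, size_t len)
{
	if (ex_protox[proto] == NULL)
		return (-1);
	return (ex_protox[proto](pkt, len));
}

/* Sample handler used only to exercise the table. */
ex_input_t ex_echo_len;

int
ex_echo_len(const void *pkt, size_t len)
{
	(void)pkt;
	return ((int)len);
}
```

The table knows nothing about sockets, which is the separation the commit message argues for: demultiplexing by protocol number and implementing a socket type are independent concerns.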
Gleb Smirnoff
489482e276 ipsec: isolate knowledge about protocols that are last header
Retire PR_LASTHDR protosw flag.

Reviewed by:		ae
Differential revision:	https://reviews.freebsd.org/D36155
2022-08-17 08:24:28 -07:00
Gleb Smirnoff
c93db4abf4 udp: call UDP methods from UDP over IPv6 directly
Both UDP and UDP Lite use same methods on sockets.  Both UDP over IPv4
and over IPv6 use same methods.  Don't pretend that methods can switch
and remove this unneeded complexity.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36154
2022-08-16 12:40:36 -07:00
Dimitry Andric
b33bfe6e15 Fix unused variable warnings in tcp_hpts.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/tcp_hpts.c:1114:10: error: variable 'paced_cnt' set but not used [-Werror,-Wunused-but-set-variable]
            int32_t paced_cnt = 0;
                    ^
    sys/netinet/tcp_hpts.c:1112:11: error: variable 'total_slots_processed' set but not used [-Werror,-Wunused-but-set-variable]
            uint64_t total_slots_processed = 0;
                     ^

The 'paced_cnt' variable was in tcp_hpts.c when it was first added, and
the 'total_slots_processed' variable was added in d7955cc0ff, but
both appear to have been debugging aids that have never been used, so
remove them.

MFC after:	3 days
2022-08-15 20:48:34 +02:00
Dimitry Andric
db6b32867d Adjust function definition in tcp_hpts.c to avoid clang 15 warning
With clang 15, the following -Werror warning is produced:

    sys/netinet/tcp_hpts.c:1594:23: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    tcp_choose_hpts_to_run()
                          ^
                           void

This is because tcp_choose_hpts_to_run() is declared with a (void)
argument list, but defined with an empty argument list. Make the
definition match the declaration.
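
As an illustration only (the name choose_slot() is hypothetical and not
taken from tcp_hpts.c), the mismatch and its fix look roughly like this:

```c
/* Declaration with an explicit (void) parameter list: a real prototype. */
int choose_slot(void);

/*
 * Before C23, writing the definition as "choose_slot()" defines a function
 * without a prototype, which clang 15 flags under -Wstrict-prototypes.
 * Spelling out (void) makes the definition match the declaration above.
 */
int
choose_slot(void)
{
	return (0);
}
```

The behavior of callers does not change; only the definition's parameter
list is spelled out.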

MFC after:	3 days
2022-08-15 20:48:33 +02:00
Dimitry Andric
57cdd13d07 Suppress unused variable warning in tcp_stacks's rack.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/tcp_stacks/rack.c:17405:12: error: variable 'outstanding' set but not used [-Werror,-Wunused-but-set-variable]
                    uint32_t outstanding;
                             ^

The 'outstanding' variable was used later in the rack_output() function,
but refactoring in 35c7bb3407 removed the usage. To avoid too much
code churn, mark the variable unused to suppress the warning.

MFC after:	3 days
2022-08-14 21:27:35 +02:00
Dimitry Andric
e967183cb0 Fix unused variable warning in tcp_stacks's rack.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/tcp_stacks/rack.c:16148:6: error: variable 'cnt_thru' set but not used [-Werror,-Wunused-but-set-variable]
            int cnt_thru = 1;
                ^

The 'cnt_thru' variable is only used when TCP_ACCOUNTING is defined.
Ensure it is only declared and set in that case.
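
A minimal sketch of the pattern, assuming a hypothetical function
sum_bytes() and a hypothetical counter acct_loops (neither is from
rack.c); only the TCP_ACCOUNTING option name comes from this commit:

```c
#include <stdint.h>

/* Uncomment to compile the accounting code in. */
/* #define TCP_ACCOUNTING */

#ifdef TCP_ACCOUNTING
static uint64_t acct_loops;	/* hypothetical accounting counter */
#endif

/* Sum a buffer; the loop counter exists only when accounting is built. */
uint32_t
sum_bytes(const uint8_t *buf, uint32_t len)
{
	uint32_t sum = 0;
#ifdef TCP_ACCOUNTING
	int cnt_thru = 1;
#endif

	for (uint32_t i = 0; i < len; i++) {
		sum += buf[i];
#ifdef TCP_ACCOUNTING
		cnt_thru++;
#endif
	}
#ifdef TCP_ACCOUNTING
	acct_loops += cnt_thru;
#endif
	return (sum);
}
```

With the option undefined the variable is never declared, so there is
nothing left for -Wunused-but-set-variable to report.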

MFC after:	3 days
2022-08-14 21:27:34 +02:00
Dimitry Andric
7624896571 Fix unused variable warning in tcp_stacks's bbr.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/tcp_stacks/bbr.c:11925:11: error: variable 'rtr_cnt' set but not used [-Werror,-Wunused-but-set-variable]
            uint32_t rtr_cnt = 0;
                     ^

The 'rtr_cnt' variable was in bbr.c when it was first added, but it
appears to have been a debugging aid that has never been used, so remove
it.

MFC after:	3 days
2022-08-14 21:27:34 +02:00
Andrew Gallatin
2c6ff1d632 LRO: fix BPF filters for lagg in the hpts path
When in the hpts path, we need to handle BPF filters since aggregated
packets do not pass up the stack in the normal way. This is already
done for most interfaces, but lagg needs special handling. This is
because packets received via a lagg are passed up the stack with
the leaf interface's ifp stored in m_pkthdr.rcvif.

To handle lagg packets, we must identify that the passed rcvif is
currently a lagg port by checking for IFT_IEEE8023ADLAG or
IFT_INFINIBANDLAG (since lagg changes the lagg port's type to that
when an interface becomes a lagg member). Then we need to find the
lagg's ifp, and handle any BPF listeners on the lagg.

Note: It is possible to have multiple BPF filters, one on a member
port and one on the lagg itself. That is why we have to have 2
checks and 2 ETHER_BPF_MTAPs.
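
The two checks can be sketched in userland with mock types.  Everything
below is an assumption for illustration: the struct layout, the
lagg_parent pointer, the tap() helper, and the enum values are made up
(the real type codes live in net/if_types.h), and the kernel code uses
ETHER_BPF_MTAP on real ifnets instead.

```c
#include <stddef.h>

/* Placeholder values; the real codes are defined in net/if_types.h. */
enum { IFT_ETHER = 1, IFT_IEEE8023ADLAG, IFT_INFINIBANDLAG };

struct ifnet {
	int if_type;
	struct ifnet *lagg_parent;	/* hypothetical: lagg this port is in */
	int bpf_listeners;		/* hypothetical: attached BPF listeners */
	int taps;			/* hypothetical: packets handed to BPF */
};

/* Hand the packet to BPF only if somebody is listening on this ifp. */
static void
tap(struct ifnet *ifp)
{
	if (ifp->bpf_listeners > 0)
		ifp->taps++;
}

/* Tap the receiving port, then the lagg above it if the port is a member. */
void
tap_rcvif(struct ifnet *rcvif)
{
	tap(rcvif);
	if ((rcvif->if_type == IFT_IEEE8023ADLAG ||
	    rcvif->if_type == IFT_INFINIBANDLAG) &&
	    rcvif->lagg_parent != NULL)
		tap(rcvif->lagg_parent);
}
```

A listener on the member port and a listener on the lagg each see the
packet once, which is why both checks are needed.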

Reviewed by: jhb, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D36136
2022-08-13 17:33:36 -04:00
Gleb Smirnoff
f277746e13 protosw: change prototype for pr_control
For some reason protosw.h is used during world compilation and userland
is not aware of caddr_t, a relic from the first version of C.  Broken
buildworld is good reason to get rid of yet another caddr_t in kernel.

Fixes:	886fc1e804
2022-08-12 12:08:18 -07:00
Gleb Smirnoff
948f31d7b0 netinet: do not broadcast PRC_REDIRECT_HOST on ICMP redirect
This is an expensive and useless call.  It has been useless since Alexander
melifaro@ moved the forwarding table to nexthops with passive invalidation.
What happens now is that the cached route in an inpcb would get invalidated
on the next ip_output().

These were the last users of pfctlinput(), so garbage collect it.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36156
2022-08-12 08:31:29 -07:00
Gleb Smirnoff
3d2041c035 raw ip: merge rip_output() into rip_send()
While here, address the unlocked 'dst' read.  Solve that by storing
a pointer either to the inpcb or to the sockaddr.  If we end up
copying address out of the inpcb, that would be done under the read
lock section.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36127
2022-08-11 09:19:37 -07:00
Gleb Smirnoff
8c77967ecc protosw: retire pr_output method
The only place to execute this method was raw_usend(). Only those
protocols that used raw socket were able to actually enter that method.
All pr_output assignments being deleted by this commit were dead code
for many years.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36126
2022-08-11 09:19:37 -07:00
Gleb Smirnoff
b8103ca76d netinet: get interface event notifications directly via EVENTHANDLER(9)
The old mechanism of getting them via domains/protocols control input
is a relic from the previous century, when nothing like EVENTHANDLER(9)
existed yet.  Retire PRC_IFDOWN/PRC_IFUP as netinet was the only one
to use them.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36116
2022-08-11 09:19:36 -07:00
Gleb Smirnoff
07285bb4c2 tcp: utilize new solisten_clone() and solisten_enqueue()
This streamlines cloning of a socket from a listener.  Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.

Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d81).  Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb.  And then, in a
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED, call
soisconnected().  This call would lock the listening socket once again
with a LOR protection sequence and then we would relocate the socket onto
the complete queue and only now it is ready for accept(2).

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36064
2022-08-10 11:09:34 -07:00
Gleb Smirnoff
c7a62c925c inpcb: gather v4/v6 handling code into in_pcballoc() from protocols
Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36062
2022-08-10 11:09:34 -07:00
Gleb Smirnoff
d88eb4654f tcp: address a wire level race with 2 ACKs at the end of TCP handshake
Imagine we are in SYN-RCVD state and two ACKs arrive at the same time,
both valid, e.g. coming from the same host and with valid sequence.

First packet would locate the listening socket in the inpcb database,
write-lock it and start expanding the syncache entry into a socket.
Meanwhile second packet would wait on the write lock of the listening
socket.  First packet will create a new ESTABLISHED socket, free the
syncache entry and unlock the listening socket.  Second packet would
call into syncache_expand(), but this time it will fail as there
is no syncache entry.  Second packet would generate RST, effectively
resetting the remote connection.

It seems to me, that it is impossible to solve this problem with
just rearranging locks, as the race happens at a wire level.

To solve the problem, for an ACK packet arrived on a listening socket,
that failed syncache lookup, perform a second non-wildcard lookup right
away.  That lookup may find the new born socket.  Otherwise, we indeed
send RST.
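
The decision described above can be sketched as follows.  This is a
simplified model, not the code from tcp_input(): struct conn, the
verdict enum and ack_on_listener() are hypothetical names, and the
second lookup is passed in as an argument rather than performed lazily.

```c
#include <stddef.h>

struct conn { int id; };	/* hypothetical stand-in for a socket */

enum verdict { TO_NEW_SOCKET, TO_EXISTING_SOCKET, SEND_RST };

/*
 * expanded: socket created from the syncache entry, or NULL if the
 *           syncache lookup failed.
 * exact:    result of the second, non-wildcard lookup, or NULL if it
 *           found nothing.
 */
enum verdict
ack_on_listener(const struct conn *expanded, const struct conn *exact)
{
	if (expanded != NULL)
		return (TO_NEW_SOCKET);
	if (exact != NULL)
		return (TO_EXISTING_SOCKET);	/* first ACK won the race */
	return (SEND_RST);
}
```

Only when both lookups come back empty does the packet still result in
an RST.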

Tested by:		kp
Reviewed by:		tuexen, rrs
PR:			265154
Differential revision:	https://reviews.freebsd.org/D36066
2022-08-10 07:32:37 -07:00
Michael Tuexen
bd30a1216e tcp: improve BBLog for output events when using the FreeBSD stack
Put the return value of ip_output()/ip6_output in the output event
instead of adding another one in case of an error. This improves
consistency with other similar places.

Reviewed by:		rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D36085
2022-08-08 13:07:10 +02:00
Michael Tuexen
bb995f2ef0 sctp: improve handling of send() calls with no user data
In particular, don't report EAGAIN on send() calls with no user
data, which might trigger a KASSERT in async IO.

Reported by:	syzbot+3b4dc5d1d63e9bd01eda@syzkaller.appspotmail.com
MFC after:	1 week
2022-08-08 12:53:42 +02:00
Gleb Smirnoff
e7231d07a4 tcp_input: update comment to match reality.
2022-08-07 11:18:30 -07:00
Michael Tuexen
979bc32c7c sctp: tweak panic message
MFC after:	1 week
2022-08-03 17:28:15 +02:00
Mike Karels
637f317c6d IPv6: fix problem with duplicate port assignment with v4-mapped addrs
In in_pcb_lport_dest(), if an IPv6 socket does not match any other IPv6
socket using in6_pcblookup_local(), and if the socket can also connect
to IPv4 (the INP_IPV4 vflag is set), check for IPv4 matches as well.
Otherwise, we can allocate a port that is used by an IPv4 socket
(possibly one created from IPv6 via the same procedure), and then
connect() can fail with EADDRINUSE, when it could have succeeded if
the bound port was not in use.

PR:		265064
Submitted by:	firk at cantconnect.ru (with modifications)
Reviewed by:	bz, melifaro
Differential Revision: https://reviews.freebsd.org/D36012
2022-08-02 09:49:46 -05:00
Michael Tuexen
1abc27dd52 tcp rack: simplify computation of rsm start and end
While there, also fix the setting of the SYN related flag.

Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D35862
2022-08-02 12:45:56 +02:00
Alexander V. Chernikov
ae6bfd12c8 routing: refactor private KPI
* Make nhgrp_get_nhops() return const struct weightened_nhop to
 indicate that the list is immutable
* Make nhgrp_get_group() return the actual group, instead of
 group+weight.

MFC after:	2 weeks
2022-08-01 10:02:12 +00:00
Alexander V. Chernikov
800c68469b routing: add nhop(9) kpi.
Differential Revision: https://reviews.freebsd.org/D35985
MFC after:	1 month
2022-08-01 08:52:26 +00:00
Dimitry Andric
24e13a49fa Adjust sctp_drain() definition to avoid clang 15 warning
With clang 15, the following -Werror warning is produced:

    sys/netinet/sctp_pcb.c:6946:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    sctp_drain()
              ^
               void

This is because sctp_drain() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.

MFC after:	3 days
2022-07-26 19:59:55 +02:00
Dimitry Andric
2057985649 Adjust sctp_init_sysctls() definition to avoid clang 15 warning
With clang 15, the following -Werror warning is produced:

    sys/netinet/sctp_sysctl.c:55:18: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    sctp_init_sysctls()
                     ^
                      void

This is because sctp_init_sysctls() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.

MFC after:	3 days
2022-07-25 22:08:35 +02:00
Dimitry Andric
5bfd8cf369 Fix unused variable warning in sctp_timer.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/sctp_timer.c:510:6: error: variable 'recovery_cnt' set but not used [-Werror,-Wunused-but-set-variable]
            int recovery_cnt = 0;
                ^

The 'recovery_cnt' variable is only used when INVARIANTS is undefined.
Ensure it is only declared and set in that case.

MFC after:	3 days
2022-07-25 22:08:28 +02:00
Dimitry Andric
9057feddc4 Fix unused variable warning in sctp_output.c
With clang 15, the following -Werror warning is produced:

    sys/netinet/sctp_output.c:9367:33: error: variable 'cnt_thru' set but not used [-Werror,-Wunused-but-set-variable]
            int no_fragmentflg, bundle_at, cnt_thru;
                                           ^

The 'cnt_thru' variable was in sctp_output.c when it was first added,
but appears to have been a debugging aid that has never been used, so
remove it.

MFC after:	3 days
2022-07-25 21:50:51 +02:00
Dimitry Andric
05b3a4282c Fix unused variable warnings in sctp_indata.c
With clang 15, the following -Werror warnings are produced:

    sys/netinet/sctp_indata.c:3309:6: error: variable 'tot_retrans' set but not used [-Werror,-Wunused-but-set-variable]
            int tot_retrans = 0;
                ^
    sys/netinet/sctp_indata.c:3842:20: error: variable 'resend' set but not used [-Werror,-Wunused-but-set-variable]
            int inflight = 0, resend = 0, inbetween = 0, acked = 0, above = 0;
                              ^
    sys/netinet/sctp_indata.c:3842:47: error: variable 'acked' set but not used [-Werror,-Wunused-but-set-variable]
            int inflight = 0, resend = 0, inbetween = 0, acked = 0, above = 0;
                                                         ^
    sys/netinet/sctp_indata.c:3842:58: error: variable 'above' set but not used [-Werror,-Wunused-but-set-variable]
            int inflight = 0, resend = 0, inbetween = 0, acked = 0, above = 0;
                                                                    ^

The 'tot_retrans' variable was used in sctp_strike_gap_ack_chunks(), but
refactoring in 493d8e5a83 got rid of it. Remove the variable since it
no longer serves any purpose.

The 'resend', 'acked', and 'above' variables are only used when
INVARIANTS is undefined. Ensure they are only declared and set in that
case.

MFC after:	3 days
2022-07-25 21:16:05 +02:00
Mike Karels
fb8ef16bab IPv4: correct limit on loopback_prefixlen
Commit efe58855f3 allowed the net.inet.ip.loopback_prefixlen value
to be 32.  However, with a 32-bit mask, 127.0.0.1 is not included
in the reserved loopback range, which should not be allowed.
Change the max prefix length to 31.
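
The arithmetic behind the limit can be checked with a small sketch.
The helper in_loopback_range() and the LOOPBACK_NET constant are
hypothetical and are not the kernel's implementation; addresses are in
host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

#define	LOOPBACK_NET	0x7f000000u	/* 127.0.0.0, host byte order */

/* Hypothetical helper: is addr inside 127.0.0.0/prefixlen? */
bool
in_loopback_range(uint32_t addr, unsigned int prefixlen)
{
	uint32_t mask;

	if (prefixlen > 32)
		return (false);
	/* A 32-bit shift by 32 is undefined in C, so treat /0 separately. */
	mask = (prefixlen == 0) ? 0 : ~(uint32_t)0 << (32 - prefixlen);
	return ((addr & mask) == (LOOPBACK_NET & mask));
}
```

With a prefix length of 31 the range still covers 127.0.0.0 and
127.0.0.1; with 32 it shrinks to 127.0.0.0 alone and excludes 127.0.0.1.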
2022-07-21 09:38:17 -05:00
Michael Tuexen
5b741298b1 tcp rack: fix switching to RACK when FIN has been sent
Fix the rack sendmap entry in case a FIN has been sent when the
stack is switched over to RACK.

Reported by:		syzbot+dd55e316428419e9354b@syzkaller.appspotmail.com
Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D35731
2022-07-19 20:28:25 +02:00
Richard Scheffenegger
66605ff791 tcp: Undo the increase in sequence number by 1 due to the FIN flag in case of a transient error.
If an error occurs while processing a TCP segment with some data and the FIN
flag, the back out of the sequence number advance does not take into account the
increase by 1 due to the FIN flag.
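
A minimal sketch of the bookkeeping, assuming a hypothetical
struct send_state with only a snd_nxt field and hypothetical helpers
advance() and back_out(); the real logic lives in the TCP output path
and is more involved:

```c
#include <stdbool.h>
#include <stdint.h>

struct send_state {
	uint32_t snd_nxt;	/* next sequence number to send */
};

/* Advance snd_nxt for a segment: payload bytes plus one for a FIN. */
void
advance(struct send_state *s, uint32_t len, bool fin)
{
	s->snd_nxt += len + (fin ? 1 : 0);
}

/* Undo the advance after a transient error; the FIN counts as well. */
void
back_out(struct send_state *s, uint32_t len, bool fin)
{
	s->snd_nxt -= len + (fin ? 1 : 0);
}
```

Backing out only len would leave snd_nxt one past where it started
whenever the failed segment carried a FIN.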

Reviewed By: jch, gnn, #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D2970
2022-07-14 03:18:19 +02:00
Mike Karels
efe58855f3 IPv4: experimental changes to allow net 0/8, 240/4, part of 127/8
Combined changes to allow experimentation with net 0/8 (network 0),
240/4 (Experimental/"Class E"), and part of the loopback net 127/8
(all but 127.0/16).  All changes are disabled by default, and can be
enabled by the following sysctls:

    net.inet.ip.allow_net0=1
    net.inet.ip.allow_net240=1
    net.inet.ip.loopback_prefixlen=16

When enabled, the corresponding addresses can be used as normal
unicast IP addresses, both as endpoints and when forwarding.

Add descriptions of the new sysctls to inet.4.

Add <machine/param.h> to vnet.h, as CACHE_LINE_SIZE is undefined in
various C files when in.h includes vnet.h.

The proposals motivating this experimentation can be found in

    https://datatracker.ietf.org/doc/draft-schoen-intarea-unicast-0
    https://datatracker.ietf.org/doc/draft-schoen-intarea-unicast-240
    https://datatracker.ietf.org/doc/draft-schoen-intarea-unicast-127

Reviewed by:	rgrimes, pauamma_gundo.com; previous versions melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D35741
2022-07-13 09:46:05 -05:00
Gleb Smirnoff
aeb6948d43 bbr: check proper flag for connection had been closed
An older version of D35663 slipped through final reviews.

Submitted by:	Peter Lei
Fixes:		74703901d8
2022-07-08 22:04:44 -07:00
Gleb Smirnoff
1b91978f63 tcp: remove a condition in tcp_usr_detach() that never happens
The comment from Robert Watson doubts that this condition ever happens.
Our analysis confirms that.  Also, we found that if you manage to create
such a connection with help of some other bug, then after the "second
case" code is executed, the kernel will panic in other part of the stack.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D35714
2022-07-06 21:09:45 -07:00
Mitchell Horne
258958b3c7 ddb: use _FLAGS command macros where appropriate
Some command definitions were forced to use DB_FUNC in order to specify
their required flags, CS_OWN or CS_MORE. Use the new macros to simplify
these.

Reviewed by:	markj, jhb
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35582
2022-07-05 11:56:55 -03:00
Gleb Smirnoff
d8596171c5 sockets: use only soref()/sorele() as socket reference count
o Retire SS_NOFDREF as it is basically a debug flag on top of already
  existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private.  The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues.  Until now the sockets on a queue had zero
reference count.  And the references were given only upon accept(2).  The
assumption was that there is no way to see the queued socket from anywhere
except its head.  This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists.  With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision:	https://reviews.freebsd.org/D35679
2022-07-04 12:40:51 -07:00
Gleb Smirnoff
74703901d8 tcp: use a TCP flag to check if connection has been close(2)d
The flag SS_NOFDREF is a private flag of the socket layer.  It also
is supposed to be read with SOCK_LOCK(), which we don't own here.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D35663
2022-07-04 12:40:51 -07:00
Gleb Smirnoff
ad3ad06477 blackhole(4): fix operator precedence
Fixes:	3ea9a7cf7b
2022-06-27 17:52:19 -07:00
Michael Tuexen
121ecca0d8 sctp: add KASSERTs to ensure correct handling of listeners
This was suggested by markj@.

MFC after:	3 days
2022-06-27 19:04:45 +02:00
Gleb Smirnoff
bafe71fd27 sctp: do not clobber listening socket with sockbuf operations
The problem was here since 779f106aa1, but a4fc41423f turned it
into a panic.

Reviewed by:	tuexen
Reported by:	syzkaller
2022-06-27 09:24:49 -07:00
Hans Petter Selasky
f5766992c0 tcp: Correctly compute the TCP goodput in bits per second by using SEQ_SUB().
TCP sequence number differences should be computed using SEQ_SUB().

Differential Revision:	https://reviews.freebsd.org/D35505
Reviewed by:	rscheff@
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-06-23 21:10:39 +02:00
Claudio Jeker
97453e5e72 Unlock inp when handling TCP_MD5SIG socket options
Unlock the inp when handling TCP_MD5SIG socket options. tcp_ipsec_pcbctl
handles locking the inp when the option is being modified.

This was found by Claudio Jeker while working on the OpenBGPd port.

On 14 we get a panic when trying to call getsockopt, on 13.1 the process
locks up using 100% CPU.

Reviewed by:	rscheff (transport), tuexen
MFC after:	3 days
Sponsored by:	Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D35532
2022-06-23 15:57:56 +01:00
Michael Tuexen
bf6c6162c7 tcp: fix TCPPCAP for kernels enabling VNET
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D35503
2022-06-15 23:28:54 +02:00
Michael Tuexen
ee9ee699d6 sctp: remove book keeping not needed anymore
MFC after:	3 days
2022-06-08 23:30:52 +02:00
Michael Tuexen
ad6ae52d1c sctp: cleanup, no functional change
MFC after:	3 days
2022-06-08 22:35:14 +02:00
Richard Scheffenegger
57317c8971 tcp: exclude KASSERTS when rescue retransmissions are in play.
The KASSERT criteria needs to be checked against the
sendbuffer so_snd in a subsequent version.

Reviewed By:	tuexen, #transport
PR:		263445
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D35431
2022-06-08 14:51:31 +02:00
Richard Scheffenegger
ce2525c810 tcp: remove goto and address another NULL deref in SACK
Missed another NULL dereference during KASSERTS after traversing
the scoreboard. While at it, scratch the goto by making the
traversal conditional, and remove duplicate checks using an
unconditional loop with all checks inside.

Reviewed By:	hselasky
PR:		263445
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D35428
2022-06-08 09:18:32 +02:00
Richard Scheffenegger
231e0dd5d1 tcp: skip sackhole checks on NULL
Inadvertently introduced NULL pointer dereference during
sackhole sanity check in D35387.

Reviewed By:	glebius
PR:		263445
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D35423
2022-06-07 18:18:42 +02:00
Richard Scheffenegger
91d6afe6e2 tcp: Sanity check of SACK holes on retransmissions
Adding a few KASSERT() to validate sanity of sack holes, and
bail out if sack hole is inconsistent to avoid panicking non-invariant builds.

Reviewed By:	hselasky, glebius
PR:		263445
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D35387
2022-06-07 09:38:16 +02:00
Arseny Smalyuk
81cac3906e ipfw: add support for radix tables and table lookup for MAC addresses
By analogy with IP address matching, add a way to use ipfw radix
tables for MAC matching. This is implemented using new ipfw table
with mac:radix type. Also there are src-mac and dst-mac lookup
commands added.

Usage example:
  ipfw table 1 create type mac
  ipfw table 1 add 11:22:33:44:55:66/48
  ipfw add skipto tablearg src-mac 'table(1)'
  ipfw add deny src-mac 'table(1, 100)'
  ipfw add deny lookup dst-mac 1

Note: sysctl net.link.ether.ipfw=1 should be set to enable ipfw
filtering on L2.

Reviewed by:	melifaro
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D35103
2022-06-04 19:12:29 +03:00
Gordon Bergling
32a01b2b86 rack: Fix a common typo in comments and a sysctl description
- s/multipler/multiplier/

MFC after:	3 days
2022-06-04 17:56:56 +02:00
Gordon Bergling
c93db89231 rack: Fix a typo in a source code comment
- s/enought/enough/

MFC after:	3 days
2022-06-04 15:32:59 +02:00
Gordon Bergling
bd9e23c0a9 rack: Fix a typo in a source code comment
- s/continous/continuous/

MFC after:	3 days
2022-06-04 13:27:29 +02:00
Michael Tuexen
a5c2009dd8 sctp: improve handling of sctp inpcb flags
Use an atomic operation when the inp is not write locked.

Reported by:	syzbot+bf27083e9a3f8fde8b4d@syzkaller.appspotmail.com
MFC after:	3 days
2022-06-04 07:38:19 +02:00
Gordon Bergling
21b923c330 tcp_rack: Fix two typos in sysctl descriptions
- s/higest/highest/

MFC after:	3 days
2022-06-04 11:24:18 +02:00
Hans Petter Selasky
28173d49dc tcp: Correctly compute the retransmit length for all 64-bit platforms.
When the TCP sequence number subtracted is greater than 2**32 minus
the window size, or 2**31 minus the window size, the use of unsigned
long as an intermediate variable, may result in an incorrect retransmit
length computation on all 64-bit platforms.

While at it create a helper macro to facilitate the computation of
the difference between two TCP sequence numbers.
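
To illustrate the failure mode, here is a sketch that contrasts a
64-bit intermediate with a 32-bit modular subtraction.  The SEQ_SUB()
macro below is written for this example and may differ from the
kernel's definition; len_wide() and len_seq() are hypothetical names.

```c
#include <stdint.h>

/* Signed difference of two 32-bit sequence numbers, modulo 2**32. */
#define	SEQ_SUB(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)))

/* Buggy variant: widening to 64 bits first loses the 32-bit wraparound. */
uint64_t
len_wide(uint32_t end, uint32_t start)
{
	return ((uint64_t)end - (uint64_t)start);
}

/* Correct variant: subtract in 32 bits, then interpret as signed. */
int32_t
len_seq(uint32_t end, uint32_t start)
{
	return (SEQ_SUB(end, start));
}
```

For a range that wraps, such as start 0xfffffff0 and end 5, the 32-bit
form yields a small positive length while the 64-bit form yields a
value close to 2**64.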

Differential Revision:	https://reviews.freebsd.org/D35388
Reviewed by:	rscheff
MFC after:	3 days
Sponsored by:	NVIDIA Networking
2022-06-03 10:49:17 +02:00
Arseny Smalyuk
d18b4bec98 netinet6: Fix mbuf leak in NDP
Mbufs leak when manually removing incomplete NDP records with pending packet via ndp -d.
It happens because lltable_drop_entry_queue() relies on the `la_numheld`
counter when dropping NDP entries (lles). It turned out NDP code never
increased `la_numheld`, so the actual free never happened.

Fix the issue by introducing unified lltable_append_entry_queue(),
common for both ARP and NDP code, properly addressing packet queue
maintenance.
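
The idea of a single append path can be sketched with mock types.
struct pkt, struct entry, entry_append() and entry_drop_queue() are
hypothetical stand-ins for mbufs, link-layer entries and the lltable
functions, and only model the counter bookkeeping:

```c
#include <stddef.h>

struct pkt {			/* hypothetical stand-in for an mbuf */
	struct pkt *next;
};

struct entry {			/* hypothetical stand-in for an lle */
	struct pkt *hold;	/* packets waiting for resolution */
	int numheld;		/* number of packets on the hold queue */
};

/* Single append path: queue and counter are always updated together. */
void
entry_append(struct entry *e, struct pkt *p)
{
	struct pkt **pp = &e->hold;

	while (*pp != NULL)
		pp = &(*pp)->next;
	p->next = NULL;
	*pp = p;
	e->numheld++;
}

/* Drop path trusts numheld; returns how many packets were released. */
int
entry_drop_queue(struct entry *e)
{
	int dropped = 0;

	while (e->numheld > 0 && e->hold != NULL) {
		e->hold = e->hold->next;
		e->numheld--;
		dropped++;
	}
	return (dropped);
}
```

If a caller queued packets without bumping numheld, the drop loop would
stop immediately and the queued packets would never be released.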

Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35365
MFC after:	2 weeks
2022-05-31 21:06:14 +00:00
KUROSAWA Takahiro
77001f9b6d lltable: introduce the llt_post_resolved callback
In order to decrease ifdef INET/INET6s in the lltable implementation,
introduce the llt_post_resolved callback and implement protocol-dependent
code in the protocol-dependent part.

Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35322
MFC after:	2 weeks
2022-05-30 10:53:33 +00:00
Michael Tuexen
a6a596e102 sctp: improve handling of listen() call
Fail the listen() call for 1-to-1 style sockets when the SCTP
association has been shutdown or aborted.

Reported by:	syzbot+6c484f116b9dc88f7db1@syzkaller.appspotmail.com
MFC after:	3 days
2022-05-29 20:40:30 +02:00
Dmitry Chagin
31d1b816fe sysent: Get rid of bogus sys/sysent.h include.
Where appropriate hide sysent.h under proper condition.

MFC after:	2 weeks
2022-05-28 20:52:17 +03:00
Michael Tuexen
2646cd0858 sctp: use a consistent view of the send parameters
Reported by:	syzbot+e26628a755f78bacff16@syzkaller.appspotmail.com
MFC after:	3 days
2022-05-28 19:35:58 +02:00
Michael Tuexen
e2ceff3028 sctp: ignore SCTP_SENDALL flag on 1-to-1 style sockets
MFC after:	3 days
2022-05-28 19:07:10 +02:00
Michael Tuexen
64b297e803 sctp: improve handling of send() when association is shutdown
Accept send() calls only when the association is not being
shut down or the explicit message EOR mode is used and the
application provides follow-up data.

Reported by:	syzbot+341e9ebd9d24ca7dc62a@syzkaller.appspotmail.com
MFC after:	3 days
2022-05-28 17:40:17 +02:00
Michael Tuexen
f21168e614 sctp: cleanup of error paths
MFC after:	3 days
2022-05-28 17:15:14 +02:00
Michael Tuexen
9cb70cb476 sctp: cleanup, no functional change except on error paths
MFC after:	3 days
2022-05-28 11:34:20 +02:00
Konrad Sewiłło-Jopek
c9a5c48ae8 arp: Implement sticky ARP mode for interfaces.
Provide sticky ARP flag for network interface which marks it as the
"sticky" one similarly to what we have for bridges. Once interface is
marked sticky, any address resolved using the ARP will be saved as a
static one in the ARP table. Such functionality may be used to prevent
ARP spoofing or to decrease latencies in Ethernet networks.

The drawbacks include potential limitations in usage of ARP-based
load-balancers and high-availability solutions such as carp(4).

The implemented option is disabled by default, therefore should not
impact the default behaviour of the networking stack.

Sponsored by:		Conclusive Engineering sp. z o.o.
Reviewed By:		melifaro, pauamma_gundo.com
Differential Revision: https://reviews.freebsd.org/D35314
MFC after:		2 weeks
2022-05-27 12:41:30 +00:00
Michael Tuexen
5cebd8305a sctp: more sb_cc related cleanups
No functional change intended. It allows a simpler patch for PR 260116.

MFC after:	3 days
2022-05-23 16:09:23 +02:00
Gleb Smirnoff
b46667c63e sockbuf: merge two versions of sbcreatecontrol() into one
No functional change.
2022-05-17 10:10:42 -07:00
Michael Tuexen
edc5b6ea88 sctp: use sb_avail() when accessing sb_acc for reading
This is a cleanup to simplify a patch for PR 260116.

PR:		260116
MFC after:	3 days
2022-05-14 12:38:43 +02:00
Michael Tuexen
f210e4fbc5 sctp: cleanup, no functional change intended
MFC after:	3 days
2022-05-14 08:30:41 +02:00
Michael Tuexen
aab6e5bd1e sctp: improve path verification
Ensure that a HB can be sent faster than a HB.Interval when performing
path verification of a reachable peer address.

Thanks to Alexander Funke for finding the issue and proposing a fix.

MFC after:	3 days
2022-05-14 08:07:28 +02:00
Michael Tuexen
9312ba239e sctp: improve path verification
When sending path confirmation heartbeats, do not take HB.interval
into account when the path is still reachable.

Thanks to Alexander Funke for finding the issue and suggesting a fix.

MFC after:	3 days
2022-05-14 08:05:03 +02:00
Michael Tuexen
9b2a35b3a9 sctp: improve consistency
No functional change intended.

MFC after:	3 days
2022-05-14 06:28:19 +02:00
Mitchell Horne
38a36057ae netdump: check the support status of the interface
If the interface does not support debugnet(4) we should bail early,
rather than having the user find this out at the time of the panic.
dumpon(8) already expects this return value and will print a helpful
error message.

Reviewed by:	cem, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D35180
2022-05-14 10:27:53 -03:00
Gleb Smirnoff
808b7d80e0 mbuf: remove PH_vt alias for mbuf packet header persistent shared data
Mechanical sed change s/PH_vt\.vt_nrecs/vt_nrecs/g
2022-05-13 13:32:43 -07:00
Mitchell Horne
489ba22236 kerneldump: remove physical argument from d_dumper
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35173
2022-05-13 10:42:48 -03:00
Gleb Smirnoff
4581cffb3d sockets: fix build, convert missed sbreserve_locked() calls
Fixes:	4328318445
2022-05-12 14:29:19 -07:00
Gleb Smirnoff
4328318445 sockets: use socket buffer mutexes in struct socket directly
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it.  In 74a68313b5 macros that access
this mutex directly were added.  Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally.  However, it allows operation of a
protocol that doesn't use them.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35152
2022-05-12 13:22:12 -07:00
Randall Stewart
04831efd9f tcp: Rack idle reduce not working.
Rack converted to micro-seconds quite some time ago, but in testing
we have found a miss in that work. The idle reduce time is still based
in ticks, so it must be converted to microseconds before any comparisons
else you will likely not do idle reduce.
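
A sketch of the unit mismatch, assuming hypothetical helpers
ticks_to_usecs() and idle_exceeds() and a nonzero hz (ticks per
second); the names are not taken from rack.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a duration in ticks to microseconds; hz must be nonzero. */
uint64_t
ticks_to_usecs(uint32_t ticks, uint32_t hz)
{
	return ((uint64_t)ticks * 1000000u / hz);
}

/* Has the connection idled longer than a limit expressed in ticks? */
bool
idle_exceeds(uint64_t idle_usecs, uint32_t limit_ticks, uint32_t hz)
{
	return (idle_usecs > ticks_to_usecs(limit_ticks, hz));
}
```

Comparing the two raw numbers without the conversion mixes units and
gives the wrong answer for most inputs.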

Reviewed by: tuexen, thj
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D35066
2022-05-10 09:46:05 -04:00
Kristof Provost
017e7d0390 in_rss: fix set but not used warning
If 'options RSS' is set.

MFC after:	1 week
Sponsored by:	Orange Business Services
2022-05-07 18:17:33 +02:00
Michael Tuexen
490a0f77de sctp: improve locking
While there, do some cleanup.

Reported by:	syzbot+f475e054c454310bc26d@syzkaller.appspotmail.com
MFC after:	3 days
2022-04-27 16:07:31 +02:00
Michael Tuexen
89c6aba7cf sctp: cleanup
MFC after:	3 days
2022-04-19 21:40:22 +02:00
Michael Tuexen
868868f14e sctp: improve stopping of timers
Reported by:	syzbot+c9c70062320aaad19de7@syzkaller.appspotmail.com
MFC after:	3 days
2022-04-19 21:29:41 +02:00
Alan Somers
8c47d8f538 prometheus_sysctl_exporter: fix metric aliasing
When exporting sysctls to Prometheus, the exporter replaces "." with
"_".  This caused several metrics to alias, confusing the Prometheus
server.  Fix it by:

* Renaming the "tcp_log_bucket" UMA zone to "tcp_log_id_bucket".  Also,
  rename "tcp_log_node" to "tcp_log_id_node" for consistency.

* Not exporting sysctls with "(LEGACY)" in the description.  That is
  used by ZFS sysctls that have been replaced by others, many of which
  alias to the same Prometheus metric name (like "vfs.zfs.arc_max" and
  "vfs.zfs.arc.max").

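The aliasing itself is easy to reproduce in a sketch.  metric_name()
and names_alias() are hypothetical helpers that model only the
"." to "_" replacement; the real exporter's naming does more than this.

```c
#include <stddef.h>
#include <string.h>

/* Copy a sysctl name into out, replacing '.' with '_'. */
void
metric_name(const char *sysctl, char *out, size_t outlen)
{
	size_t i;

	if (outlen == 0)
		return;
	for (i = 0; sysctl[i] != '\0' && i + 1 < outlen; i++)
		out[i] = (sysctl[i] == '.') ? '_' : sysctl[i];
	out[i] = '\0';
}

/* Return 1 if two sysctl names map to the same metric name. */
int
names_alias(const char *x, const char *y)
{
	char a[128], b[128];

	metric_name(x, a, sizeof(a));
	metric_name(y, b, sizeof(b));
	return (strcmp(a, b) == 0);
}
```

Both "vfs.zfs.arc_max" and "vfs.zfs.arc.max" come out as
"vfs_zfs_arc_max", which is the collision the commit avoids.
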
PR:		259607
Reported by:	delphij
MFC after:	2 weeks
Sponsored by:	Axcient
Reviewed by:	delphij,rew,thj
Differential Revision: https://reviews.freebsd.org/D34952
2022-04-19 06:56:39 -06:00
Mateusz Guzik
b338b1fd50 tcp: plug set-but-not-used vars
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 12:45:57 +00:00
Mateusz Guzik
db2ce6914b sctp: plug set-but-not-used vars
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 12:45:57 +00:00
Michael Tuexen
a12d89332e sctp: hold the inp lock while calling ip6_output
This fixes an issue with handling IPPROTO_IPV6 level socket
options.

Reported by:	syzbot+66ede232c3d1271c6226@syzkaller.appspotmail.com
MFC after:	3 days
2022-04-19 13:03:08 +02:00
Mateusz Guzik
0fd5c29944 tcp/rack: plug a set-but-not-used var
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 09:33:35 +00:00
Michael Tuexen
bbf3bf3211 sctp: cleanup
MFC after:	3 days
2022-04-16 21:03:16 +02:00
Michael Tuexen
5fbf11f703 sctp: fix typo introduced in last commit
MFC after:	3 days
2022-04-16 19:55:33 +02:00
Michael Tuexen
3dc57df91e sctp: don't wakeup 1-to-1 listening sockets for data or notifications
Reported by:	syzbot+ec9279d306a4ff0215f8@syzkaller.appspotmail.com
Reported by:	syzbot+31d54f6d486333493dd4@syzkaller.appspotmail.com
MFC after:	3 days
2022-04-16 19:42:27 +02:00
Mitchell Horne
0a90043e63 Remove 12.x ABI compat for kernel dump ioctls
This code was marked gone_in(14), so it can now be removed.

The only consumer of this interface is dumpon(8). We do not maintain
strict backwards compatibility for this utility because a) it
can't/shouldn't be used from a jail or chroot and b) it is a highly
specific interface unique to FreeBSD. The host's (presumably more
up-to-date) copy of dumpon(8) should be used to configure kernel dump
devices.

Reviewed by:	markj, emaste
MFC after:	never
Differential Revision:	https://reviews.freebsd.org/D34914
2022-04-15 12:06:05 -03:00
Mitchell Horne
9c90bfcd31 Remove 11.x ABI compat for kernel dump ioctls
This code was marked gone_in(13), so its time has passed.

The only consumer of this interface is dumpon(8). We do not maintain
strict backwards compatibility for this utility because a) it
can't/shouldn't be used from a jail or chroot and b) it is a highly
specific interface unique to FreeBSD. The host's (presumably more
up-to-date) copy of dumpon(8) should be used to configure kernel dump
devices.

Reviewed by:	markj, emaste
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D34913
2022-04-15 12:06:04 -03:00
Michael Tuexen
eeba222172 sctp: don't keep a pointer to a freed stcb around
Reported by:	syzbot+b9ef06efdae7cb9ee414@syzkaller.appspotmail.com
Reported by:	syzbot+b1e4793e0e6b25b0d510@syzkaller.appspotmail.com
MFC after:	3 days
2022-04-15 14:00:00 +02:00
Michael Tuexen
e0127ea4c6 sctp: improve locking
Hold a refcount while giving up an stcb lock. This issue was
found by running syzkaller.

MFC after:	3 days
2022-04-15 13:58:45 +02:00
Randall Stewart
6edfc10ca5 tcp: adding a functionality to define "trace points" so that BB logging can be enabled at specific events.
This commit adds a new concept to rack: tracepoints. A tracepoint
is a numbered marker placed in the code at a point that might be of interest
(3 are included in this initial patch). The developer numbers
the point in the tcp_rack.h file and can then use sysctl to enable that trace
point (or all of them). A limit is also placed on how many connections will
turn on BB logging, so that a box is not overrun by it.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34898
2022-04-14 16:07:34 -04:00
Randall Stewart
6e6439b238 tcp - hpts timing is off when we are above 1200 connections.
HPTS timing begins to go off when we reach the connection threshold (1200 by default).
Above it, any returning syscall or LRO stops finding the oldest hpts thread that
has not run and instead uses the CPU it is on. This often leaves
hpts threads not running for extended periods of time, which
causes heartburn if you are pacing in tcp. On top of that, where AMD's
podded L3 cache may have sets of 8 CPUs that share an L3, hpts is unaware of this
and thus on AMD you can generate a lot of cache misses.

So to fix this we get rid of the CPU mode and always use the oldest. We also make
HPTS aware of the CPU topology and keep the "oldest" within the same L3 cache.
This also works nicely for NUMA when coupled with Drew's earlier NUMA changes.

Reviewed by: glebius, gallatin, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34916
2022-04-14 16:04:08 -04:00
Michael Tuexen
2486a7c0c7 sctp: cleanup
MFC after:	3 days
2022-04-14 21:52:25 +02:00
John Baldwin
39f7de587b divert_packet: ip is only used for SCTP.
2022-04-13 16:08:23 -07:00
John Baldwin
fe5324aca0 in_pcballoc: error is only used for IPSEC or MAC.
2022-04-13 16:08:23 -07:00
John Baldwin
f328c46fdd TCP sysctl handlers: fin and lin are only used for INET.
2022-04-13 16:08:21 -07:00
John Baldwin
700a395c58 tcp_log_vain/addrs: Use a const pointer for the IPv4 header.
The pointer to the IPv6 header was already const.
2022-04-13 16:08:21 -07:00
John Baldwin
13ec6858d6 tcp_log_addr: ip is only used for INET.
2022-04-13 16:08:21 -07:00
John Baldwin
29a843177e sctp: #ifdef INET-only and INET6-only variables.
Duplicating the SCTP_PCB_FLAGS_BOUND_V6 check made the #ifdef's
simpler than applying #ifdef's directly to the original code.  Modern
compilers should cache the result rather than testing the flag twice.
2022-04-13 16:08:21 -07:00
John Baldwin
732b6d4d50 netinet: Use __diagused for variables only used in KASSERT().
2022-04-13 16:08:19 -07:00
Michael Tuexen
595ac4a118 sctp: fix parameter type in NAT status message
Thanks to Sriram Yagnaraman for providing the patch for the
userland stack.

MFC after:	3 days
2022-04-13 19:46:28 +02:00
Richard Scheffenegger
033718abc8 tcp: Whitespace cleanup in bbr and rack
Whitespace cleanup (leading spaces to tabs)
Nicefy function definitions with indentations

No functional change

Reviewed By: #transport, thj
Sponsored by:   NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30043
2022-04-13 12:49:57 +02:00
John Baldwin
86fa80f320 rack: Remove unused variable.
2022-04-12 14:59:00 -07:00
John Baldwin
90948e8c2e sctp: Remove unused variable.
2022-04-12 14:58:59 -07:00
John Baldwin
bab34d6349 in_pcboutput_txrtlmt: Remove unused variable.
2022-04-12 14:58:59 -07:00
Kristof Provost
742e7210d0 udp: allow udp_tun_func_t() to indicate it did not eat the packet
Allow udp tunnel functions to indicate they have not taken ownership of
the packet, and that normal UDP processing should continue.

This is especially useful for scenarios where the kernel has taken
ownership of a socket that was originally created by userspace. It
allows the tunnel function to pass through certain packets for userspace
processing.

The primary user of this is if_ovpn, when it receives messages from
unknown peers (which might be a new client).

Reviewed by:	tuexen
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34883
2022-04-12 10:04:59 +02:00
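The "did not eat the packet" contract from the udp_tun_func_t() commit above can be sketched with a callback that returns a boolean. Everything here is a simplified, hypothetical model: `packet_model`, `tun_func_model_t`, `known_peer_tunnel`, and `udp_deliver_model` are illustrative names, not the kernel's types or signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a received datagram. */
struct packet_model {
	int	peer_id;
};

/* Hypothetical tunnel callback: returns true if it consumed the packet. */
typedef bool tun_func_model_t(struct packet_model *pkt, void *ctx);

enum delivery { TO_TUNNEL, TO_SOCKET };

/*
 * Example callback: only packets from the one known peer are consumed.
 * Packets from unknown peers are handed back to the caller.
 */
static bool
known_peer_tunnel(struct packet_model *pkt, void *ctx)
{
	const int *known_peer = ctx;

	assert(pkt != NULL && known_peer != NULL);
	return (pkt->peer_id == *known_peer);
}

/*
 * Model of the input path: if a tunnel callback is installed and it
 * takes the packet, stop there; otherwise continue with normal socket
 * delivery.
 */
static enum delivery
udp_deliver_model(struct packet_model *pkt, tun_func_model_t *tun, void *ctx)
{
	if (tun != NULL && tun(pkt, ctx))
		return (TO_TUNNEL);
	return (TO_SOCKET);
}
```

In this model a packet from an unknown peer falls through to the socket, which matches the if_ovpn use case the commit mentions, where userspace handles messages from peers the kernel does not know yet.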
Mike Karels
6ca0ca7b4c IPv4 multicast: fix LOR in shutdown path
X_ip_mrouter_done() was calling the interface ioctl routines via
if_allmulti() while holding a write lock.  However, some interface
ioctl routines, including em/iflib and tap, use sxlocks, which are
not permitted while holding a non-sleepable lock, and this elicits
a warning from WITNESS.  Fix the locking issue by recording the
affected interface pointers in a malloc'ed array, and call
if_allmulti() on each after dropping the rwlock.

Reviewed by:	bz
Differential Revision: https://reviews.freebsd.org/D34845
2022-04-11 14:51:16 -05:00
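The fix in the IPv4 multicast LOR commit above follows a common pattern: collect the affected objects while the lock is held, drop the lock, then make the calls that may sleep. A minimal single-threaded sketch follows; the `table_locked` flag stands in for the rwlock, and `ifnet_model`, `allmulti_off`, and `mrouter_done_model` are hypothetical names rather than kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an interface used by the multicast router. */
struct ifnet_model {
	int	allmulti;		/* allmulti reference count */
	bool	used_for_mroute;
};

/* Stand-in for the non-sleepable rwlock; true while "held". */
static bool table_locked;

/* Stand-in for the interface ioctl path, which may sleep. */
static void
allmulti_off(struct ifnet_model *ifp)
{
	assert(!table_locked);	/* must not be called with the lock held */
	ifp->allmulti--;
}

/*
 * Record the affected interfaces under the lock, then call into the
 * interface code only after the lock has been dropped.  Returns the
 * number of interfaces processed, or -1 if allocation fails.
 */
static int
mrouter_done_model(struct ifnet_model *ifs, size_t n)
{
	struct ifnet_model **pending;
	size_t i, npending = 0;

	if (n == 0)
		return (0);
	pending = malloc(n * sizeof(*pending));
	if (pending == NULL)
		return (-1);

	table_locked = true;
	for (i = 0; i < n; i++) {
		if (ifs[i].used_for_mroute) {
			ifs[i].used_for_mroute = false;
			pending[npending++] = &ifs[i];
		}
	}
	table_locked = false;

	for (i = 0; i < npending; i++)
		allmulti_off(pending[i]);
	free(pending);
	return ((int)npending);
}
```

The assertion in `allmulti_off` plays the role WITNESS plays in the commit: it fires if the sleepable path is entered while the lock is still held.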
Andrey V. Elsukov
7d98cc096b Fix ipfw fwd that doesn't work in some cases
For IPv4 use dst pointer as destination address in fib4_lookup().
It keeps destination address from IPv4 header and can be changed
when PACKET_TAG_IPFORWARD tag was set by packet filter.

For IPv6 override destination address with address from dst_sa.sin6_addr,
that was set from PACKET_TAG_IPFORWARD tag.

Reviewed by:	eugen
MFC after:	1 week
PR:		256828, 261697, 255705
Differential Revision: https://reviews.freebsd.org/D34732
2022-04-11 14:16:43 +03:00
Gordon Bergling
2dd0c2bc7f tcp_bbr(4): Fix a typo in a source code comment
- s/possiblity/possibility/

MFC after:	3 days
2022-04-09 13:26:20 +02:00
Gordon Bergling
addb2c6585 tcp_rack: Fix a typo in a source code comment
- s/possiblity/possibility/

MFC after:	3 days
2022-04-09 13:25:50 +02:00
Gordon Bergling
1cfd924f4e libalias(3): Fix two typos in source code comments
- s/modfied/modified/

MFC after:	3 days
2022-04-09 09:14:00 +02:00
Gordon Bergling
36814092d4 tcp_rack: Fix a few typos in sysctl descriptions and comments
- s/postion/position/
- s/postions/positions/
- s/repostion/reposition/

MFC after:	5 days
2022-04-09 09:13:10 +02:00
Gordon Bergling
1f2aaef29a tcp_htps: Fix a typo in a source code comment
- s/postion/position/

MFC after:	3 days
2022-04-09 09:12:58 +02:00
Gordon Bergling
665709016d tcp_bbr(4): Fix two typos in source code comments
- s/postive/positive/
- s/postion/position/

MFC after:	3 days
2022-04-09 09:12:48 +02:00
John Baldwin
3b994db74b Use stub inline functions for no-op versions of tcp_fastopen*().
Inline functions "use" variables passed as arguments unlike empty
macros appeasing compiler warnings about unused variables.
2022-04-08 17:25:13 -07:00
Gordon Bergling
4d6883cbe2 tcp_bbr(4): Fix a typo in a sysctl description and a comment
- s/postive/positive/

MFC after:	5 days
2022-04-08 21:08:18 +02:00
Mark Johnston
990a6d18b0 net: Fix memory leaks in lltable_calc_llheader() error paths
Also convert raw epoch_call() calls to lltable_free_entry() calls, no
functional change intended.  There's no need to asynchronously free the
LLEs in that case to begin with, but we might as well use the lltable
interfaces consistently.

Noticed by code inspection; I believe lltable_calc_llheader() failures
do not generally happen in practice.

Reviewed by:	bz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34832
2022-04-08 11:47:25 -04:00
Mark Johnston
dd91d84486 net: Fix LLE lock leaks
Historically, lltable_try_set_entry_addr() would release the LLE lock
upon failure.  After some refactoring, it no longer does so, but
consumers were not adjusted accordingly.

Also fix a leak that can occur if lltable_calc_llheader() fails in the
ARP code, but I suspect that such a failure can only occur due to a code
bug.

Reviewed by:	bz, melifaro
Reported by:	pho
Fixes:		0b79b007eb ("[lltable] Restructure nd6 code.")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34831
2022-04-08 11:46:19 -04:00
Michael Tuexen
d7224a53b3 sctp: remove a mutex not used anymore
MFC after:	3 days
2022-04-07 17:54:57 +02:00
Michael Tuexen
3c3d77bdff sctp: use variable names in a consistent way
No functional change intended.

MFC after:	3 days
2022-04-07 17:51:31 +02:00
Tom Jones
1241e8e7ae siftr: expose t_flags2 in siftr output
Replace the old snd_bwnd field which was kept for compatibility with the
t_flags2 field from the tcpcb. This exposes in siftr logs interesting
things such as ECN, PLPMTUD, Accurate ECN and if first bytes are
complete.

Reviewed by:	rscheff (transport), chengc_netapp.com,  debdrup (manpages)
Sponsored by:   NetApp, Inc.
Sponsored by:   Klara, Inc.
X-NetApp-PR:    #73
Differential Revision:	https://reviews.freebsd.org/D34672
2022-04-07 10:17:09 +01:00
John Baldwin
6454d0c8cb libalias: Remove unused variables.
2022-04-06 16:45:29 -07:00
John Baldwin
3f6d3f0285 alias_nbt: Move debug-only variable under #ifdef LIBALIAS_DEBUG.
2022-04-06 16:45:29 -07:00
John Baldwin
8b6ccfb6c7 multicast code: Quiet unused warnings for variables used for KTR traces.
For nallow and nblock, move the variables under #ifdef KTR.

For return values from functions logged in KTR traces, mark the
variables as __unused rather than having to #ifdef the assignment of
the function return value.
2022-04-06 16:45:28 -07:00
Michael Tuexen
ccdfd621d0 tcp cc: don't recurse on non recursive mutex
This issue was found by syzkaller.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D34743
2022-04-05 13:52:36 +02:00
Navdeep Parhar
08c7f1b6d4 Fix typo (interrups -> interrupts) in a sysctl description in tcp_lro.c.
MFC after:	3 days
2022-04-04 13:48:32 -07:00
Michael Tuexen
52106f072f sctp: don't refer to a potentially outdated stream
Reported by:	syzbot+1593381019112e5bb35c@syzkaller.appspotmail.com
MFC after:	3 days
2022-04-02 23:26:27 +02:00
Michael Tuexen
b30b7a140c sctp: cleanup, no functional change
MFC after:	3 days
2022-04-02 23:02:16 +02:00
Michael Tuexen
0f31631620 sctp: remove a test, which isn't safe
We can't ensure the stcb is still around. This issue was found
by syzkaller.

MFC after:	3 days
2022-04-02 15:09:50 +02:00
Michael Tuexen
d4290f7e62 Revert "sctp: remove a test, which isn't safe"
It included unrelated changes still under review.
This reverts commit b1fe92b28b.
2022-04-02 14:49:14 +02:00
Michael Tuexen
b1fe92b28b sctp: remove a test, which isn't safe
We can't ensure the stcb is still around. This issue was found
by syzkaller.

MFC after:	3 days
2022-04-02 14:44:06 +02:00
Gordon Bergling
942e8cab8c netinet: Fix a typo in a source code comment
- s/exisitng/existing/

MFC after:	3 days
2022-04-02 14:39:32 +02:00
Gordon Bergling
8d30ef92d5 khelp(9): Fix a typo in a source code comment
- s/measurment/measurement/

MFC after:	3 days
2022-04-02 14:10:59 +02:00
Gordon Bergling
17628f1b79 cc_vegas(4): Fix a typo in a source code comment
- s/measurment/measurement/

MFC after:	3 days
2022-04-02 14:07:44 +02:00
Michael Tuexen
39a22011bb sctp: clear pointer to stack when returning from function.
Reported by:    syzbot+04cee5d8805dfbb63c06@syzkaller.appspotmail.com
Reported by:    syzbot+71e7e33dfc3cc39a6bd0@syzkaller.appspotmail.com
Reported by:    syzbot+6c36fc3c1bd03ed96107@syzkaller.appspotmail.com
Reported by:    syzbot+198b3751c158181c47de@syzkaller.appspotmail.com
2022-04-02 00:54:49 +02:00
Randall Stewart
e88412d89b Oops sorry, typo in the cc_cubic fix when morphing it from nreno.
2022-04-01 08:37:04 -04:00
Randall Stewart
653cf466f0 hystart++ may not properly exit CSS back to slowstart.
In the changes to get hystart++ into cubic, a line was inadvertently
removed from the conditional that decides whether you need to exit
hystart++ back to slowstart. That line is of course the most crucial
one (the others are valid but not critical): is the new rtt
less than the point where we entered hystart++? Without the line
we end up bouncing in and out of CSS.

Reported By: Reese Enghardt
Sponsored By: Netflix Inc.
2022-04-01 08:33:44 -04:00
Randall Stewart
ee1a08b8da rack may end up with a stuck connection if the rwnd is collapsed on sent data.
There is a case where rack will get stuck when it has outstanding data and
the peer collapses the rwnd down to 0. This leaves the session hung if
the rwnd update is not received. You can test this with the packet drill script
below. Without this fix the connection gets stuck and hangs; with it, we retransmit everything.
This also fixes the mtu retransmit case so we don't go into recovery when
the mtu is changed to a smaller value.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34573
2022-04-01 08:29:27 -04:00
George V. Neville-Neil
ca4cd20c4a Address issue pointed out in CVE-2020-25705
Add jitter to the ICMP bandwidth limit to deny a side-channel port scan.

Reviewed by:	kp, philip, cy, emaste
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27354
2022-03-31 16:45:50 +02:00
Michael Tuexen
218e463b85 sctp: ensure that ASCONF chunks are not too large
MFC after:	3 days
2022-03-30 01:22:20 +02:00
Michael Tuexen
e7e65008ff sctp: fix typos
Thanks to David Sanders for fixing the typos in the userland stack.

MFC after:	3 days
2022-03-29 21:09:51 +02:00
Michael Tuexen
5d0c76c730 sctp: don't lock an already locked stcb.
Reported by:	syzbot+e8dca84da3b4b82f4400@syzkaller.appspotmail.com
MFC after:	3 days
2022-03-29 16:33:53 +02:00
Michael Tuexen
5ac91821f5 sctp: get rid of stcb send lock
Just use the stcb lock instead to simplify locking.

Reported by:	syzbot+d00b202063150f85b110@syzkaller.appspotmail.com
Reported by:	syzbot+87f268a0a6d2d6383306@syzkaller.appspotmail.com
MFC after:	3 days
2022-03-29 01:50:17 +02:00
Gordon Bergling
75fdc440c8 extra_tcp_stacks: Fix two typos in source code comments
- s/recusive/recursive/

MFC after:	3 days
2022-03-28 19:34:45 +02:00
Mike Karels
04cd74b4cd IPv4 multicast: fix netstat -g
The vif structure includes fields at the end which are #ifdef KERNEL,
so the structure size differs between kernel and
user level.  netstat -g failed with an ENOMEM on the sysctl to fetch
the vif table.  Change the vif sysctl code in ip_mroute to copy out
only the user-level-visible portion of each table entry.

Reviewed by:	bz, wma
Differential Revision: https://reviews.freebsd.org/D34627
2022-03-22 07:38:01 -05:00
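The approach in the netstat -g commit above, copying out only the user-visible prefix of each table entry, can be sketched with `offsetof`. The structure layout and the names `vif_model`, `VIF_USER_SIZE`, and `export_vifs` are hypothetical; the real vif structure and sysctl handler differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical table entry.  The last two members stand in for the
 * fields that only the kernel sees.
 */
struct vif_model {
	unsigned char	flags;
	unsigned char	threshold;
	unsigned long	pkts_in;
	void		*kernel_lock;	/* kernel-only tail starts here */
	void		*kernel_ring;
};

/* Size of the portion that userland also knows about. */
#define	VIF_USER_SIZE	offsetof(struct vif_model, kernel_lock)

/*
 * Copy only the user-visible prefix of each entry into out, packed
 * back to back.  Returns the number of bytes written, or 0 if the
 * buffer is too small.
 */
static size_t
export_vifs(const struct vif_model *tbl, size_t n, unsigned char *out,
    size_t outlen)
{
	size_t i;

	assert(tbl != NULL && out != NULL);
	if (outlen < n * VIF_USER_SIZE)
		return (0);
	for (i = 0; i < n; i++)
		memcpy(out + i * VIF_USER_SIZE, &tbl[i], VIF_USER_SIZE);
	return (n * VIF_USER_SIZE);
}
```

Because each exported record is `VIF_USER_SIZE` bytes, a consumer that sizes its buffer from the user-level definition receives exactly the amount of data it expects.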
Mike Karels
2cf1e120c6 Enter epoch when adding IPv4 multicast forwarding cache entry
The code path from the IPv4 multicast setsockopt could call ip_output()
without entering an epoch.  Specifically, the MRT_ADD_MFC setsockopt
would call add_mfc(), which in turn called ip_mdq() to send queued
packets.  This resulted in an epoch assert failure in ip_output().
Enter an epoch in add_mfc(), and add some epoch asserts to check
for similar failures.

Reviewed by:	kp, bz, wma, cy
Differential Revision: https://reviews.freebsd.org/D34624
2022-03-22 07:28:57 -05:00
Mark Johnston
9f70c04da4 rip: Fix a -Wunused-but-set-variable warning
Fixes:		81728a538d ("Split rtinit() into multiple functions.")
Reviewed by:	imp, melifaro
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D34395
2022-03-01 09:39:43 -05:00
Richard Scheffenegger
2ff07d9220 tcp: Restore correct ECT marking behavior on SACK retransmissions
While coalescing all ECN-related code into new common source files,
the flag to deal with SACK retransmissions was skipped. This leads
to non-compliant ECT-marking of SACK retransmissions, as well as
the premature sending of other TCP ECN flags (CWR).

Reviewed By: rrs, #transport
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34376
2022-02-25 20:05:32 +01:00
Randall Stewart
a43b0aca12 tcp: Push bit failure to set in fastpath
Recently changes were made to the tcp stack to use a macro/function
to set tcp flags. In the process the PUSH bit setting in the fastpath of
rack was broken. This fixes that and also cleans up a warning that
occurs when you don't have INVARIANTS on (inp is only used in a KASSERT).

The tcp test suite can find this bug; the test plan shows the script
that fails due to the missing push bit.

Reviewed by: rscheff, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34332
2022-02-23 16:25:56 -05:00
Randall Stewart
ea9017fb25 tcp: Congestion control move to using reference counting.
In the transport call on 12/3 Gleb asked to move the CC modules towards
using reference counting to prevent folks from unloading a module in use.
It was also agreed that Michael would do a user space utility like tcp_drop
that could be used to move all connections that are using a specific CC
to some other CC.

This is the half I committed to doing: we take a reference
on a cc module every time a pcb refers to it and drop that reference every
time a pcb no longer uses the module. This also helps us simplify the
whole unloading process by getting rid of tcp_ccunload(), which munged
through all the tcbs. Instead we mark a module as being removed and
prevent further references to it. We also make sure that a module
marked as being removed cannot be made the default and, conversely,
that an attempt to remove the default module fails and does not mark
it as being removed.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33249
2022-02-21 06:30:17 -05:00
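The rules stated in the congestion control refcounting commit above can be modeled in a few lines. This is a hedged, single-threaded sketch with hypothetical names (`cc_module_model`, `cc_attach`, `cc_detach`, `cc_mark_removing`, `cc_set_default`, `cc_can_unload`); it leaves out all locking, and the "unload only at zero references" check is an assumption drawn from the stated goal of not unloading a module that is in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-module state. */
struct cc_module_model {
	unsigned int	refcount;	/* pcbs currently using the module */
	bool		removing;	/* marked as being removed */
	bool		is_default;
};

/* A pcb starts using the module; refused once removal has begun. */
static int
cc_attach(struct cc_module_model *cc)
{
	assert(cc != NULL);
	if (cc->removing)
		return (-1);
	cc->refcount++;
	return (0);
}

/* A pcb stops using the module. */
static void
cc_detach(struct cc_module_model *cc)
{
	assert(cc != NULL && cc->refcount > 0);
	cc->refcount--;
}

/* Begin removal; fails for the default module and leaves it unmarked. */
static int
cc_mark_removing(struct cc_module_model *cc)
{
	assert(cc != NULL);
	if (cc->is_default)
		return (-1);
	cc->removing = true;
	return (0);
}

/* A module that is being removed cannot become the default. */
static int
cc_set_default(struct cc_module_model *cc)
{
	assert(cc != NULL);
	if (cc->removing)
		return (-1);
	cc->is_default = true;
	return (0);
}

/* Assumption: unloading is safe only once no pcb holds a reference. */
static bool
cc_can_unload(const struct cc_module_model *cc)
{
	assert(cc != NULL);
	return (cc->removing && cc->refcount == 0);
}
```

Compared with walking every tcb at unload time, this keeps the bookkeeping at the points where a pcb attaches to or detaches from a module.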
Michael Tuexen
bdb99f6f5e sctp: remove KASSERT() which not always holds
Reported by:	syzbot+c907045aed2043011f3c@syzkaller.appspotmail.com
MFC after:	3 days
2022-02-20 15:59:21 +01:00
Michael Tuexen
e255f0c9fb sctp: make sure new locking requirements are satisfied.
Reported by:	syzbot+cd3c1dd64861b8c200bd@syzkaller.appspotmail.com
MFC after:	3 days
2022-02-20 15:36:26 +01:00
Michael Tuexen
2f0656fb9b sctp: don't hold the assoc create lock longer than needed
Reported by:	syzbot+c738e3df67cf425c49a2@syzkaller.appspotmail.com
MFC after:	3 days
2022-02-20 14:55:41 +01:00
Michael Tuexen
a4a31271cc sctp: cleanup sctp_lower_sosend
This is a preparation for retiring the stcb send lock in the
next step.

MFC after:	3 days
2022-02-20 01:09:30 +01:00
Michael Tuexen
fd0d53f85c sctp: improve robustness
MFC after:	3 days
2022-02-18 14:30:07 +01:00
Michael Tuexen
274a0e4a8d sctp: cleanup, no functional change intended.
MFC after:	3 days
2022-02-18 14:20:01 +01:00
Michael Tuexen
3ca204c97a sctp: remove unused parameter
MFC after:	3 days
2022-02-18 12:20:44 +01:00
Michael Tuexen
11c4d4b966 sctp: fix a signed/unsigned mismatch.
MFC after:	3 days
2022-02-17 22:45:57 +01:00
Michael Tuexen
76e03cc940 sctp: avoid undefined behaviour and cleanup the code.
MFC after:	3 days
2022-02-17 19:23:59 +01:00
Kristof Provost
995cba5a0c netinet: allow UDP tunnels to be removed
udp_set_kernel_tunneling() rejects new callbacks if one is already set.
Allow callbacks to be cleared. The use case for this is OpenVPN DCO,
where the socket is opened by userspace and then adopted by the kernel
to run the tunnel. If the DCO interface is removed but userspace does
not close the socket (something the kernel cannot prevent) the installed
callbacks could be called with an invalidated context.

Allow new functions to be set, but only if they're NULL (i.e. allow the
callback functions to be cleared).

Reviewed by:	tuexen
MFC after:	3 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34288
2022-02-16 10:59:04 +01:00
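The rule from the UDP tunnel removal commit above (a set callback may be replaced only by NULL) can be sketched as below. The types and names `tun_cb_model_t`, `udp_pcb_model`, `demo_tunnel_cb`, and `set_tunneling_model` are hypothetical, and the -1 return value is a placeholder rather than the error the kernel function returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tunnel callback type. */
typedef void tun_cb_model_t(void *ctx);

/* Hypothetical per-socket state holding the installed callback. */
struct udp_pcb_model {
	tun_cb_model_t	*tun_func;
	void		*tun_ctx;
};

/* Example callback used to exercise the setter. */
static void
demo_tunnel_cb(void *ctx)
{
	(void)ctx;
}

/*
 * Install or clear the tunnel callback.  Installing over an existing
 * callback is rejected; passing NULL always succeeds and clears it, so
 * a stale context can no longer be reached through the socket.
 */
static int
set_tunneling_model(struct udp_pcb_model *up, tun_cb_model_t *func, void *ctx)
{
	assert(up != NULL);
	if (up->tun_func != NULL && func != NULL)
		return (-1);
	up->tun_func = func;
	up->tun_ctx = ctx;
	return (0);
}
```

In the OpenVPN DCO scenario from the commit message, the interface teardown path would make the clearing call before its context is freed, even if userspace keeps the socket open.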
Richard Scheffenegger
0c2832ee4f tcp: Restore 6 tcps padding entries in HEAD
The padding in CURRENT shall not shrink. It is
designed so that in CURRENT it always stays
the same, and then when a new stable is branched, it
inherits 6 pointer placeholders that can be used
within this stable/X lifetime to extend the structure.

Reviewed By: tuexen, #transport
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34269
2022-02-15 09:24:07 +01:00
Bjoern A. Zeeb
232d323ef2 TCP syncache: enhance KASSERT output
Improve the "syncache: mbuf too small" assertion message with various
variables (some not actually needed) but enough that it will be obvious
if (a) we use IPv4 or IPv6, (b) if UDP tunneling is on, (c) what
max_linkhdr is, and (d) what MHLEN is.

This should help diagnostics in the future.
The case was hit with wireless drivers setting a large ic_headroom
and using IPv6.

Reviewed by:	gallatin, tuexen, rscheff
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D34217
2022-02-14 00:03:20 +00:00
Mark Johnston
b4f60fab5d tcp: Avoid conditionally defined fields in union lro_address
The layout of the structure ends up depending on whether the including
file includes opt_inet.h and opt_inet6.h, so different compilation units
can end up seeing different versions of the structure.  Fix this by
unconditionally defining the address fields.

As a side effect, this eliminates some duplication in the kernel's CTF
type graph.

Reviewed by:	rscheff, tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34242
2022-02-10 15:39:58 -05:00
Richard Scheffenegger
3f169c54ab tcp: Add/update AccECN related statistics and numbers
Reserve counters in the tcps struct in preparation
for AccECN, extend the debugging output for TF2
flags, optimize the syncache flags from individual
bits to a codepoint for the specific ECN handshake.

This is in preparation of AccECN.

No functional change except for extended debug
output capabilities.

Reviewed By: #transport, rrs
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34161
2022-02-10 00:21:31 +01:00
Randall Stewart
cc41c17433 oops my patch lost the removal of the tlp_threshold counter increments
2022-02-09 16:19:22 -05:00
Randall Stewart
8d64b4b4c4 cleanup of rack variables.
During a recent deep dive into all the variables so I could
discover why stack switching caused larger retransmits I examined
every variable in rack. In the process I found quite a few bits
that were not used and needed cleanup. This update pulls
out all the unused pieces from rack. Note there are *no* functional
changes here, just the removal of unused variables and a bit of
spacing clean up.

Reviewed by: Michael Tuexen, Richard Scheffenegger
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D34205
2022-02-09 16:08:32 -05:00
Michael Tuexen
a0aeb1cef5 in_pcb.c: fix compilation of an IPv4 only configuration
While there, remove a duplicate inclusion of sysctl.h.

Reported by:	Gary Jennejohn
Fixes:		a35bdd4489 - main - tcp: add sysctl interface for setting socket options
Sponsored by:	Netflix, Inc.
2022-02-09 19:58:29 +01:00
Michael Tuexen
a35bdd4489 tcp: add sysctl interface for setting socket options
This interface allows setting a socket option on a TCP endpoint,
which is specified by its inp_gencnt. This interface will be
used in an upcoming command line tool tcpsso.

Reviewed by:		glebius, rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D34138
2022-02-09 12:24:41 +01:00
Michael Tuexen
528c764924 tcp: fix compilation when KERN_TLS is not defined
Reported by:	Gary Jennejohn
Fixes:		fd7daa7271 - main - tcp: make tcp_ctloutput_set() non-static
Sponsored by:	Netflix, Inc.
2022-02-09 12:16:43 +01:00