7655 Commits

Author SHA1 Message Date
Mateusz Guzik
c67eb393fa tcp_hpts: plug a compiler warn
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-04-05 14:32:13 +00:00
Gleb Smirnoff
84b42df834 rack: fix build on powerpc
2023-04-04 16:35:36 -07:00
Randall Stewart
030434acaf Update rack to the latest code used at NF.
There have been many changes to rack over the last couple of years, including:
     a) Ability when switching stacks to have one stack query another.
     b) Internal use of micro-second timers instead of ticks.
     c) Many changes to pacing in forms of
        1) Improvements to Dynamic Goodput Pacing (DGP)
        2) Improvements to fixed rate pacing
        3) A new feature called hybrid pacing where the requestor can
           get a combination of DGP and fixed rate pacing with deadlines
           for delivery that can dynamically speed things up.
     d) All kinds of bugs found during extensive testing and use of the
        rack stack for streaming video and in fact all data transferred
        by NF

Reviewed by: glebius, gallatin, tuexen
Sponsored By: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D39402
2023-04-04 16:05:46 -04:00
Gleb Smirnoff
2ff8187efd tcp_hpts: remove dead code tcp_drop_in_pkts()
Should have gone in f971e79139.
2023-04-04 12:55:27 -07:00
Randall Stewart
73ee5756de Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.
So stack switching has always been a bit of an issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
2023-04-01 01:46:38 -04:00
Kristof Provost
28921c4f7d carp: allow commands to use interface name rather than index
Get/set commands can now choose to provide the interface name rather
than the interface index. This allows userspace to avoid a call to
if_nametoindex().

Suggested by:	melifaro
Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D39359
2023-03-31 11:29:58 +02:00
Richard Scheffenegger
f858eb916f tcp: send SACK rescue retransmission also mid-stream
Previously, SACK rescue retransmissions would only happen
on a loss recovery at the tail end of the send buffer.

This extends the mechanism such that partial ACKs without SACK
mid-stream also trigger a rescue retransmission to try to avoid
an otherwise unavoidable retransmission timeout.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D39274
2023-03-28 04:47:01 +02:00
Gleb Smirnoff
78e6c3aacc tcp: update error counter when dropping a packet due to bad source
Use the same counter that ip_input()/ip6_input() use for bad destination
address.  For IPv6 this is already heavily abused ip6s_badscope, which
needs to be split into several separate error counters.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D39234
2023-03-27 18:37:15 -07:00
Kristof Provost
ccff2078af carp: fix source MAC
When we're not in unicast mode we need to change the source MAC address.
The check for this was wrong, because IN_MULTICAST() assumes host
endianness and the address in sc_carpaddr is in network endianness.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-03-28 01:18:18 +02:00
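The byte-order pitfall described in the carp source MAC fix can be sketched in userland C. This is only an illustration, not the carp code: `addr_is_multicast()` is a hypothetical helper, and the only facts relied on are that `IN_MULTICAST()` tests a host-order value and that `struct in_addr` carries the address in network order.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not from the carp sources): report whether an IPv4
 * address stored in network byte order is a multicast address.
 */
static bool
addr_is_multicast(struct in_addr addr)
{
    /* IN_MULTICAST() expects host byte order, so convert first. */
    return (IN_MULTICAST(ntohl(addr.s_addr)));
}
```

Passing `addr.s_addr` to `IN_MULTICAST()` without the `ntohl()` conversion inspects the wrong byte on little-endian machines, which is the kind of mistake the commit message describes.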
Alexander V. Chernikov
19e43c163c netlink: add netlink KPI to the kernel by default
This change does the following:

Base Netlink KPIs (ability to register the family, parse and/or
 write a Netlink message) are always present in the kernel. Specifically,
* Implementation of genetlink family/group registration/removal,
  some base accessors (netlink_generic_kpi.c, 260 LoC) are compiled in
  unconditionally.
* Basic TLV parser functions (netlink_message_parser.c, 507 LoC) are
  compiled in unconditionally.
* Glue functions (netlink<>rtsock), malloc/core sysctl definitions
 (netlink_glue.c, 259 LoC) are compiled in unconditionally.
* The rest of the KPI _functions_ are defined in the netlink_glue.c,
 but their implementation calls a pointer to either the stub function
 or the actual function, depending on whether the module is loaded or not.

This approach allows having only 1k LoC out of ~3.7k LoC (current
 sys/netlink implementation) in the kernel, which will not grow further.
It also allows for the generic netlink kernel customers to load
 successfully without requiring Netlink module and operate correctly
 once Netlink module is loaded.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D39269
2023-03-27 13:55:44 +00:00
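The stub-or-real dispatch described in the netlink KPI commit can be sketched as follows. All names here (`msg_send`, `msg_send_stub`, `msg_send_real`, `msg_send_set_impl`) are hypothetical and chosen for illustration; the actual netlink_glue.c code is not reproduced, and the `ENOTSUP` return from the stub is an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the netlink_glue.c code. */
typedef int (*msg_send_fn)(const void *msg, size_t len);

/* Stub used while the full implementation is not loaded (assumed error). */
static int
msg_send_stub(const void *msg, size_t len)
{
    (void)msg;
    (void)len;
    return (ENOTSUP);
}

/* Stand-in for the function a loadable module would provide. */
static int
msg_send_real(const void *msg, size_t len)
{
    (void)msg;
    (void)len;
    return (0);
}

/* Points at the stub until a module installs the real function. */
static msg_send_fn msg_send_impl = msg_send_stub;

/* Always-present entry point: callers only ever use this symbol. */
static int
msg_send(const void *msg, size_t len)
{
    return (msg_send_impl(msg, len));
}

/* Called on module load (fn != NULL) and on unload (fn == NULL). */
static void
msg_send_set_impl(msg_send_fn fn)
{
    msg_send_impl = (fn != NULL) ? fn : msg_send_stub;
}
```

With this shape a consumer can be compiled and loaded before the module exists; calls fail cleanly through the stub and start working once `msg_send_set_impl(msg_send_real)` has run.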
Andrew Gallatin
abba58766f LRO: Add missing checks for invalid IP addresses
LRO bypasses normal ip_input()/tcp_input() and lacks several checks
that are present in the normal path.  Without these checks, it
is possible to trigger assertions added in b0ccf53f24.

Reviewed by: glebius, rrs
Sponsored by: Netflix
2023-03-25 11:56:02 -04:00
Kristof Provost
511a6d5ed3 carp: use if_name()
Reported by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-03-20 14:37:10 +01:00
Kristof Provost
137818006d carp: support unicast
Allow users to configure the address to send carp messages to. This
allows carp to be used in unicast mode, which is useful in certain
virtual configurations (e.g. AWS, VMWare ESXi, ...)

Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D38940
2023-03-20 14:37:09 +01:00
Kristof Provost
40e0435964 carp: add netlink interface
Allow carp configuration information to be supplied and retrieved via
netlink.

Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D39048
2023-03-20 10:52:27 +01:00
Michael Tuexen
48345048cd sctp: fix typo in assignment
2023-03-18 23:58:50 +01:00
Michael Tuexen
8ed1e2c880 sctp: enforce Karn's rule during the handshake
Don't take RTT measurements on packets containing INIT or COOKIE-ECHO
chunks, when they were retransmitted.

MFC after:	1 week
2023-03-16 17:40:40 +01:00
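The rule enforced by the sctp handshake commit (take no RTT sample from a retransmitted chunk, commonly known as Karn's algorithm) can be sketched like this. The structure and function names are hypothetical and are not the SCTP stack's own; timestamps are assumed to be in microseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one outstanding chunk; not the SCTP structures. */
struct sent_chunk {
    uint64_t sent_usec;     /* time of the first transmission */
    bool retransmitted;     /* set once the chunk is sent again */
};

/*
 * Returns true and stores the sample in *rtt_usec only if the chunk was
 * sent exactly once. After a retransmission the acknowledgement is
 * ambiguous, so no sample is taken.
 */
static bool
rtt_sample(const struct sent_chunk *c, uint64_t ack_usec, uint64_t *rtt_usec)
{
    if (c->retransmitted || ack_usec < c->sent_usec)
        return (false);
    *rtt_usec = ack_usec - c->sent_usec;
    return (true);
}
```

A caller would feed only the accepted samples into its smoothed RTT estimator and leave the estimate untouched when `rtt_sample()` returns false.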
Randall Stewart
69c7c81190 Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities.
The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.

Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831
2023-03-16 11:43:16 -04:00
Zhenlei Huang
49cad3daf2 carp: carp_master_down_locked() requires net epoch
Reviewed by:	kp
Fixes:		1d126e9b94 carp: Widen epoch coverage
MFC after:	1 day
Differential Revision:	https://reviews.freebsd.org/D39113
2023-03-16 18:07:03 +08:00
Michael Tuexen
c91ae48a25 sctp: don't do RTT measurements with cookies
When receiving a cookie, the receiver does not know whether the
peer retransmitted the COOKIE-ECHO chunk or not. Therefore, don't
do an RTT measurement. It might be much too long.
To overcome this limitation, one could do at least two things:
1. Bundle the INIT-ACK chunk with a HEARTBEAT chunk for doing the
   RTT measurement. But this is not allowed.
2. Add a flag to the COOKIE-ECHO chunk, which indicates that it
   is the initial transmission, and not a retransmission. But
   this requires an RFC.

MFC after:	1 week
2023-03-16 10:45:13 +01:00
Michael Tuexen
cee09bda03 sctp: allow disabling of SCTP_ACCEPT_ZERO_CHECKSUM socket option
2023-03-15 22:55:23 +01:00
Michael Tuexen
6026b45aab sctp: improve negotiation of zero checksum feature
Enforce consistency between announcing 0-cksum support and actually
using it in the association. The value from the inp when the
INIT ACK is sent must be used, not the one from the inp when the
cookie is received.
2023-03-15 22:29:52 +01:00
Mina Galić
0b0ae2e4cd jail: convert several functions from int to bool
These functions exclusively return (0) and (1), so convert them to bool.

We also convert some networking-related jail functions from int to bool,
some of which were returning an error that was never used.

Differential Revision: https://reviews.freebsd.org/D29659
Reviewed by: imp, jamie (earlier version)
Pull Request: https://github.com/freebsd/freebsd-src/pull/663
2023-03-14 21:05:33 -06:00
Mark Johnston
aa71d6b4a2 netinet: Disallow unspecified addresses in ICMP-embedded packets
Reported by:	glebius
Reported by:	syzbot+981c528ccb5c5534dffc@syzkaller.appspotmail.com
Reviewed by:	tuexen, glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D38936
2023-03-13 10:45:56 -04:00
Michael Tuexen
4a2b92d99f sctp: initial implementation of draft-tuexen-tsvwg-sctp-zero-checksum
2023-03-10 01:45:46 +01:00
Mark Johnston
713264f6b8 netinet: Tighten checks for unspecified source addresses
The assertions added in commit b0ccf53f24 ("inpcb: Assert against
wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol
layers may pass the unspecified address to in_pcblookup().

Add some checks to filter out such packets before we attempt an inpcb
lookup:
- Disallow the use of an unspecified source address in in_pcbladdr() and
  in6_pcbladdr().
- Disallow IP packets with an unspecified destination address.
- Disallow TCP packets with an unspecified source address, and add an
  assertion to verify the comment claiming that the case of an
  unspecified destination address is handled by the IP layer.

Reported by:	syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com
Reported by:	syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com
Reported by:	syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com
Reviewed by:	glebius, melifaro
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38570
2023-03-06 15:06:00 -05:00
Fidaullah Noonari
290f7f4a09 in_mcast.c: change multicast not member condition
If there is no source filter entry => block if that's SSM ("exclude"
mode per RFC 3678 clause 3).  If there is an entry => check its action &
block if the action is "exclude".

It would be nice if the test case in this PR were converted into an ATF
test case, but not blocking on that.

Reviewed by: imp, melifaro
Pull Request: https://github.com/freebsd/freebsd-src/pull/601
2023-03-03 22:25:17 -07:00
Gleb Smirnoff
7fc82fd1f8 ipfw: garbage collect ip_fw_chk_ptr
It is a relict left from the old times when ipfw(4) was hooked
into IP stack directly, without pfil(9).
2023-03-03 10:30:15 -08:00
Mark Johnston
317fa5169d netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option
It has no effect, and an exp-run revealed that it is not in use.

PR:		261398 (exp-run)
Reviewed by:	mjg, glebius
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38822
2023-02-28 15:57:21 -05:00
Richard Scheffenegger
399a5655e6 tcp: Make TCP PCAP buffer properly configurable.
Reviewed By:		tuexen, cc, #transport
MFC after:		3 days
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D38824
2023-02-28 20:12:11 +01:00
Mark Johnston
3aff4ccdd7 netinet: Remove IP(V6)_BINDMULTI
This option was added in commit 0a100a6f1e but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.

The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree.  Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient.  So, let's remove it.

PR:		261398 (exp-run)
Reviewed by:	glebius
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38574
2023-02-27 10:03:11 -05:00
Alfonso
2f201df1f8 Change hw_tls to a bool
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/512
2023-02-25 09:59:11 -07:00
Mateusz Guzik
3a01a97d23 mroute: partially sanitize the file
There is rampant inconsistent formatting all around, make it mostly
style(9)-conformant.

While here:
- drop malloc casts
- rename a rw lock from mroute_mtx to mroute_lock
- replace NOTREACHED comment with __assert_unreachable

Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D38652
2023-02-23 13:35:44 +00:00
Michael Tuexen
453aa7fac9 tcp: ensure the tcpcb is not NULL when logging an event
When calling tcp_bblog_pru() on some error paths, tp is NULL,
therefore handle it.

Sponsored by:	Netflix, Inc.
2023-02-23 02:04:17 +01:00
Michael Tuexen
624de4eca5 tcp: remove unused function prototype
tcp_trace was implemented in tcp_debug.c, which was removed recently.

Reviewed by:		rscheff@, zlei@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D38712
2023-02-22 13:28:17 +01:00
Michael Tuexen
76578d601e bblog: improve timeout event handling
Extend the BBLog RTO event to deal with all timers of the base
stack. Also provide information about starting, stopping, and
running off. The expiration of the retransmission timer is
reported as it was done before.

Reviewed by:		rscheff@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D38710
2023-02-21 22:46:15 +01:00
Michael Tuexen
6b802933f1 tcp: rearrange enum and remove unused variable
Rearrange the enum tt_which such that TT_REXMIT is 0. This allows
an extension of the BBLog event RTO in a backwards compatible way.
Remove tcptimers, which was only used in trpt, a utility removed
from the source tree recently.

Reviewed by:		glebius@, guest-ccui@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D38547
2023-02-21 18:26:49 +01:00
Michael Tuexen
4065becf3f bblog: unbreak build
Ensure that tp is always declared and set.

Reported by:	Michael Butler
Sponsored by:	Netflix, Inc.
2023-02-21 18:16:59 +01:00
Michael Tuexen
00812bbda2 bblog: add logging of protocol user requests
This information was available in trpt and is useful. So provide
a way to get this information via TCP BBLog.

Reviewed by:		rscheff@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D38701
2023-02-21 12:07:35 +01:00
Michael Tuexen
b16a37eda8 bblog: sync tcp_log_events with Netflix tree
This allows the addition of entries to tcp_log_events without
causing conflicts in the Netflix tree.
rrs@ will upstream the related functional changes eventually.

Reviewed by:		guest-ccui@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D38646
2023-02-20 21:42:57 +01:00
John Baldwin
cda6bdbaa1 tcp: Don't try to disconnect a socket multiple times.
When the checks for INP_TIMEWAIT were removed, tcp_usr_close() and
tcp_usr_disconnect() were no longer prevented from calling
tcp_disconnect() on a socket that was already disconnected.  This
triggered a panic in cxgbe(4) for TOE where the tcp_disconnect() on an
already-disconnected socket invoked tcp_output() on a socket that was
already in time-wait.

Reviewed by:	rrs, np
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D37112
2023-02-17 09:13:53 -08:00
Gleb Smirnoff
96871af013 inpcb: use family specific sockaddr argument for bind functions
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_bind method and from there on go down the call
stack with family specific argument.

Reviewed by:		zlei, melifaro, markj
Differential Revision:	https://reviews.freebsd.org/D38601
2023-02-15 10:30:16 -08:00
Gleb Smirnoff
caf32b260a pfil: add pfil_mem_{in,out}() and retire pfil_run_hooks()
Commit 0b70e3e78b changed the original design of a single entry point
into pfil(9) chains providing separate functions for the filtering
points that always provide mbufs and know the direction of a flow.
The motivation was to reduce branching.  The logical continuation
would be to do the same for the filtering points that always provide
a memory pointer and retire the single entry point.

o Hooks now provide two functions: one for mbufs and an optional one for
  memory pointers.
o pfil_hook_args() has a new member and pfil_add_hook() has a
  requirement to zero out uninitialized data. Bump PFIL_VERSION.
o As it was before, a hook function for a memory pointer may realloc
  into an mbuf.  Such mbuf would be returned via a pointer that must
  be provided in argument.
o The only hook that supports memory pointers is ipfw:default-link.
  It is rewritten to provide two functions.
o All remaining uses of pfil_run_hooks() are converted to
  pfil_mem_in().
o Transparent union of pfil_packet_t and tricks to fix pointer
  alignment are retired. Internal pfil_realloc() reduces down to
  m_devget() and thus is retired, too.

Reviewed by:		mjg, ocochard
Differential revision:	https://reviews.freebsd.org/D37977
2023-02-14 10:02:49 -08:00
Gleb Smirnoff
a22561501f net: use pfil_mbuf_{in,out} where we always have an mbuf
This finalizes what has been started in 0b70e3e78b.

Reviewed by:		kp, mjg
Differential revision:	https://reviews.freebsd.org/D37976
2023-02-14 10:02:49 -08:00
Mark Johnston
636b19ead4 tcp: Disallow re-connection of a connected socket
soconnectat() tries to ensure that one cannot connect a connected
socket.  However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.

Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().

Reported by:	syzkaller
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38507
2023-02-14 10:07:19 -05:00
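The fix for re-connection of a connected socket follows a common pattern: repeat the state check while holding the lock that also covers the state change. A minimal userland sketch, assuming a pthread mutex in place of the inpcb lock and hypothetical `fake_sock`/`fake_connect` names (the `EISCONN` return value is also an assumption for illustration):

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical connection states and socket; not the kernel structures. */
enum conn_state { CONN_IDLE, CONN_CONNECTING, CONN_CONNECTED };

struct fake_sock {
    pthread_mutex_t lock;   /* stands in for the inpcb lock */
    enum conn_state state;
};

/*
 * The check and the transition happen under the same lock, so two
 * threads cannot both observe CONN_IDLE and both start connecting.
 */
static int
fake_connect(struct fake_sock *so)
{
    int error = 0;

    pthread_mutex_lock(&so->lock);
    if (so->state != CONN_IDLE)
        error = EISCONN;
    else
        so->state = CONN_CONNECTING;
    pthread_mutex_unlock(&so->lock);
    return (error);
}
```

An earlier unlocked check can still be kept as a cheap fast path, but only the locked check is relied on for correctness.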
Mark Johnston
c7ea65ec69 inpcb: refcount_release() returns a bool
No functional change intended.

MFC after:	1 week
Sponsored by:	Klara, Inc.
2023-02-13 16:35:47 -05:00
Mark Johnston
775da7f8a9 tcp: Remove a redundant net_epoch entry in tcp6_connect()
tcp6_connect() is always called in a net_epoch read section.

Fixes:		3d76be28ec ("netinet6: require network epoch for in6_pcbconnect()")
Reviewed by:	tuexen, glebius
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38506
2023-02-13 16:35:47 -05:00
Mateusz Guzik
937b00ac0d tcp: add missing void keyword to tcp_stats_init
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-13 18:38:04 +00:00
Mateusz Guzik
e4542107d8 sctp: ansify
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-13 18:17:10 +00:00
Mark Johnston
4130ea611f inpcb: Split in_pcblookup_hash_locked() and clean up a bit
Split the in_pcblookup_hash_locked() function into several independent
subroutine calls, each of which does some kind of hash table lookup.
This refactoring makes it easier to introduce variants of the lookup
algorithm that behave differently depending on whether they are
synchronized by SMR or the PCB database hash lock.

While here, do some related cleanup:
- Remove an unused ifnet parameter from internal functions.  Keep it in
  external functions so that it can be used in the future to derive a v6
  scopeid.
- Reorder the parameters to in_pcblookup_lbgroup() to be consistent with
  the other lookup functions.
- Remove an always-true check from in_pcblookup_lbgroup(): we can assume
  that we're performing a wildcard match.

No functional change intended.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D38364
2023-02-09 16:15:03 -05:00
Andrew Gallatin
c0e4090e3d ktls: Accurately track if ifnet ktls is enabled
This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLS, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS.  This flag meant that
now, or in the past, ifnet ktls was active on a socket.  Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection.  Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcpcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380
2023-02-09 12:44:44 -05:00