7232 Commits

Author SHA1 Message Date
Andrew Gallatin
8a7404b2ae tcp: fix leaks in tcp_chg_pacing_rate error paths
tcp_chg_pacing_rate() is expected to release the hw rate limit table,
but failed to do so in several error cases, leading to ever
increasing counts of flows using the rate.

This patch was mostly done by rrs

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34058
Reviewed by: hselasky, rrs, jhb (initial version, outside of Differential)
2022-01-27 10:35:03 -05:00
Andrew Gallatin
9ba117960e Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and led to leaked mbufs.

This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When lots of NIC output drops triggered
ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due to ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.

Many thanks to jhb, rrs, hselasky and markj for their help in
debugging this.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks
2022-01-27 10:34:34 -05:00
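The ownership contract described in the commit above can be illustrated with a minimal user-space sketch. Everything here is hypothetical: `struct buf`, `buf_alloc()`, `buf_free()` and `send_wrapper()` are stand-ins invented for illustration, not the kernel's mbuf or ip_output_send() code. The point is only that a function which takes ownership of a buffer must consume it on every return path, including the EAGAIN path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an mbuf; not the kernel structure. */
struct buf {
	int len;
};

static int live_bufs;	/* number of buffers allocated and not yet freed */

static struct buf *
buf_alloc(int len)
{
	struct buf *b = malloc(sizeof(*b));

	assert(b != NULL);
	b->len = len;
	live_bufs++;
	return (b);
}

static void
buf_free(struct buf *b)
{
	free(b);
	live_bufs--;
}

/*
 * Hypothetical send wrapper.  Like the if_output() contract described
 * above, it owns the buffer once called: every return path, including
 * the error path, must consume the buffer.
 */
static int
send_wrapper(struct buf *b, int tag_is_stale)
{
	if (tag_is_stale) {
		buf_free(b);	/* omitting this free is the kind of leak described */
		return (EAGAIN);
	}
	buf_free(b);		/* stands in for handing the buffer to the driver */
	return (0);
}
```

A caller never touches the buffer after the call, whatever the return value, so the live-buffer count returns to zero on both the success and the EAGAIN path.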
Gordon Bergling
9e58cca3e8 extra_tcp_stacks: Fix two typos in source code comments
- s/differnt/different/

MFC after:	3 days
2022-01-26 18:02:55 +01:00
Gordon Bergling
b3df222eae extra_tcp_stacks: Fix a few common typos
TCP_BBR:
- Fix a typo introduced in 1b90dfa5d2, which was reported by tuexen@

TCP_RACK:
- Correct two sysctl descriptions: s/corret/correct/

tcp_bbr(4): Also fix s/measurment/measurement/ in the man page

MFC after:	1 week
2022-01-26 10:35:17 +01:00
Wojciech Macek
0daa28057c ip_mroute: add unlock in early-exit
Add missing unlock if V_ip_mrouter is not set

Obtained from:		Semihalf
2022-01-22 14:48:47 +01:00
Wojciech Macek
889c60500d ip_mroute: release epoch lock if mrouter is not configured
Add missing "else" branch to release a lock if mrouter is not
configured.

Obtained from:		Semihalf
Sponsored by:		Stormshield
2022-01-22 11:48:30 +01:00
Wojciech Macek
9ce46cbc95 ip_mroute: move ip_mrouter_done outside lock
X_ip_mrouter_done might sleep, which triggers INVARIANTS to
print additional errors on the screen.
Move it outside the lock, but provide some basic synchronization
to avoid a race condition during module uninit/unload.

Obtained from:		Semihalf
Sponsored by:		Stormshield
2022-01-21 06:17:19 +01:00
Wojciech Macek
58630bdd13 Revert "ip_mroute: do not call epoch_wait when lock is taken"
This reverts commit 2e72208b6c.
2022-01-21 06:17:19 +01:00
Randall Stewart
aac52f94ea tcp: Warning cleanup from new compiler.
The clang compiler recently got an update that generates warnings of unused
variables where they were set, and then never used. This revision goes through
the tcp stack and cleans all of those up.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision:
2022-01-18 07:41:18 -05:00
Marko Zec
e7abe200c2 fib_algo: shift / mask by constants in dxr_lookup()
Since trie configuration remains invariant during each DXR instance
lifetime, instead of shifting and masking lookup keys by values
computed at runtime, compile upfront several dxr_lookup()
configurations with hardcoded shift / mask constants, and choose the
appropriate lookup function version after each DXR instance rebuild.

In synthetic tests this yields a small but measurable (5-10%) lookup
throughput improvement, depending on FIB size and prefix patterns.

MFC after:	3 days
2022-01-17 00:13:47 +01:00
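The specialization technique described in the commit above can be sketched as follows. This is an illustrative toy, not the dxr_lookup() code: the single-level table layout, the `LOOKUP_DEFINE` macro, and the names `lookup_generic()`, `lookup_8()`, `lookup_16()` and `lookup_select()` are all assumptions made for the example. It shows one variant with a run-time shift and several variants generated with a compile-time constant shift, with the matching variant chosen once per rebuild.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*lookup_fn)(const uint32_t *table, uint32_t key);

/* Before: the shift amount is a run-time value read on every lookup. */
static uint32_t
lookup_generic(const uint32_t *table, uint32_t key, int bits)
{
	assert(bits > 0 && bits < 32);
	return (table[key >> (32 - bits)]);
}

/* After: generate one lookup variant per width with a constant shift. */
#define	LOOKUP_DEFINE(bits)					\
static uint32_t							\
lookup_##bits(const uint32_t *table, uint32_t key)		\
{								\
	return (table[key >> (32 - (bits))]);			\
}

LOOKUP_DEFINE(8)
LOOKUP_DEFINE(16)

/* Pick the variant once, after a (hypothetical) table rebuild. */
static lookup_fn
lookup_select(int bits)
{
	switch (bits) {
	case 8:
		return (lookup_8);
	case 16:
		return (lookup_16);
	default:
		return (NULL);	/* no precompiled variant for this width */
	}
}
```

The selected function pointer is then used for every lookup until the next rebuild, so the per-lookup path contains only constant shifts.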
Gleb Smirnoff
1d41a49404 tcp_usr_connect: report actual error code when stack requests drop
2022-01-13 10:32:41 -08:00
Ryan Stone
3284f4925f LRO: Don't merge ACK and non-ACK packets together
LRO was willing to merge ACK and non-ACK packets together.  This
can cause incorrect th_ack values to be reported up the stack.
While non-ACKs are quite unlikely to appear in practice, LRO's
behaviour is against the spec.  Make LRO unwilling to merge
packets with different TH_ACK flag values in order to fix the
issue.

Found by: Sysunit test
Differential Revision:	https://reviews.freebsd.org/D33775
Reviewed by: rrs
2022-01-13 11:17:58 -05:00
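The merge rule from the commit above can be expressed as a small predicate. This is a sketch under stated assumptions: `struct pkt` and `can_merge()` are hypothetical names, and only the TH_ACK comparison is modelled, none of LRO's other merge conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	TH_PUSH	0x08	/* PSH bit in the TCP flags byte */
#define	TH_ACK	0x10	/* ACK bit in the TCP flags byte */

/* Hypothetical, minimal view of a segment: just its TCP flags. */
struct pkt {
	uint8_t th_flags;
};

/*
 * Two segments may only be merged when their TH_ACK bits agree.
 * Other flags are deliberately ignored in this sketch.
 */
static bool
can_merge(const struct pkt *queued, const struct pkt *incoming)
{
	assert(queued != NULL && incoming != NULL);
	return (((queued->th_flags ^ incoming->th_flags) & TH_ACK) == 0);
}
```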
Ryan Stone
24fe6643da LRO: Fix lost packets when merging 1 payload with an ACK
To check if it needed to regenerate a packet's header before
sending it up the stack, LRO was checking if more than one payload
had been merged into the packet.  This failed in the case where
a single payload was merged with one or more pure ACKs.  This
resulted in lost ACKs.

Fix this by precisely tracking whether header regeneration is
required instead of using an incorrect heuristic.

Found with: Sysunit test
Differential Revision:	https://reviews.freebsd.org/D33774
Reviewed by: rrs
2022-01-13 11:17:48 -05:00
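The difference between the old heuristic and explicit tracking, as described in the commit above, can be sketched like this. The `struct agg` type and the `agg_*()` and `regen_*()` functions are hypothetical and exist only for this example; they are not the LRO data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical aggregation state for one flow. */
struct agg {
	uint32_t payload_segs;	/* segments with payload merged so far */
	bool needs_regen;	/* set whenever anything was merged */
};

/* Start aggregation with the first segment; nothing merged yet. */
static void
agg_start(struct agg *a, uint32_t payload_len)
{
	assert(a != NULL);
	a->payload_segs = payload_len > 0 ? 1 : 0;
	a->needs_regen = false;
}

/* Merge another segment; a pure ACK has payload_len == 0. */
static void
agg_merge(struct agg *a, uint32_t payload_len)
{
	assert(a != NULL);
	if (payload_len > 0)
		a->payload_segs++;
	a->needs_regen = true;
}

/* Heuristic: regenerate only if more than one payload was merged. */
static bool
regen_heuristic(const struct agg *a)
{
	return (a->payload_segs > 1);
}

/* Explicit tracking: regenerate whenever a merge happened. */
static bool
regen_tracked(const struct agg *a)
{
	return (a->needs_regen);
}
```

With one payload followed by a pure ACK, the heuristic reports that no regeneration is needed while the tracked flag reports that it is, which is the case the commit message describes.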
Wojciech Macek
776c34f646 ip_mroute: remove unused variables
Sponsored by:	Stormshield
Obtained from:	Semihalf
2022-01-11 13:06:22 +01:00
Wojciech Macek
2e72208b6c ip_mroute: do not call epoch_wait when lock is taken
mrouter_done is called with RAW IP lock taken. Some annoying
printfs are visible on the console if INVARIANTS option is enabled.

Provide an atomic-based mechanism which counts entries to and exits from
the critical section in ip_input and ip_output.
Before de-initialization of function pointers ensure (with busy-wait)
that mrouter de-initialization is visible to all readers and that we don't
remove pointers (like ip_mforward etc.) in the middle of packet processing.
2022-01-11 11:19:32 +01:00
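A counter-based scheme of the kind described in the commit above might look roughly like the following user-space sketch using C11 atomics. It is not the ip_mroute code: the names `section_enter()`, `section_exit()`, `teardown_begin()`, `teardown_may_proceed()` and `teardown_wait()` are hypothetical, and default sequentially consistent ordering is assumed throughout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int readers;		/* threads inside the critical section */
static atomic_bool enabled = true;	/* cleared when teardown begins */

/* Enter the section; fails once teardown has begun. */
static bool
section_enter(void)
{
	atomic_fetch_add(&readers, 1);
	if (!atomic_load(&enabled)) {
		/* Teardown in progress: back out without using the pointers. */
		atomic_fetch_sub(&readers, 1);
		return (false);
	}
	return (true);
}

static void
section_exit(void)
{
	int prev = atomic_fetch_sub(&readers, 1);

	assert(prev > 0);
	(void)prev;
}

/* Make teardown visible to all new readers. */
static void
teardown_begin(void)
{
	atomic_store(&enabled, false);
}

/* True once no reader is inside the section. */
static bool
teardown_may_proceed(void)
{
	return (atomic_load(&readers) == 0);
}

/* Busy-wait until it is safe to remove the function pointers. */
static void
teardown_wait(void)
{
	teardown_begin();
	while (!teardown_may_proceed())
		;	/* spin */
}
```

Because a reader increments the counter before checking the flag, and teardown clears the flag before reading the counter, a reader that saw the flag still set is always counted by the time teardown inspects the counter.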
Wojciech Macek
68f28dd1cc ip_mroute: do not sleep when lock is taken
Kthread initialization calls uma_alloc which can sleep.
Modify the code to use deferred work instead.
2022-01-11 11:19:32 +01:00
Robert Wing
eb18708ec8 syncache: accept packet with no SA when TCP_MD5SIG is set
When TCP_MD5SIG is set on a socket, all packets are dropped that don't
contain an MD5 signature. Relax this behavior to accept a non-signed
packet when a security association doesn't exist with the peer.

This is useful when a listen socket set with TCP_MD5SIG wants to handle
connections protected with and without MD5 signatures.

Reviewed by:	bz (previous version)
Sponsored by:   nepustil.net
Sponsored by:   Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D33227
2022-01-08 16:32:14 -09:00
Michael Tuexen
f87818eacf sctp: minor change due to upstreaming
2022-01-03 23:03:06 +01:00
Gleb Smirnoff
afad340a14 inpcb: garbage collect INP_LOCK_INIT(), used only once in sctp
Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D33543
2022-01-03 10:20:30 -08:00
Gleb Smirnoff
fec8a8c7cb inpcb: use global UMA zones for protocols
Provide structure inpcbstorage, that holds zones and lock names for
a protocol.  Initialize it with global protocol init using macro
INPCBSTORAGE_DEFINE().  Then, at VNET protocol init supply it as
the main argument to the in_pcbinfo_init().  Each VNET pcbinfo uses
its private hash, but they all use same zone to allocate and SMR
section to synchronize.

Note: there is kern.ipc.maxsockets sysctl, which controls UMA limit
on the socket zone, which was always global.  Historically same
maxsockets value is applied also to every PCB zone.  Important fact:
you can't create a pcb without a socket!  A pcb may outlive its socket,
however.  Given that there are multiple protocols, and only one socket
zone, the per pcb zone limits seem to have little value.  Under very
special conditions it may trigger a little bit earlier than socket zone
limit, but in most setups the socket zone limit will be triggered
earlier.  When VIMAGE was added to the kernel, PCB zones became per-VNET.
This magnified the existing imbalance further: now we have multiple pcb
zones in multiple vnets limited to maxsockets, but every pcb requires a
socket allocated from the global zone, also limited by maxsockets.
IMHO, this per pcb zone limit doesn't bring any value, so this patch
drops it.  If anybody explains the value of this limit, it can be restored
very easily - just a 2-line change to in_pcbstorage_init().

Differential revision:	https://reviews.freebsd.org/D33542
2022-01-03 10:17:46 -08:00
Gleb Smirnoff
644ca0846d domains: make domain_init() initialize only global state
Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().

Differential revision:	https://reviews.freebsd.org/D33541
2022-01-03 10:15:22 -08:00
Gleb Smirnoff
89128ff3e4 protocols: init with standard SYSINIT(9) or VNET_SYSINIT
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init achieves several things:
o Get rid of ifdef's that protect against double foo_init() when
  both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33537
2022-01-03 10:15:21 -08:00
Kristof Provost
80871aeb0f udp_var.h: other headers already include types.h
Pointed out by:	imp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-01-03 18:35:02 +01:00
Kristof Provost
aa70361d86 headers: make a few more headers self-contained
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-01-03 10:12:30 +01:00
Gordon Bergling
1b90dfa5d2 tcp_bbr(4): Fix a few typos in sysctl descriptions
- s/measurment/measurement/

MFC after:	3 days
2022-01-02 18:03:10 +01:00
Michael Tuexen
502d5e8500 sctp: improve counting of incoming chunks
MFC after:	3 days
2022-01-01 20:59:47 +01:00
Michael Tuexen
4760956e9a udp: use appropriate pcbinfo when signalling EHOSTDOWN
MFC after:	3 days
Sponsored by:	Netflix, Inc.
2022-01-01 19:17:17 +01:00
Michael Tuexen
430df2abee in_pcb: improve inp_next()
If there is no inp to check, exit the loop iterating through them.

Reported by:	syzbot+403406a9cbf082b36ea4@syzkaller.appspotmail.com
Reviewed by:	glebius
Sponsored by:	Netflix, Inc.
2022-01-01 19:04:10 +01:00
Michael Tuexen
1adb91e521 sctp: retire sctp_mtu_size_reset()
Thanks to Timo Voelker for making me aware that sctp_mtu_size_reset()
is very similar to sctp_pathmtu_adjustment().

MFC after:	3 days
2021-12-30 15:30:11 +01:00
Michael Tuexen
2de2ae331b sctp: improve sctp_pathmtu_adjustment()
Allow the resending of DATA chunks to be controlled by the caller,
which allows retiring sctp_mtu_size_reset() in a separate commit.
Also improve the computation of the overhead and use 32-bit integers
consistently.
Thanks to Timo Voelker for pointing me to the code.

MFC after:	3 days
2021-12-30 15:16:05 +01:00
Alexander V. Chernikov
ff3a85d324 [lltable] Add per-family lltable getters.
Introduce a new function, lltable_get(), to retrieve lltable pointer
 for the specified interface and family.
Use it to avoid all-iftable list traversal when adding or deleting
 ARP/ND records.

Differential Revision: https://reviews.freebsd.org/D33660
MFC after:	2 weeks
2021-12-29 20:57:15 +00:00
Gleb Smirnoff
4287aa5619 tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags
While here, move one more erroneous condition out of the epoch and
common return.  The only functional change is that if we send control
on a shut down socket we would get EINVAL instead of ECONNRESET.

Reviewed by:	tuexen
Reported by:	syzbot+8388cf7f401a7b6bece6@syzkaller.appspotmail.com
Fixes:		f64dc2ab5b
2021-12-28 08:50:02 -08:00
Michael Tuexen
a7ba00a438 sctp: minor improvements in sctp_get_frag_point
MFC after:	3 days
2021-12-28 10:23:31 +01:00
Michael Tuexen
ca0dd19f09 sctp: check that the computed frag point is a multiple of 4
Reported by:	syzbot+5da189fc1fe80b31f5bd@syzkaller.appspotmail.com
MFC after:	3 days
2021-12-28 09:40:52 +01:00
Gleb Smirnoff
0af4ce4547 tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags
Fixes:	f64dc2ab5b
2021-12-27 16:58:09 -08:00
Michael Tuexen
989453da05 sctp: cleanup the SCTP_MAXSEG socket option.
This patch makes the handling of the SCTP_MAXSEG socket option
compliant with RFC 6458 (SCTP socket API) and fixes an issue
found by syzkaller.

Reported by:	syzbot+a2791b89ab99121e3333@syzkaller.appspotmail.com
MFC after:	3 days
2021-12-27 23:40:31 +01:00
Gleb Smirnoff
37a7f55737 tcp_usr_rcvd: don't cast inp_ppcb to tcpcb before checking inp_flags
Fixes:	f64dc2ab5b
2021-12-27 10:41:51 -08:00
Michael Tuexen
34ae6a1a44 sctp: cleanup, no functional change intended.
MFC after:	3 days
2021-12-27 18:28:44 +01:00
Michael Tuexen
a859e9f9aa sctp: apply limit for socket buffers as indicated in comment
MFC after:	3 days
2021-12-27 18:15:29 +01:00
Gleb Smirnoff
a057769205 in_pcb: use jenkins hash over the entire IPv6 (or IPv4) address
The intent is to provide more entropy than can be provided
by just the 32-bits of the IPv6 address which overlaps with
6to4 tunnels.  This is needed to mitigate potential algorithmic
complexity attacks from attackers who can control large
numbers of IPv6 addresses.

Together with:		gallatin
Reviewed by:		dwmalone, rscheff
Differential revision:	https://reviews.freebsd.org/D33254
2021-12-26 10:47:28 -08:00
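The motivation in the commit above, hashing every byte of the address rather than a 32-bit slice, can be illustrated with a short sketch. Note the substitutions: this uses Bob Jenkins' one-at-a-time hash because it is short enough to show in full, which is not necessarily the Jenkins variant the kernel uses, and `hash_low32_only()` is a hypothetical stand-in for a hash that looks at only four bytes of the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bob Jenkins' one-at-a-time hash over an arbitrary byte string. */
static uint32_t
oaat_hash(const uint8_t *key, size_t len)
{
	uint32_t h = 0;

	assert(key != NULL);
	for (size_t i = 0; i < len; i++) {
		h += key[i];
		h += h << 10;
		h ^= h >> 6;
	}
	h += h << 3;
	h ^= h >> 11;
	h += h << 15;
	return (h);
}

/* Hypothetical weak hash: looks only at the last four address bytes. */
static uint32_t
hash_low32_only(const uint8_t addr[16])
{
	uint32_t w;

	memcpy(&w, addr + 12, sizeof(w));
	return (w);
}

/* Hash over all 16 bytes of an IPv6 address. */
static uint32_t
hash_full(const uint8_t addr[16])
{
	return (oaat_hash(addr, 16));
}
```

Two addresses that differ only outside the four bytes seen by the weak hash always collide under it, so an attacker who controls many such addresses can fill one hash chain. Hashing the full address removes that trivial collision.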
Gleb Smirnoff
eb8dcdeac2 jail: network epoch protection for IP address lists
Now struct prison has two pointers (IPv4 and IPv6) of struct
prison_ip type.  Each points into epoch context, address count
and variable size array of addresses.  These structures are
freed with network epoch deferred free and are not edited in
place, instead a new structure is allocated and set.

While here, the change also generalizes a lot (but not enough)
of IPv4 and IPv6 processing. E.g. address family agnostic helpers
for kern_jail_set() are provided, that reduce v4-v6 copy-paste.

The fast-path prison_check_ip[46]_locked() is also generalized
into prison_ip_check() that can be executed with network epoch
protection only.

Reviewed by:		jamie
Differential revision:	https://reviews.freebsd.org/D33339
2021-12-26 10:45:50 -08:00
Gleb Smirnoff
a370832bec tcp: remove delayed drop KPI
No longer needed after tcp_output() can ask caller to drop.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D33371
2021-12-26 08:48:24 -08:00
Gleb Smirnoff
f64dc2ab5b tcp: TCP output method can request tcp_drop
The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it.  The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.

Provide KPI extension to satisfy demands of advanced stacks.  If the
output method returns negative error code, it means that caller must
call tcp_drop().

In tcp_var.h provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
  default stack can continue using it internally without modifications.
  For advanced stacks it would perform tcp_drop() and unlock and report
  that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
  it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
  to drop on the caller.

Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D33370
2021-12-26 08:48:19 -08:00
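The negative-return convention and the three wrappers described in the commit above can be modelled in user space roughly as follows. All names here are hypothetical (`struct conn`, `conn_drop()`, `conn_output()`, `conn_output_unlock()`, `conn_output_nodrop()`, and the two toy stack methods); the lock is reduced to a flag, and the sketch follows the commit message's wording rather than the actual tcp_var.h code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical connection: a lock flag, a drop flag, an output method. */
struct conn {
	int locked;
	int dropped;
	int (*output)(struct conn *);
};

/* Toy stack methods: one succeeds, one asks the caller to drop. */
static int
stack_ok(struct conn *c)
{
	(void)c;
	return (0);
}

static int
stack_wants_drop(struct conn *c)
{
	(void)c;
	return (-ECONNRESET);	/* negative: the caller must drop */
}

/* Dropping a connection also releases its lock in this model. */
static void
conn_drop(struct conn *c)
{
	assert(c->locked);
	c->dropped = 1;
	c->locked = 0;
}

/* Just call the method; dropping is left to the caller. */
static int
conn_output_nodrop(struct conn *c)
{
	return (c->output(c));
}

/* Drop-in variant: drops on a negative code and reports it as negative. */
static int
conn_output(struct conn *c)
{
	int rv = c->output(c);

	if (rv < 0)
		conn_drop(c);
	return (rv);
}

/* Always unlocks and always returns a non-negative error code. */
static int
conn_output_unlock(struct conn *c)
{
	int rv = c->output(c);

	if (rv < 0) {
		conn_drop(c);
		rv = -rv;
	} else
		c->locked = 0;
	return (rv);
}
```

A caller of the drop-in variant must treat a negative return as "the connection is gone and unlocked", while a caller of the unlocking variant can forward the positive error code directly.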
Gleb Smirnoff
dbbcc777de rack: rack_do_compressed_ack_processing() can call tcp_drop()
Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D33369
2021-12-26 08:48:15 -08:00
Gleb Smirnoff
66aeb0b53b rack: drop connection synchronously, when we can
For all functions that are leaves of tcp_input() call
ctf_do_dropwithreset_conn() instead of ctf_do_dropwithreset(), because
we always have tp and we want it to be dropped.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D33368
2021-12-26 08:48:10 -08:00
Gleb Smirnoff
17ac6b1c14 bbr: drop packet synchronously in ctf_do_dropwithreset_conn()
This function is always called from tcp_do_segment() method, that
can drop tcpcb and return unlocked.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D33367
2021-12-26 08:48:06 -08:00
Gleb Smirnoff
40fa3e40b5 tcp: mechanically substitute call to tfb_tcp_output to new method.
Made with sed(1) execution:

sed -Ef sed -i "" $(grep --exclude tcp_var.h -lr tcp_output sys/)

sed:
s/tp->t_fb->tfb_tcp_output\(tp\)/tcp_output(tp)/
s/to tfb_tcp_output\(\)/to tcp_output()/

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D33366
2021-12-26 08:47:59 -08:00
Gleb Smirnoff
5b08b46a6d tcp: welcome back tcp_output() as the right way to run output on tcpcb.
Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D33365
2021-12-26 08:47:42 -08:00
Bjoern A. Zeeb
f389439f50 IPv4: fix redirect sending conditions
RFC792,1009,1122 state the original conditions for sending a redirect.
RFC1812 further refines these.
ip_forward() still specifies the checks originally implemented for these
(we do slightly more/different than suggested as makes sense).
The implementation added in 8ad114c082
to ip_tryforward() however is flawed and may send "multi-hop"
redirects (to a host not on the directly connected network).

Do proper checks in ip_tryforward() to stop us from sending redirects
in situations we may not.  Keep as much logic out of ip_tryforward()
and in ip_redir_alloc() and only do the mbuf copy once we are sure we
will send a redirect.

While here enhance and fix comments as to which conditions are handled
for sending redirects in various places.

Reported by:		pi (on net@ 2021-12-04)
MFC after:		3 days
Sponsored by:		Dr.-Ing. Nepustil & Co. GmbH
Reviewed by:		cy, others (earlier versions)
Differential Revision:	https://reviews.freebsd.org/D33274
2021-12-26 15:33:48 +00:00
Alexander V. Chernikov
c2c8e360d8 tcp: virtualise net.inet.tcp.msl sysctl.
VNET teardown waits 2*MSL (60 seconds by default) before expiring
 tcp PCBs. These PCBs hold references to nexthops, which, in turn,
 reference ifnets. This chain results in VNET interfaces being destroyed
 and moved to default VNET only after 60 seconds.
Allow tcp_msl to be set in jail by virtualising net.inet.tcp.msl sysctl,
 permitting more predictable VNET test outcomes.

MFC after:	1 week
Reviewed by:	glebius
Differential Revision: https://reviews.freebsd.org/D33270
2021-12-26 14:56:04 +00:00