Commit Graph

7202 Commits

Author SHA1 Message Date
Gleb Smirnoff
92b3e07229 Enable net.inet.tcp.nolocaltimewait.
This feature has been used for many years at large sites and
didn't show any pitfalls.
2021-10-28 15:34:00 -07:00
Wojciech Macek
8a727c3df8 mroute: add missing WUNLOCK
Add missing WNLOCK as in all other error cases.

Reported by:		Stormshield
Obtained from:		Semihalf
2021-10-28 07:12:23 +02:00
Wojciech Macek
fb3854845f mroute: fix memory leak
Add MFC to linked list to store incoming packets
before MCAST JOIN was captured.

Sponsored by:		Stormshield
Obtained from:		Semihalf
MFC after:		2 weeks
2021-10-28 07:12:16 +02:00
Gleb Smirnoff
5d3bf5b1d2 rack: Update the fast send block on setsockopt(2)
Rack caches TCP/IP header for fast send, so it doesn't call
tcpip_fillheaders().  After certain socket option changes,
namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update
its fast block to be in sync with the inpcb.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:22:00 -07:00
Gleb Smirnoff
f581a26e46 Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.
Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:22:00 -07:00
Gleb Smirnoff
de156263a5 Several IP level socket options may affect TCP.
After handling them in IP level ctloutput, pass them down to TCP
ctloutput.

We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place
for now, but comment out how it should be handled.

For IPv4 we are interested in IP_TOS and IP_TTL.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:21:59 -07:00
Gleb Smirnoff
fc4d53cc2e Split tcp_ctloutput() into set/get parts.
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:21:59 -07:00
Peter Lei
e28330832b tcp: socket option to get stack alias name
TCP stack sysctl nodes are currently inserted using the stack
name alias. Allow the user to get the current stack's alias to
allow for programatic sysctl access.

Obtained from:	Netflix
2021-10-27 08:21:59 -07:00
Randall Stewart
12752978d3 tcp: The rack stack can incorrectly have an overflow when calculating a burst delay.
If the congestion window is very large the fact that we multiply it by 1000 (for microseconds) can
cause the uint32_t to overflow and we incorrectly calculate a very small divisor. This will then
cause the burst timer to be very large when it should be 0. Instead lets make the three variables
uint64_t and avoid the issue.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32668
2021-10-26 13:17:58 -04:00
Michael Tuexen
b15b053596 tcp: allow new reno functions to be called from other CC modules
Some new reno functions use the internal data, but are also called
from functions of other CC modules. Ensure that in this case, the
internal data is not accessed.

Reported by:		syzbot+1d219ea351caa5109d4b@syzkaller.appspotmail.com
Reported by:    	syzbot+b08144f8cad9c67258c5@syzkaller.appspotmail.com
Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D32649
2021-10-25 22:53:49 +02:00
Gleb Smirnoff
f2d266f3b0 Don't run ip_ctloutput() for divert socket.
It was here since divert(4) was introduced, probably just came with a
protocol definition boilerplate.  There is no useful socket option
that can be set or get for a divert socket.

Reviewed by:		donner
Differential Revision:	https://reviews.freebsd.org/D32608
2021-10-25 11:16:59 -07:00
Gleb Smirnoff
d89c820b0d Remove div_ctlinput().
This function does nothing since 97d8d152c2. It was introduced
in 252f24a2cf with a sidenote "may not be needed".

Reviewed by:		donner
Differential Revision:	https://reviews.freebsd.org/D32608
2021-10-25 11:16:49 -07:00
Gleb Smirnoff
c8ee75f231 Use network epoch to protect local IPv4 addresses hash.
The modification to the hash are already naturally locked by
in_control_sx.  Convert the hash lists to CK lists. Remove the
in_ifaddr_rmlock. Assert the network epoch where necessary.

Most cases when the hash lookup is done the epoch is already entered.
Cover a few cases, that need entering the epoch, which mostly is
initial configuration of tunnel interfaces and multicast addresses.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D32584
2021-10-22 14:40:53 -07:00
Randall Stewart
4e4c84f8d1 tcp: Add hystart-plus to cc_newreno and rack.
TCP Hystart draft version -03:
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus

Is a new version of hystart that allows one to carefully exit slow start if the RTT
spikes too much. The newer version has a slower-slow-start so to speak that then
kicks in for five round trips. To see if you exited too early, if not into congestion avoidance.
This commit will add that feature to our newreno CC and add the needed bits in rack to
be able to enable it.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D32373
2021-10-22 07:10:28 -04:00
Roy Marples
5c5340108e net: Allow binding of unspecified address without address existance
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).

An equivalent test has existed since 4.4-Lite.  It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).

In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es).  In practice no work will be
avoided.

Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses.  It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.

The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs.  As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.

PR:		253166
Reviewed by:	ae, bz, donner, emaste, glebius
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32563
2021-10-20 19:25:51 -04:00
Gleb Smirnoff
9b7501e797 in_mcast: garbage collect inp_gcmoptions()
It is is used only once, merge it into inp_freemoptions().
2021-10-18 11:36:07 -07:00
Gleb Smirnoff
0f617ae48a Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c. 2021-10-18 10:19:57 -07:00
Gleb Smirnoff
744a64bd92 in_pcb: garbage collect in_pcbrele() 2021-10-18 10:07:16 -07:00
Gleb Smirnoff
5a78df20ce in_pcb: garbage collect unused structure in_pcblist 2021-10-18 10:06:39 -07:00
Maxim Sobolev
461e6f23db Fix fragmented UDP packets handling since rev.360967.
Consider IP_MF flag when checking length of the UDP packet to
match the declared value.

Sponsored by:	Sippy Software, Inc.
Differential Revision:	https://reviews.freebsd.org/D32363
MFC after:	2 weeks
2021-10-15 16:48:12 -07:00
Gleb Smirnoff
2144431c11 Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.

Next step would be to CK-ify the in_addr hash and get rid of the...

Reviewed by:		melifaro
Differential Revision:	https://reviews.freebsd.org/D32434
2021-10-13 10:04:46 -07:00
Marko Zec
bc8b8e106b [fib_algo][dxr] Retire counters which are no longer used
The number of chunks can still be tracked via vmstat -z|fgrep dxr.

MFC after:	3 days
2021-10-09 13:47:10 +02:00
Marko Zec
1549575f22 [fib_algo][dxr] Improve incremental updating strategy
Tracking the number of unused holes in the trie and the range table
was a bad metric based on which full trie and / or range rebuilds
were triggered, which would happen in vain by far too frequently,
particularly with live BGP feeds.

Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.

MFC after:	3 days
PR:		257965
2021-10-09 13:22:27 +02:00
Michael Tuexen
bd19202c92 sctp: improve KASSERT messages
MFC after:	1 week
2021-10-08 11:33:56 +02:00
Michael Tuexen
3ff3733991 sctp: don't keep being locked on a stream which is removed
Reported by:	syzbot+f5f551e8a3a0302a4914@syzkaller.appspotmail.com
MFC after:	1 week
2021-10-02 00:48:01 +02:00
Randall Stewart
a36230f75e tcp: Make dsack stats available in netstat and also make sure its aware of TLP's.
DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics
on DSACKs however are very useful in figuring out how much bad retransmissions you
are doing. This is further complicated, however, by stacks that do TLP. A TLP
when discovering a lost ack in the reverse path will cause the generation
of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well
as the more traditional dsack-bytes and dsack-packets. These will now
all display in netstat -p tcp -s. This also updates all stacks that
are currently built to keep track of these stats.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32158
2021-10-01 10:36:27 -04:00
Michael Tuexen
28ea947078 sctp: provide a specific stream scheduler function for FCFS
A KASSERT in the genric routine does not apply and triggers
incorrectly.

Reported by:	syzbot+8435af157238c6a11430@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-29 02:08:37 +02:00
Michael Tuexen
fa947a3687 sctp: cleanup and adding KASSERT()s, no functional change
MFC after:	1 week
2021-09-28 20:31:12 +02:00
Michael Tuexen
5b53e749a9 sctp: fix usage of stream scheduler functions
sctp_ss_scheduled() should only be called for streams that are
scheduled. So call sctp_ss_remove_from_stream() before it.
This bug was uncovered by the earlier cleanup.

Reported by:	syzbot+bbf739922346659df4b2@syzkaller.appspotmail.com
Reported by:	syzbot+0a0857458f4a7b0507c8@syzkaller.appspotmail.com
Reported by:	syzbot+a0b62c6107b34a04e54d@syzkaller.appspotmail.com
Reported by:	syzbot+0aa0d676429ebcd53299@syzkaller.appspotmail.com
Reported by:	syzbot+104cc0c1d3ccf2921c1d@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-28 05:25:58 +02:00
Michael Tuexen
171633765c sctp: avoid locking an already locked mutex
Reported by:	syzbot+f048680690f2e8d7ddad@syzkaller.appspotmail.com
Reported by:	syzbot+0725c712ba89d123c2e9@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-28 05:17:03 +02:00
Gordon Bergling
d2e616147d sctp: Fix a typo in a comment
- s/assue/assume/

MFC after:	3 days
2021-09-26 15:15:39 +02:00
Marko Zec
43880c511c [fib_algo][dxr] Split unused range chunk list in multiple buckets
Traversing a single list of unused range chunks in search for a block
of optimal size was suboptimal.

The experience with real-world BGP workloads has shown that on average
unused range chunks are tiny, mostly in length from 1 to 4 or 5, when
DXR is configured with K = 20 which is the current default (D16X4R).

Therefore, introduce a limited amount of buckets to accomodate descriptors
of empty blocks of fixed (small) size, so that those can be found in O(1)
time.  If no empty chunks of the requested size can be found in fixed-size
buckets, the search continues in an unsorted list of empty chunks of
variable lengths, which should only happen infrequently.

This change should permit us to manage significantly more empty range
chunks without sacrifying the speed of incremental range table updating.

MFC after:	3 days
2021-09-25 06:29:48 +02:00
Randall Stewart
1ca931a540 tcp: Rack compressed ack path updates the recv window too easily
The compressed ack path of rack is not following proper procedures in updating
the peers window. It should be checking the seq and ack values before updating and
instead it is blindly updating the values. This could in theory get the wrong window
in the connection for some length of time.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32082
2021-09-23 11:43:29 -04:00
Randall Stewart
fd69939e79 tcp: Two bugs in rack one of which can lead to a panic.
In extensive testing in NF we have found two issues inside
the rack stack.

1) An incorrect offset is being generated by the fast send path when a fast send is initiated on
   the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket.
   This fools the fast send code into thinking the sb offset changed and it miscalculates a "updated offset".
   It should only do that when the mbuf in question got smaller.. i.e. an ack was processed. This can lead to
   a panic deref'ing a NULL mbuf if that packet is ever retransmitted. At the best case it leads to invalid data being
   sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path)
   to make sure we only update the offset when the mbuf shrinks.
2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when
   comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error
   prone results but causes only small harm in trying to identify which send to use in RTT calculations if its a retransmit.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32062
2021-09-23 10:54:23 -04:00
Michael Tuexen
414499b3f9 sctp: Cleanup stream schedulers.
No functional change intended.

MFC after:	1 week
2021-09-23 14:16:56 +02:00
Michael Tuexen
762ae0ec8d sctp: Simplify stream scheduler usage
Callers are getting the stcb send lock, so just KASSERT that.
No need to signal this when calling stream scheduler functions.
No functional change intended.

MFC after:	1 week
2021-09-21 17:13:57 +02:00
Michael Tuexen
0b79a76f84 sctp: improve consistency when calling stream scheduler
Hold always the stcb send lock when calling sctp_ss_init() and
sctp_ss_remove_from_stream().

MFC after:	1 week
2021-09-21 00:54:13 +02:00
Michael Tuexen
34b1efcea1 sctp: use a valid outstream when adding it to the scheduler
Without holding the stcb send lock, the outstreams might get
reallocated if the number of streams are increased.

Reported by:	syzbot+4a5431d7caa666f2c19c@syzkaller.appspotmail.com
Reported by:	syzbot+aa2e3b013a48870e193d@syzkaller.appspotmail.com
Reported by:	syzbot+e4368c3bde07cd2fb29f@syzkaller.appspotmail.com
Reported by:	syzbot+fe2f110e34811ea91690@syzkaller.appspotmail.com
Reported by:	syzbot+ed6e8de942351d0309f4@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-20 15:52:10 +02:00
Marko Zec
2ac039f7be [fib_algo][dxr] Merge adjacent empty range table chunks.
MFC after:	3 days
2021-09-20 06:30:45 +02:00
Michael Tuexen
e19d93b19d sctp: fix FCFS stream scheduler
Reported by:	syzbot+c6793f0f0ce698bce230@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-19 11:56:26 +02:00
Mark Johnston
bf25678226 ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden.  Fix this by using a separate output
parameter.  Convert to the new socket buffer locking macros while here.

Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31977
2021-09-17 14:19:05 -04:00
Mike Karels
fd0765933c Change lowest address on subnet (host 0) not to broadcast by default.
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122.  Until
now, we would broadcast the host zero address as well as the configured
address.  Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it.  Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.

See https:/datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316.  Note, Linux
now implements this.

Reviewed by:	rgrimes, tuexen; melifaro (previous version)
MFC after:	1 month
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D31861
2021-09-16 19:42:20 -05:00
Marko Zec
eb3148cc4d [fib algo][dxr] Fix division by zero.
A division by zero would occur if DXR would be activated on a vnet
with no IP addresses configured on any interfaces.

PR:		257965
MFC after:	3 days
Reported by:	Raul Munoz
2021-09-16 16:34:05 +02:00
Marko Zec
b51f8bae57 [fib algo][dxr] Optimize trie updating.
Don't rebuild in vain trie parts unaffected by accumulated incremental
RIB updates.

PR:		257965
Tested by:	Konrad Kreciwilk
MFC after:	3 days
2021-09-15 22:42:49 +02:00
Marko Zec
442c8a245e [fib algo][dxr] Fix undefined behavior.
The result of shifting uint32_t by 32 (or more) is undefined: fix it.
2021-09-15 22:42:48 +02:00
Hans Petter Selasky
e3e7d95332 tcp: Avoid division by zero when KERN_TLS is enabled in tcp_account_for_send().
If the "len" variable is non-zero, we can assume that the sum of
"tp->t_snd_rxt_bytes + tp->t_sndbytes" is also non-zero.

It is also assumed that the 64-bit byte counters will never wrap around.

Differential Revision:	https://reviews.freebsd.org/D31959
Reviewed by:	gallatin, rrs and tuexen
Found by:	"I told you so", also called hselasky
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-09-15 18:05:31 +02:00
Michael Tuexen
4542164685 sctp: cleanup, no functional change intended
MFC after:	1 week
2021-09-15 10:18:11 +02:00
John Baldwin
c782ea8bb5 Add a switch structure for send tags.
Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'.  A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions.  Now, those type-specific functions are saved
in the switch structure and invoked directly.  In addition, this more
gracefully permits multiple implementations of the same tag within a
driver.  In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.

Reviewed by:	gallatin, hselasky, kib (older version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D31572
2021-09-14 11:43:41 -07:00
Mark Johnston
e6c19aa94d sctp: Allow blocking on I/O locks even with non-blocking sockets
There are two flags to request a non-blocking receive on a socket:
MSG_NBIO and MSG_DONTWAIT.  They are handled a bit differently in that
soreceive_generic() and soreceive_stream() will block on the socket I/O
lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set.  In general,
MSG_NBIO seems to mean, "don't block if there is no data to receive" and
MSG_DONTWAIT means "don't go to sleep for any reason".

SCTP's soreceive implementation did not allow blocking on the I/O lock
if either flag is set, but this violates an assumption in
aio_process_sb(), which specifies MSG_NBIO but nonetheless
expects to make progress if data is available to read.  Change
sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT
is not set.

Reported by:	syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31915
2021-09-14 09:02:05 -04:00
Michael Tuexen
29545986bd sctp: avoid LOR
Don't lock the inp-info lock while holding an stcb lock.

MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D31921
2021-09-12 21:11:14 +02:00
Michael Tuexen
4181fa2a20 sctp: minor cleanup, no functional change
MFC after:	1 week
2021-09-12 19:21:15 +02:00
Mark Johnston
2d5c48eccd sctp: Tighten up locking around sctp_aloc_assoc()
All callers of sctp_aloc_assoc() mark the PCB as connected after a
successful call (for one-to-one-style sockets).  In all cases this is
done without the PCB lock, so the PCB's flags can be corrupted.  We also
do not atomically check whether a one-to-one-style socket is a listening
socket, which violates various assumptions in solisten_proto().

We need to hold the PCB lock across all of sctp_aloc_assoc() to fix
this.  In order to do that without introducing lock order reversals, we
have to hold the global info lock as well.

So:
- Convert sctp_aloc_assoc() so that the inp and info locks are
  consistently held.  It returns with the association lock held, as
  before.
- Fix an apparent bug where we failed to remove an association from a
  global hash if sctp_add_remote_addr() fails.
- sctp_select_a_tag() is called when initializing an association, and it
  acquires the global info lock.  To avoid lock recursion, push locking
  into its callers.
- Introduce sctp_aloc_assoc_connected(), which atomically checks for a
  listening socket and sets SCTP_PCB_FLAGS_CONNECTED.

There is still one edge case in sctp_process_cookie_new() where we do
not update PCB/socket state correctly.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31908
2021-09-11 10:15:21 -04:00
orange30
f5777c123a net: Fix memory leaks upon arp_fillheader() failures
Free memory before return from arprequest_internal().  In in_arpinput(),
if arp_fillheader() fails, it should use goto drop.

Reviewed by:	melifaro, imp, markj
MFC after:	1 week
Pull Request:	https://github.com/freebsd/freebsd-src/pull/534
2021-09-10 09:45:26 -04:00
Michael Tuexen
3ea2cdd45e sctp: add explicit cast, no functional change intended
MFC after:	3 days
2021-09-09 19:13:47 +02:00
Michael Tuexen
0c1a20beb4 sctp: use appropriate argument when freeing association
Reported by:	syzbot+7fe26e26911344e7211d@syzkaller.appspotmail.com
MFC after:	3 days
2021-09-09 18:01:35 +02:00
Mark Johnston
4250aa1188 sctp: Clear assoc socket references when freeing a PCB
This restores behaviour present in the first import of SCTP.  Commit
ceaad40ae7 commented this out and commit
62fb761ff2 removed it.  However, once
sctp_inpcb_free() returns, the socket reference is gone no matter what,
so we need to clear it.

Reported by:	syzbot+30dd69297fcbc5f0e10a@syzkaller.appspotmail.com
Reported by:	syzbot+7b2f9d4bcac1c9569291@syzkaller.appspotmail.com
Reported by:	syzbot+ed3e651f7d040af480a6@syzkaller.appspotmail.com
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31886
2021-09-09 08:33:26 -04:00
Michael Tuexen
58a7bf124c sctp: cleanup timewait handling for vtags
MFC after:	1 week
2021-09-09 01:18:58 +02:00
Mark Johnston
ee4731179c sctp: Fix a lock order reversal in sctp_swap_inpcb_for_listen()
When port reuse is enabled in a one-to-one-style socket, sctp_listen()
may call sctp_swap_inpcb_for_listen() to move the PCB out of the "TCP
pool".  In so doing it will drop the PCB lock, yielding an LOR since we
now hold several socket locks.  Reorder sctp_listen() so that it
performs this operation before beginning the conversion to a listening
socket.  Also modify sctp_swap_inpcb_for_listen() to return with PCB
write-locked, since that's what sctp_listen() expects now.

Reviewed by:	tuexen
Fixes:	bd4a39cc93 ("socket: Properly interlock when transitioning to a listening socket")
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31879
2021-09-08 11:41:19 -04:00
Mark Johnston
6e3af6321b sctp: Fix lock recursion in sctp_swap_inpcb_for_listen()
After commit bd4a39cc93 we now hold the global inp info lock across
the call to sctp_swap_inpcb_for_listen(), which attempts to acquire it
again.  Since sctp_swap_inpcb_for_listen()'s sole caller is
sctp_listen(), we can simply change it to not try to acquire the lock.

Reported by:	syzbot+a76b19ea2f8e1190c451@syzkaller.appspotmail.com
Reported by:	syzbot+a1b6cef257ad145b7187@syzkaller.appspotmail.com
Reviewed by:	tuexen
Fixes:	bd4a39cc93 ("socket: Properly interlock when transitioning to a listening socket")
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31878
2021-09-08 11:41:18 -04:00
Michael Tuexen
aab1d593b2 sctp: minor cleanups, no functional change intended 2021-09-08 15:13:49 +02:00
Alexander V. Chernikov
4b631fc832 routing: fix source address selection rules for IPv4 over IPv6.
Current logic always selects an IFA of the same family from the
 outgoing interfaces. In IPv4 over IPv6 setup there can be just
 single non-127.0.0.1 ifa, attached to the loopback interface.

Create a separate rt_getifa_family() to handle entire ifa selection
 for the IPv4 over IPv6.

Differential Revision: https://reviews.freebsd.org/D31868
MFC after:	1 week
2021-09-07 21:41:05 +00:00
Mark Johnston
c4b44adcf0 sctp: Remove special handling for a listen(2) backlog of 0
... when applied to one-to-one-style sockets.  sctp_listen() cannot be
used to toggle the listening state of such a socket.  See RFC 6458's
description of expected listen(2) semantics for one-to-one- and
one-to-many-style sockets.

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31774
2021-09-07 17:12:09 -04:00
Mark Johnston
bd4a39cc93 socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the
following:

	SOCK_LOCK(so);
	error = solisten_proto_check(so);
	if (error) {
		SOCK_UNLOCK(so);
		return (error);
	}
	solisten_proto(so);
	SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called.  Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by:	syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31659
2021-09-07 17:11:43 -04:00
Mark Johnston
f94acf52a4 socket: Rename sb(un)lock() and interlock with listen(2)
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK().  These operate on a
socket rather than a socket buffer.  Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket.  Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before.  In
particular, add checks to sendfile() and sorflush().

Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31657
2021-09-07 15:06:48 -04:00
Mark Johnston
173a7a4ee4 sctp: Fix iterator synchronization in sctp_sendall()
- The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads
  could observe that the flag is not set and then both set it.  I'm not
  sure if this is actually a problem in practice, i.e., maybe there's no
  problem having multiple sends for a single PCB in the iterator list?
- sctp_sendall() was modifying sctp_flags without the inp lock held.

The change simply acquires the PCB write lock before toggling the flag,
fixing both problems.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31813
2021-09-07 11:19:29 -04:00
Mark Johnston
e8e23ec127 sctp: Remove an unused sctp_inpcb field
This appears to be unused in usrsctp as well.  No functional change
intended.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31812
2021-09-07 11:19:29 -04:00
Mark Johnston
c17b531bed sctp: Fix races around sctp_inpcb_free()
sctp_close() and sctp_abort() disassociate the PCB from its socket.
As a part of this, they attempt to free the PCB, which may end up
lingering.  Fix some bugs in this area:

- For some reason, sctp_close() and sctp_abort() set
  SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the
  PCB lock held.  This is racy since sctp_flags is normally updated
  without atomics, using the PCB lock to synchronize.  So, the update
  can be lost, which can cause all sort of races with other SCTP
  components which look for the _GONE flag.  Fix the problem simply by
  acquiring the PCB lock in order to set the flag.  Note that we have to
  drop and re-acquire the lock again in sctp_inpcb_free(), but I don't
  see a good way around that for now.  If it's a real problem, the _GONE
  flag could be split out of sctp_flags and into a dedicated sctp_inpcb
  field.
- In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock,
  to avoid possible races with parallel sctp_inpcb_free() calls.
- Add an assertion sctp_inpcb_free() to verify that _ALLGONE is not set.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31811
2021-09-07 11:19:29 -04:00
Alexander V. Chernikov
936f4a42fa lltable: do not require prefix lookup when checking lle allocation rules.
With the new FIB_ALGO infrastructure, nearly all subsystems use
 fib[46]_lookup() functions, which provides lockless lookups.
A number of places remains that uses old-style lookup functions, that
 still requires RIB read lock to return the result. One of such places
 is arp processing code.
FIB_ALGO implementation makes some tradeoffs, resulting in (relatively)
 prolonged periods of holding RIB_WLOCK. If the lock is held and datapath
 competes for it, the RX ring may get blocked, ending in traffic delays and losses.
As currently arp processing is performed directly in the interrupt handler,
 handling ARP replies triggers the problem descibed above when the amount of
 ARP replies is high.

To be more specific, prior to creating new ARP entry, routing lookup for the entry
 address in interface fib is executed. The following conditions are the verified:

1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable,
 failure is returned. The only exception are host routes w/ gateway==address.
2. If the routing lookup returns different interface and non-host route,
 we want to support the use case of having multiple interfaces with the same prefix.
 In fact, the current code just checks if the returned prefix covers target address
 (always true) and effectively allow allocating ARP entries for any directly-reachable prefix,
 regardless of its interface.

Change the code to perform the following:

1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix.
2) Rewrite first condition check using nexthop flags (1:1 match)
3) Rewrite second condition to check for interface addresses matching target address on
 the input interface.

Differential Revision: https://reviews.freebsd.org/D31824
Reviewed by:	ae
MFC after:	1 week
PR:	257965
2021-09-06 21:03:22 +00:00
Gordon Bergling
631504fb34 Fix a common typo in source code comments
- s/existant/existent/

MFC after:	3 days
2021-09-04 12:56:57 +02:00
Mark Johnston
c98bf2a45e sctp: Always check for a vanishing inpcb when processing COOKIE-ECHO
We previously did this only in the normal case where no association
exists yet.  However, it is not safe to process COOKIE-ECHO even if an
association exists, as sctp_process_cookie_existing() may dereference
the socket pointer.

See also commit 0c7dc84076.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31755
2021-09-01 10:28:17 -04:00
Mark Johnston
d35be50f57 sctp: Hold association locks across socket wakeups when freeing
At this point we do not hold the inpcb lock, so the only thing holding
the socket reference live is the TCB lock, which needs to be acquired by
sctp_inpcb_free() in order to destroy associations.  Defer the unlock to
until after we dereference the socket reference.

Reported by:	syzbot+1d0f2c4675de76a4cf1e@syzkaller.appspotmail.com
Reported by:	syzbot+fabee77954fe69d3a5ad@syzkaller.appspotmail.com
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31754
2021-09-01 10:27:51 -04:00
Mark Johnston
65f30a39e1 sctp: Release the socket reference when detaching an association
Later in sctp_free_assoc(), when we clean up chunk lists,
sctp_free_spbufspace() is used to reset the byte count in the socket
send buffer.  However, if the PCB is going away, the socket may already
have been detached from the PCB, in which case this becomes a use-after
free.  Clear the socket reference from the association before detaching
it from the PCB, if the PCB has already lost its socket reference.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31753
2021-09-01 10:27:31 -04:00
Mark Johnston
457abbb857 sctp: Implement sctp_inpcb_bind_locked()
This will be used by sctp_listen() to avoid dropping locks when
performing an implicit bind.  No functional change intended.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31757
2021-09-01 10:06:18 -04:00
Mark Johnston
be8ee77e9e sctp: Add macros to assert on inp info lock state
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31756
2021-09-01 10:06:18 -04:00
Mark Johnston
4a36122b1d sctp: Fix racy UNBOUND flag check in sctp_inpcb_bind()
SCTP needs to avoid binding a given socket twice.  The check used to
avoid this is racy since neither the inpcb lock nor the global info lock
is held.  Fix it by synchronizing using the global info lock.  In
particular, sctp_inpcb_bind() may drop the inpcb lock in some cases, but
the info lock is sufficient to prevent double insertion into PCB hash
tables.

Reported by:	syzbot+548a8560d959669d0e12@syzkaller.appspotmail.com
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31734
2021-08-31 07:44:42 -04:00
Mark Johnston
2496d812a9 sctp: Simplify the free port search in sctp_inpcb_bind()
Eliminate a flag variable and reduce indentation.  No functional change
intended.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31733
2021-08-31 07:43:39 -04:00
Mark Johnston
93908fce72 sctp: Avoid unnecessary refcount bumps in sctp_inpcb_bind()
We only drop the inp lock when binding to a specific port.  So, only
acquire an extra reference when required.  This simplifies error
handling a bit.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31732
2021-08-31 07:43:27 -04:00
Mark Johnston
0d29e4bc01 sctp: Remove always-false checks in sctp_inpcb_bind()
No functional change intended.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31731
2021-08-31 07:43:13 -04:00
Gordon Bergling
586c9dc374 inet(3): Fix a few common typos in source code comments
- s/funtion/function/

MFC after:	3 days
2021-08-28 18:53:02 +02:00
Mark Johnston
d174534a27 tcp: Remove unused v6 state definitions
These are supposedly for compatibility with KAME, but they are
completely unused in our tree and don't exist in OpenBSD or NetBSD.

Reviewed by:	kbowling, bz, gnn
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31700
2021-08-27 08:31:32 -04:00
Artem Khramov
620cf65c2b netinet: prevent NULL pointer dereference in in_aifaddr_ioctl()
It appears that maliciously crafted ifaliasreq can lead to NULL
pointer dereference in in_aifaddr_ioctl(). In order to replicate
that, one needs to

1. Ensure that carp(4) is not loaded

2. Issue SIOCAIFADDR call setting ifra_vhid field of the request
   to a negative value.

A repro code would look like this.

int main() {
    struct ifaliasreq req;
    struct sockaddr_in sin, mask;
    int fd, error;

    bzero(&sin, sizeof(struct sockaddr_in));
    bzero(&mask, sizeof(struct sockaddr_in));

    sin.sin_len = sizeof(struct sockaddr_in);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = inet_addr("192.168.88.2");

    mask.sin_len = sizeof(struct sockaddr_in);
    mask.sin_family = AF_INET;
    mask.sin_addr.s_addr = inet_addr("255.255.255.0");

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return (-1);

    memset(&req, 0, sizeof(struct ifaliasreq));
    strlcpy(req.ifra_name, "lo0", sizeof(req.ifra_name));
    memcpy(&req.ifra_addr, &sin, sin.sin_len);
    memcpy(&req.ifra_mask, &mask, mask.sin_len);
    req.ifra_vhid = -1;

    return ioctl(fd, SIOCAIFADDR, (char *)&req);
}

To fix, discard both positive and negative vhid values in
in_aifaddr_ioctl, if carp(4) is not loaded. This prevents NULL pointer
dereference and kernel panic.

Reviewed by:	imp@
Pull Request:	https://github.com/freebsd/freebsd-src/pull/530
2021-08-26 12:08:03 -06:00
Michael Tuexen
dc6ab77d66 tcp: make network epoch expectations of LRO explicit
Reviewed by:		gallatin, hselasky
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D31648
2021-08-25 17:12:36 +02:00
Zhenlei Huang
62e1a437f3 routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549).
Implement kernel support for RFC 5549/8950.

* Relax control plane restrictions and allow specifying IPv6 gateways
 for IPv4 routes. This behavior is controlled by the
 net.route.rib_route_ipv6_nexthop sysctl (on by default).

* Always pass final destination in ro->ro_dst in ip_forward().

* Use ro->ro_dst to exract packet family inside if_output() routines.
 Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.

* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
 It leverages recent lltable changes committed in c541bd368f.

Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
  route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0

Differential Revision: https://reviews.freebsd.org/D30398
MFC after:	2 weeks
2021-08-22 22:56:08 +00:00
Alexander V. Chernikov
c541bd368f lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries.
Currently we use pre-calculated headers inside LLE entries as prepend data
 for `if_output` functions. Using these headers allows saving some
 CPU cycles/memory accesses on the fast path.

However, this approach makes adding L2 header for IPv4 traffic with IPv6
 nexthops more complex, as it is not possible to store multiple
 pre-calculated headers inside lle. Additionally, the solution space is
 limited by the fact that PCB caching saves LLEs in addition to the nexthop.

Thus, add support for creating special "child" LLEs for the purpose of holding
 custom family encaps and store mbufs pending resolution. To simplify handling
 of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
 Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
 to the "parent" LLE - it is not possible to delete "child" when parent is alive.
 Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
 machine used by the standard LLEs.

nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
 allowing to return such child LLEs. This change uses `LLE_SF()` macro which
 packs family and flags in a single int field. This is done to simplify merging
 back to stable/. Once this code lands, most of the cases will be converted to
 use a dedicated `family` parameter.

Differential Revision: https://reviews.freebsd.org/D31379
MFC after:	2 weeks
2021-08-21 17:34:35 +00:00
Michael Tuexen
a3665770d7 sctp: improve handling of illegal parameters of INIT-ACK chunks
MFC after:	3 days
2021-08-20 14:06:41 +02:00
Luiz Otavio O Souza
20ffd88ed5 ipfw: use unsigned int for dummynet bandwidth
This allows the maximum value of 4294967295 (~4Gb/s) instead of previous
value of 2147483647 (~2Gb/s).

Reviewed by:	np, scottl
Obtained from:	pfSense
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31582
2021-08-19 10:48:53 +02:00
Michael Tuexen
eba8e643b1 sctp: improve handling of INIT chunks with invalid parameters
MFC after:	3 days
2021-08-19 00:33:28 +02:00
Randall Stewart
5baf32c97a tcp: Add support for DSACK based reordering window to rack.
The rack stack, with respect to the rack bits in it, was originally built based
on an early I-D of rack. In fact at that time the TLP bits were in a separate
I-D. The dynamic reordering window based on DSACK events was not present
in rack at that time. It is now part of the RFC and we need to update our stack
to include these features. However we want to have a way to control the feature
so that we can, if the admin decides, make it stay the same way system wide as
well as via socket option. The new sysctl and socket option has the following
meaning for setting:

00 (0) - Keep the old way, i.e. reordering window is 1 and do not use DSACK bytes to add to reorder window
01 (1) - Change the Reordering window to 1/4 of an RTT but do not use DSACK bytes to add to reorder window
10 (2) - Keep the reordering window as 1, but do use SACK bytes to add additional 1/4 RTT delay to the reorder window
11 (3) - reordering window is 1/4 of an RTT and add additional DSACK bytes to increase the reordering window (RFC behavior)

The default currently in the sysctl is 3 so we get standards based behavior.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31506
2021-08-17 16:29:22 -04:00
Mateusz Guzik
3be3cbe06d ip_reass: do less work in ipreass_slowtimo if possible
ipreass_slowtimo avoidably uses CPU on otherwise idle boxes

Reviewed by:	kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31526
2021-08-14 18:50:12 +02:00
Mateusz Guzik
d2b95af1c2 ip_reass: drop the volatile keyword from nfrags and mark with __exclusive_cache_line
The keyword adds nothing as all operations on the var are performed
through atomic_*

Reviewed by:	kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31526
2021-08-14 18:49:30 +02:00
Wojciech Macek
f61cb12aac mroute: fix locking issues
In some cases the code may fall into deadlock.
Avoid calling epoch_wait when W-lock is taken.

Sponsored by:	Stormshield
Obtained from:	Semihalf
2021-08-13 11:06:17 +02:00
Eric van Gyzen
13a58148de netdump: send key before dump, in case dump fails
Previously, if an encrypted netdump failed, such as due to a timeout or
network failure, the key was not saved, so a partial dump was
completely useless.

Send the key first, so the partial dump can be decrypted, because even a
partial dump can be useful.

Reviewed by:	bdrewery, markj
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31453
2021-08-11 10:54:56 -05:00
Michael Tuexen
3808ab732e sctp: remove some set, but unused variables
Thanks to pkasting for submitting the patch for the userland stack.

MFC after:	3 days
2021-08-09 15:58:46 +02:00
Wojciech Macek
d9d59bb1af ipsec: Handle ICMP NEEDFRAG message.
It will be needed for upcoming PMTU implementation in ipsec.
    For now simply create/update an entry in tcp hostcache when needed.
    The code is based on https://people.freebsd.org/~ae/ipsec_transport_mode_ctlinput.diff

Authored by: 		Kornel Duleba <mindal@semihalf.com>
Differential revision: 	https://reviews.freebsd.org/D30992
Reviewed by: 		tuxen
Sponsored by: 		Stormshield
Obtained from:	 	Semihalf
2021-08-09 12:01:46 +02:00
Alexander V. Chernikov
9748eb7427 Simplify nhop operations in ip_output().
Consistently use `nh` instead of always dereferencing
 ro->ro_nh inside the if block.
Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses
 of updating ro flags based on different nexthop.

Differential Revision: https://reviews.freebsd.org/D31451
Reviewed by:	kp
MFC after: 2 weeks
2021-08-08 09:19:27 +00:00
Gordon Bergling
04389c855e Fix some common typos in comments
- s/configuraiton/configuration/
- s/specifed/specified/
- s/compatiblity/compatibility/

MFC after:	5 days
2021-08-08 10:16:06 +02:00
Michael Tuexen
112899c6af sctp: improve input validation of mapped addresses in sctp_connectx()
MFC after:	3 days
2021-08-07 15:12:09 +02:00
Michael Tuexen
b732091a76 sctp: improve input validation of mapped addresses in send()
Reported by:	syzbot+35528f275f2eea6317cc@syzkaller.appspotmail.com
Reported by:	syzbot+ac29916d5f16d241553d@syzkaller.appspotmail.com
MFC after:	3 days
2021-08-07 14:50:40 +02:00
Hans Petter Selasky
bb5cd80e8b Update the TCP LRO code to handle both encrypted and un-encrypted traffic.
Encrypted and un-encrypted traffic needs to be coalesced separately.
Split the 16-bit lro_type field in the address information into two
8-bit fields, and then use the last 8-bit field for flags, which among
other indicate if the received mbuf is encrypted or un-encrypted.

Differential Revision:	https://reviews.freebsd.org/D31377
Reviewed by:	gallatin
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-08-06 11:28:44 +02:00
Andrew Gallatin
739de953ec ktls: Move KERN_TLS ifdef to tcp_var.h
This allows us to remove stubs in ktls.h and allows us
to sort the function prototypes.

Reviewed by: jhb
Sponsored by: Netflix
2021-08-05 19:17:35 -04:00
Alexander V. Chernikov
8482aa7748 Use lltable calculated header when sending lle holdchain after successful lle resolution.
Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D31391
2021-08-05 20:44:36 +00:00
Michael Tuexen
3f1f6b6ef7 tcp, udp: improve input validation in handling bind()
Reported by:		syzbot+24fcfd8057e9bc339295@syzkaller.appspotmail.com
Reported by:		syzbot+6e90ceb5c89285b2655b@syzkaller.appspotmail.com
Reviewed by:		markj, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D31422
2021-08-05 13:48:44 +02:00
Alexander V. Chernikov
f3a3b06121 [lltable] Unify datapath feedback mechamism.
Use newly-create llentry_request_feedback(),
 llentry_mark_used() and llentry_get_hittime() to
 request datapatch usage check and fetch the results
 in the same fashion both in IPv4 and IPv6.

While here, simplify llentry_provide_feedback() wrapper
 by eliminating 1 condition check.

MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D31390
2021-08-04 22:52:43 +00:00
Konstantin Kukushkin
a61c24ddb7 udp: Fix soroverflow SOCKBUF unlocking
We hold the SOCKBUF_LOCK so use soroverflow_locked here.
This bug may manifest as a non-killable process stuck in [*so_rcv].

Approved by:	scottl
Reviewed by:	Roy Marples <roy@marples.name>
Fixes:	7045b1603b
MFC after:  10 days
Differential Revision:	https://reviews.freebsd.org/D31374
2021-08-01 08:07:33 -07:00
Roy Marples
7045b1603b socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by:	philip (network), kbowling (transport), gbe (manpages)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26652
2021-07-28 09:35:09 -07:00
Bryan Drewery
a573243370 netdump: Fix leaking debugnet state on errors.
Reviewed by:	cem, markj
Sponsored by:	Dell EMC
Differential Revision: https://reviews.freebsd.org/D31319
2021-07-27 09:06:23 -07:00
Mark Johnston
ba21825202 rip: Add missing minimum length validation in rip_output()
If the socket is configured such that the sender is expected to supply
the IP header, then we need to verify that it actually did so.

Reported by:	syzkaller+KMSAN
Reviewed by:	donner
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31302
2021-07-26 16:39:37 -04:00
Kristof Provost
8e1864ed07 pf: syncookie support
Import OpenBSD's syncookie support for pf. This feature help pf resist
TCP SYN floods by only creating states once the remote host completes
the TCP handshake rather than when the initial SYN packet is received.

This is accomplished by using the initial sequence numbers to encode a
cookie (hence the name) in the SYN+ACK response and verifying this on
receipt of the client ACK.

Reviewed by:	kbowling
Obtained from:	OpenBSD
MFC after:	1 week
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D31138
2021-07-20 10:36:13 +02:00
Michael Tuexen
a730d82378 tcp: fix RACK and BBR when using VIMAGE enabled kernel
Fix a bug in VNET handling, which occurs when using specific NICs.
PR:			257195
Reviewed by:		rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D31212
2021-07-20 00:29:18 +02:00
Randall Stewart
db4d2d7222 tcp: When rack or bbr get a pullup failure in the common code, don't free the NULL mbuf.
There is a bug in the error path where rack_bbr_common does a m_pullup() and the pullup fails.
There is a stray mfree(m) after m is set to NULL. This is not a good idea :-)

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31194
2021-07-16 13:59:57 -04:00
Randall Stewart
1d171e5ab9 tcp: Lro needs to validate that it does not go beyond the end of the mbuf as it parses.
Currently the LRO parser, if given a packet that say has ETH+IP header but the TCP header
is in the next mbuf (split), would walk garbage. Lets make sure we keep track as we
parse of the length and return NULL anytime we exceed the length of the mbuf.

Reviewed by: tuexen, hselasky
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31195
2021-07-16 06:07:13 -04:00
Randall Stewart
ca1a7e1021 tcp: TCP_LRO getting bad checksums and sending it in to TCP incorrectly.
In reviewing tcp_lro.c we have a possibility that some drives may send a mbuf into
LRO without making sure that the checksum passes. Some drivers actually are
aware of this and do not call lro when the csum failed, others do not do this and
thus could end up sending data up that we think has a checksum passing when
it does not.

This change will fix that situation by properly verifying that the mbuf
has the correct markings (CSUM VALID bits as well as csum in mbuf header
is set to 0xffff).

Reviewed by: tuexen, hselasky, gallatin
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31155
2021-07-13 12:45:15 -04:00
Stefan Eßer
58080fbca0 libalias: fix divide by zero causing panic
The packet_limit can fall to 0, leading to a divide by zero abort in
the "packets % packet_limit".

An possible solution would be to apply a lower limit of 1 after the
calculation of packet_limit, but since any number modulo 1 gives 0,
the more efficient solution is to skip the modulo operation for
packet_limit <= 1.

Since this is a fix for a panic observed in stable/12, merging this
fix to stable/12 and stable/13 before expiry of the 3 day waiting
period might be justified, if it works for the reporter of the issue.

Reported by:	Karl Denninger <karl@denninger.net>
MFC after:	3 days
2021-07-10 13:08:18 +02:00
Michael Tuexen
105b68b42d sctp: Fix errno in case of association setup failures
Do not report always ETIMEDOUT, but only when appropriate. In
other cases report ECONNABORTED.

MFC after:	3 days
2021-07-09 23:19:25 +02:00
Michael Tuexen
ce64352a70 sctp: provide consistent stream information in case of early errors
While there, make sure the function is called correctly.

MFC after:	3 days
2021-07-09 14:16:59 +02:00
Michael Tuexen
84992a3251 sctp: provide sac_error also for ABORT chunk being sent
Thanks to Florent Castelli for bringing this issue up for the
userland stack and providing an initial patch.

MFC:		3 days
2021-07-09 13:46:27 +02:00
Randall Stewart
7312e4e5cf tcp: Fix 32 bit platform breakage
This fixes the incorrect use of a sysctl add to u64. It
was for a useconds time, but on 32 bit platforms its
not a u64. Instead use the long directive.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31107
2021-07-08 08:16:45 -04:00
Andrew Gallatin
b1e806c0ed tcp: fix alternate stack build with LINT-NO{INET,INET6,IP}
When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.

Reviewed by:	rrs, tuexen
Differential Revision:	https://reviews.freebsd.org/D31094
Sponsored by: Netflix
2021-07-07 13:02:08 -04:00
Randall Stewart
d7955cc0ff tcp: HPTS performance enhancements
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
2021-07-07 07:22:35 -04:00
Randall Stewart
e834f9a44a tcp: Address goodput and TLP edge cases.
There are several cases where we make a goodput measurement and we are running
out of data when we decide to make the measurement. In reality we should not make
such a measurement if there is no chance we can have "enough" data. There is also
some corner case TLP's that end up not registering as a TLP like they should, we
fix this by pushing the doing_tlp setup to the actual timeout that knows it did
a TLP. This makes it so we always have the appropriate flag on the sendmap
indicating a TLP being done as well as count correctly so we make no more
that two TLP's.

In addressing the goodput lets also add a "quality" metric that can be viewed via
blackbox logs so that a casual observer does not have to figure out how good
of a measurement it is. This is needed due to the fact that we may still make
a measurement that is of a poorer quality as we run out of data but still have
a minimal amount of data to make a measurement.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31076
2021-07-06 15:26:37 -04:00
Andrew Gallatin
28d0a740dd ktls: auto-disable ifnet (inline hw) kTLS
Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.

This breaks down when transmits are out of order (eg,
TCP retransmits).  In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted.  This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.

This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.

Reviewed by:	hselasky, markj, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30908
2021-07-06 10:28:32 -04:00
Lutz Donnerhacke
4060e77f49 libalias: Remove a stray directive
Removal of a preprocessor line was missed during development.
Do it now and MFC it together with the other patches.

MFC after:	2 days
2021-07-04 17:54:45 +02:00
Lutz Donnerhacke
2f4d91f9cb libalias: Rewrite HISTORY
Fix the history entry (wrong year) and add the missing recent work.
MFC together with the other patches.

MFC after:	2 days
2021-07-04 17:46:47 +02:00
Lutz Donnerhacke
f284553444 libalias: Fix API bug on initialization
The kernel part of ipfw(8) does initialize LibAlias uncondistionally
with an zeroized port range (allowed ports from 0 to 0).  During
restucturing of libalias, port ranges are used everytime and are
therefor initialized with different values than zero.  The secondary
initialization from ipfw (and probably others) overrides the new
default values and leave the instance in an unfunctional state.  The
obvious solution is to detect such reinitializations and use the new
default value instead.

MFC after:	3 days
2021-07-03 23:03:07 +02:00
Lutz Donnerhacke
b50a4dce18 libalias: Avoid uninitialized expiration
The expiration time of direct address mappings is explicitly
uninitialized.  Expire times are always compared during housekeeping.
Despite the uninitialized value does not harm, it's simpler to just
set it to a reasonable default.  This was detected during valgrinding
the test suite.

MFC after:	3 days
2021-07-03 01:09:18 +02:00
Lutz Donnerhacke
25392fac94 libalias: Fix splay comparsion bug
Comparing elements in a tree requires transitiviy.  If a < b and b < c
then a must be smaller than c.  This way the tree elements are always
pairwise comparable.

Tristate comparsion functions returning values lower, equal, or
greater than zero, are usually implemented by a simple subtraction of
the operands.  If the size of the operands are equal to the size of
the result, integer modular arithmetics kick in and violates the
transitivity.

Example:
Working on byte with 0, 120, and 240. Now computing the differences:
  120 -   0 = 120
  240 - 120 = 120
  240 -   0 = -16

MFC after:	3 days
2021-07-03 00:31:53 +02:00
Michael Tuexen
c7f048ab35 sctp: initialize sequence numbers for ECN correctly
MFC after:	3 days
Reported by:	Junseok Yang (for the userland stack)
2021-06-27 20:14:48 +02:00
Michael Tuexen
6587a2bd1e sctp: Fix length check for ECNE chunks
MFC after:	3 days
2021-06-27 16:10:39 +02:00
Michael Tuexen
870af3f4dc tcp: tolerate missing timestamps
Some TCP stacks negotiate TS support, but do not send TS at all
or not for keep-alive segments. Since this includes modern widely
deployed stacks, tolerate the violation of RFC 7323 per default.

Reviewed by:		rgrimes, rrs, rscheff
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D30740
Sponsored by:		Netflix, Inc.
2021-06-27 16:03:57 +02:00
Randall Stewart
9e4d9e4c4d tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software.
Hardware TLS is now supported in some interface cards and it works well. Except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for change that Drew will be making so that we can "kick out" a
session from hardware TLS.

Reviewed by: mtuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895
2021-06-25 09:30:54 -04:00
Randall Stewart
66aec14a53 tcp: Rack not being very friendly with V6:4 socket and having a connection from V4
There were two bugs that prevented V4 sockets from connecting to
a rack server running a V4/V6 socket. As well as a bug that stops the
mapped v4 in V6 address from working.

Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30885
2021-06-24 14:42:21 -04:00
Wojciech Macek
17ac6d94db ip_mroute: initialize vif ifnet properly
Use if_alloc to ensure all fields of ifnet are allocated
properly

Reported by:   Damien Deville
Sponsored by:  Stormshield
Obtained from: Semihalf
Reviewed by:   mw
Differential revision: https://reviews.freebsd.org/D30608
2021-06-23 10:13:52 +02:00
Lutz Donnerhacke
f70c98a2f5 libalias: Fix compile time warning about unused functions
Compiling libalias results in warnings about unused functions.
Those warnings are caused by clang's heuristic to consider an inline
function as in use, iff the declaration is in a *.c file.
Declarations in *.h files do not emit those warnings.

Hence the declarations must be moved to an extra *.h file.

MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D30844
2021-06-23 10:06:04 +02:00
John Baldwin
a7f6c6fd94 toe: Read-lock the inp in toe_4tuple_check().
tcp_twcheck now expects a read lock on the inp for the SYN case
instead of a write lock.

Reviewed by:	np
Fixes:		1db08fbe3f tcp_input: always request read-locking of PCB for any pure SYN segment.
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D30782
2021-06-22 16:31:01 -07:00
Gleb Smirnoff
c4804b6b0b Unbreak TFO, that was broken with 8d5719aa74. These two assignments
are unneccessary and used to be there before TFO as an invariant.  With
TFO and after 8d5719aa74 the "so" value is still needed.

Reported & tested by:	tuexen
Fixes:	8d5719aa74
2021-06-22 16:03:44 -07:00
Lutz Donnerhacke
d261e57dea libalias: Switch to efficient data structure for incoming traffic
Current data structure is using a hash of unordered lists.  Those
unordered lists are quite efficient, because the least recently
inserted entries are most likely to be used again.  In order to avoid
long search times in other cases, the lists are hashed into many
buckets.  Unfortunatly a search for a miss needs an exhaustive
inspection and a careful definition of the hash.

Splay trees offer a similar feature: Almost O(1) for access of the
least recently used entries, and amortized O(ln(n)) for almost all
other cases.  Get rid of the hash.

Now the data structure should able to quickly react to external
packets without eating CPU cycles for breakfast, preventing a DoS.

PR:		192888
Discussed with:	Dimitry Luhtionov
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30536
2021-06-19 22:12:28 +02:00
Lutz Donnerhacke
935fc93af1 libalias: Switch to efficient data structure for outgoing traffic
Current data structure is using a hash of unordered lists.  Those
unordered lists are quite efficient, because the least recently
inserted entries are most likely to be used again.  In order to avoid
long search times in other cases, the lists are hashed into many
buckets.  Unfortunatly a search for a miss needs an exhaustive
inspection and a careful definition of the hash.

Splay trees offer a similar feature - almost O(1) for access of the
least recently used entries), and amortized O(ln(n) - for almost all
other cases.  Get rid of the hash.

Discussed with:	Dimitry Luhtionov
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30516
2021-06-19 22:09:44 +02:00
Lutz Donnerhacke
d989935b5b libalias: Restructure - Finalize
Note, that the restructuring is done.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30582
2021-06-19 21:58:56 +02:00
Lutz Donnerhacke
fe83900f9f libalias: Restructure - Remove temporary state deleteAllLinks from global struct
The entry deleteAllLinks in the struct libalias is only used to signal
a state between internal calls.  It's not used between API calls.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30604
2021-06-19 21:55:11 +02:00
Lutz Donnerhacke
9efcad61d8 libalias: Restructure - Use AliasRange instead of PORT_BASE
Get rid of PORT_BASE, replace by AliasRange. Simplify code.
Factor out the search for a new port. Improves the perfomance a bit.

Discussed with:	Dimitry Luhtionov
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30581
2021-06-19 21:40:09 +02:00
Lutz Donnerhacke
1178dda53d libalias: Restructure - Table for PPTP
Let PPTP use its own data structure.
Regroup and rename other lists, which are not PPTP.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30580
2021-06-19 21:26:31 +02:00
Lutz Donnerhacke
7b44ff4c52 libalias: Restructure - Group expire handling entries
Reorder the internal structure semantically.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30575
2021-06-19 21:12:27 +02:00
Lutz Donnerhacke
492d3b7109 libalias: Restructure - Group incoming links
Reorder incoming links by grouping of common search terms.
Significant performance improvement for incoming (missing) flows.

Remove LSNAT from outgoing search.
Slight speedup due to less comparsions in the loop.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30574
2021-06-19 21:03:47 +02:00
Lutz Donnerhacke
d4ab07d2ae libalias: Restructure - Cleanup and Use for links
Factor out a common idiom to return found links.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30573
2021-06-19 20:28:53 +02:00
Lutz Donnerhacke
d541903438 libalias: Restructure - Outgoing search
Factor out the outgoing search function.
Preparation for a new data structure.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30572
2021-06-19 20:25:08 +02:00
Lutz Donnerhacke
19dcc4f225 libalias: Restructure - Cleanup _FindLinkIn
Simplify program flow in function _FindLinkIn.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30571
2021-06-19 20:19:16 +02:00
Lutz Donnerhacke
cac129e603 libalias: Restructure - Table for partially links
Separate the partially specified links into a separate data structure.

This would causes a major parformance impact, if there are many of
them.  Use a (smaller) hash table to speed up the partially link
access.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30570
2021-06-19 20:03:08 +02:00
Richard Scheffenegger
74d7fc8753 tcp: Add PRR cwnd reduction for non-SACK loss
This completes PRR cwnd reduction in all circumstances
for the base TCP stack (SACK loss recovery, ECN window reduction,
non-SACK loss recovery), preventing the arriving ACKs to
clock out new data at the old, too high rate. This
reduces the chance to induce additional losses while
recovering from loss (during congested network conditions).

For non-SACK loss recovery, each ACK is assumed to have
one MSS delivered. In order to prevent ACK-split attacks,
only one window worth of ACKs is considered to actually
have delivered new data.

MFC after: 6 weeks
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29441
2021-06-19 19:25:22 +02:00
Lutz Donnerhacke
32f9c2ceb3 libalias: Restructure - Separate fully qualified search
Search fully specified links first.  Some performance loss due to need
to revisit the db twice, if not found.

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30569
2021-06-19 19:21:05 +02:00
Lutz Donnerhacke
d41044ddfd libalias: Restructure - Common search terms
Factor out the common Out and In filter
Slightly better performance due to eager skip of search loop

MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30568
2021-06-19 18:58:52 +02:00