Commit Graph

6733 Commits

Author SHA1 Message Date
Alexander V. Chernikov
4684d3cbcb Remove per-AF radix_mpath initializtion functions.
Split their functionality by moving random seed allocation
 to SYSINIT and calling (new) generic multipath function from
 standard IPv4/IPv5 RIB init handlers.

Differential Revision:	https://reviews.freebsd.org/D24356
2020-04-11 07:37:08 +00:00
Warner Losh
28540ab153 Fix copyright year and eliminate the obsolete all rights reserved line.
Reviewed by: rrs@
2020-04-08 17:55:45 +00:00
Michael Tuexen
f4cb790a35 Do more argument validation under INVARIANTS when starting/stopping
an SCTP timer.

MFC after:		1 week
2020-04-06 13:58:13 +00:00
Alexander V. Chernikov
66bc03d415 Use interface fib for proxyarp checks.
Before the change, proxyarp checks for src and dst addresses
 were performed using default fib, breaking multi-fib scenario.

PR:		245181
Submitted by:	Scott Aitken (original version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24244
2020-04-02 20:06:37 +00:00
Michael Tuexen
413c3db101 Allow the TCP backhole detection to be disabled at all, enabled only
for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6.
The current blackhole detection might classify a temporary outage as
an MTU issue and reduces permanently the MSS. Since the consequences of
such a reduction due to a misclassification are much more drastically
for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only.

Reviewed by:		bcr@ (man page), Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24219
2020-03-31 15:54:54 +00:00
Mark Johnston
9b1d850be8 Remove the "config" taskqgroup and its KPIs.
Equivalent functionality is already provided by taskqueue(9), just use
that instead.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-30 14:24:03 +00:00
Michael Tuexen
9aca687811 Small cleanup by using a variable just assigned.
MFC after:		1 week
2020-03-28 22:35:04 +00:00
Michael Tuexen
25ec355353 Handle integer overflows correctly when converting msecs and secs to
ticks and vice versa.
These issues were caught by recently added panic() calls on INVARIANTS
systems.

Reported by:		syzbot+b44787b4be7096cd1590@syzkaller.appspotmail.com
Reported by:		syzbot+35f82d22805c1e899685@syzkaller.appspotmail.com
MFC after:		1 week
2020-03-28 20:25:45 +00:00
Ed Maste
c012cfe68a sys/netinet: remove spurious doubled ;s 2020-03-27 23:10:18 +00:00
Michael Tuexen
d5d190f2f9 Some more uint32_t cleanups, no functional change.
MFC after:		1 week
2020-03-27 21:48:52 +00:00
Michael Tuexen
239e5865df Use uint32_t where it is expected to be used. No functional change.
MFC after:		1 week
2020-03-27 11:08:11 +00:00
Michael Tuexen
7c63520c42 Remove an optimization, which was incorrect a couple of times and
therefore doesn't seem worth to be there.
In this case COOKIE where not retransmitted anymore, when the
socket was already closed.

MFC after:		1 week
2020-03-25 18:20:37 +00:00
Michael Tuexen
37686ccf08 Improve consistency in debug output.
MFC after:		1 week
2020-03-25 18:14:12 +00:00
Michael Tuexen
24187cfe72 Revert https://svnweb.freebsd.org/changeset/base/357829
This introduces a regression reported by koobs@ when running a pyhton
test suite on a loaded system.

This patch resulted in a failing accept() call, when the association
was setup and gracefully shutdown by the peer before accept was called.
So the following packetdrill script would fail:

+0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.0 bind(3, ..., ...) = 0
+0.0 listen(3, 1) = 0
+0.0 < sctp: INIT[flgs=0, tag=1, a_rwnd=15000, os=1, is=1, tsn=1]
+0.0 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=1, ...]
+0.1 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.0 > sctp: COOKIE_ACK[flgs=0]
+0.0 < sctp: DATA[flgs=BE, len=116, tsn=1, sid=0, ssn=0, ppid=0]
+0.0 > sctp: SACK[flgs=0, cum_tsn=1, a_rwnd=..., gaps=[], dups=[]]
+0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0]
+0.0 > sctp: SHUTDOWN_ACK[flgs=0]
+0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0]
+0.0 accept(3, ..., ...) = 4
+0.0 close(3) = 0
+0.0 recv(4, ..., 4096, 0) = 100
+0.0 recv(4, ..., 4096, 0) = 0
+0.0 close(4) = 0

Reported by:		koops@
2020-03-25 15:29:01 +00:00
Michael Tuexen
23e3c0880d Use consistent debug output.
MFC after:		1 week
2020-03-25 13:19:41 +00:00
Michael Tuexen
e056fafd92 Don't restore the vnet too early in error cases.
MFC after:		1 week
2020-03-25 13:18:37 +00:00
Michael Tuexen
7522682e5e Only call panic when building with INVARIANTS.
MFC after:		1 week
2020-03-24 23:04:07 +00:00
Michael Tuexen
a412576e36 Another cleanup of the timer code. Also be more pedantic about the
parameters of the timer start and stop routines. Several inconsistencies
have been fixed in earlier commits. Now they will be catched when running
an INVARIANTS system.

MFC after:	1 week
2020-03-24 22:44:36 +00:00
Michael Tuexen
d084818d9d Cleanup the file and add two ASSERT variants for locks, which will be
used shortly.

MFC after:		1 week
2020-03-23 12:17:13 +00:00
Michael Tuexen
a57fb68b92 More timer cleanups, no functional change.
MFC after:		1 week
2020-03-21 16:12:19 +00:00
Michael Tuexen
fa8ceba9ca Remove a set, but unused variable.
MFC after:		1 week
2020-03-20 14:49:44 +00:00
Michael Tuexen
2bdebd0ce3 A a missing NET_EPOCH_ENTER/NET_EPOCH_EXIT pair. This was affecting
implicit connection setups via sendmsg().

Reported by:		syzbot+febbe3383a0e9b700c1b@syzkaller.appspotmail.com
Reported by:		syzbot+dca98631455d790223ca@syzkaller.appspotmail.com
Reported by:		syzbot+5a71a7760d6bcf11b8cd@syzkaller.appspotmail.com
Reported by:		syzbot+da64217e140444c49f00@syzkaller.appspotmail.com
2020-03-19 23:07:52 +00:00
Michael Tuexen
6fb7b4fbdb Consistently provide arguments for timer start and stop routines.
This is another step in cleaning up timer handling.
MFC after:		1 week
2020-03-19 21:01:16 +00:00
Michael Tuexen
e95b3d7faf Cleanup the stream reset and asconf timer.
MFC after:		1 week
2020-03-19 18:55:54 +00:00
Michael Tuexen
42078d5ada The MTU candidates MUST be a multiple of 4, so make them so.
MFC after:		1 week
2020-03-19 14:37:28 +00:00
Michael Tuexen
0554e01d8b Handle the timers in a consistent sequence according to the definition
of the timer type.
Just a cleanup, no functional change intended.

MFC after:		1 week
2020-03-17 19:20:12 +00:00
Andrew Gallatin
ee7a9e506e Avoid a cache miss accessing an mbuf ext_pgs pointer when doing SW kTLS.
For a Netflix 90Gb/s 100% TLS software kTLS workload, this reduces
the CPI of tcp_m_copym() from ~3.5 to ~2.5 as reported by vtune.

Reviewed by:	jtl, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23998
2020-03-16 14:03:27 +00:00
Michael Tuexen
7ca6e2963f Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since
these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that
instead of TCPSTAT_ADD.

Reviewed by:		jtl@, rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23904
2020-03-12 15:37:41 +00:00
Andrew Gallatin
98085bae8c make lacp's use_numa hashing aware of send tags
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
     lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
     always called by lacp_select_tx_port()

-   pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

-   adds a numa_domain field to if_snd_tag_alloc_params

-   plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by:	rrs, hselasky, jhb
Sponsored by:	Netflix
2020-03-09 13:44:51 +00:00
Hiroki Sato
d726e6331b Fix an issue of net.inet.igmp.stats handler.
The header of (struct igmpstat) could be cleared by sysctl(3).
This can be reproduced by "netstat -s -z -p igmp".

PR:	244584
MFC after:	1 week
2020-03-07 08:41:10 +00:00
Michael Tuexen
9c04fdfd34 When using automatically generated flow labels and using TCP SYN
cookies, use the same flow label for the segments sent during the
handshake and after the handshake.
This fixes a bug by making sure that sc_flowlabel is always stored in
network byte order.

Reviewed by:		bz@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23957
2020-03-04 16:41:25 +00:00
Bjoern A. Zeeb
d2b8fd0da1 Add new ICMPv6 counters for Anti-DoS limits.
Add four new counters for ND6 related Anti-DoS measures.
We split these out into a separate upfront commit so that we only
change the struct size one time.  Implementations using them will
follow.

PR:		157410
Reviewed by:	melifaro
MFC after:	2 weeks
X-MFC:		cannot really MFC this without breaking netstat
Sponsored by:	Netflix (initially)
Differential Revision:	https://reviews.freebsd.org/D22711
2020-03-04 16:20:59 +00:00
Michael Tuexen
6605e5791f Don't send an uninitilised traffic class in the IPv6 header, when
sending a TCP segment from the TCP SYN cache (like a SYN-ACK).
This fix initialises it to zero. This is correct for the ECN bits,
but is does not honor the DSCP what an application might have set via
the IPPROTO_IPV6 level socket options IPV6_TCLASS. That will be
fixed separately.

Reviewed by:		Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23900
2020-03-04 12:22:53 +00:00
Bjoern A. Zeeb
4e1a3ff884 tcp_hpts: make RSS kernel compile again.
Add proper #includes, and #ifdefs and some style fixes to make RSS
kernels compile again.  There are still possible issues with uin16_t
vs. uint_t cpuid which I am not going near.

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D23726
2020-03-03 14:15:30 +00:00
Michael Tuexen
7e1e491f60 Remove stale definitions. The removed definitions are not used right
now and are incompatible with the correct ones in RFC 3168.

Submitted by:		Richard Scheffenegger
Differential Revision:	https://reviews.freebsd.org/D23903
2020-03-01 12:34:27 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Randall Stewart
d7313dc6f5 This commit expands tcp_ratelimit to be able to handle cards
like the mlx-c5 and c6 that require a "setup" routine before
the tcp_ratelimit code can declare and use a rate. I add the
setup routine to if_var as well as fix tcp_ratelimit to call it.
I also revisit the rates so that in the case of a mlx card
of type c5/6 we will use about 100 rates concentrated in the range
where the most gain can be had (1-200Mbps). Note that I have
tested these on a c5 and they work and perform well. In fact
in an unloaded system they pace right to the correct rate (great
job mlx!). There will be a further commit here from Hans that
will add the respective changes to the mlx driver to support this
work (which I was testing with).

Sponsored by:	Netflix Inc.
Differential Revision:	ttps://reviews.freebsd.org/D23647
2020-02-26 13:48:33 +00:00
Pawel Biernacki
295a18d184 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (14 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Approved by:	kib (mentor, blanket)
Differential Revision:	https://reviews.freebsd.org/D23639
2020-02-24 10:47:18 +00:00
Pawel Biernacki
10b49b2302 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (6 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

Mark all nodes in pf, pfsync and carp as MPSAFE.

Reviewed by:	kp
Approved by:	kib (mentor, blanket)
Differential Revision:	https://reviews.freebsd.org/D23634
2020-02-21 16:23:00 +00:00
Michael Tuexen
64f29eb1df Remove an unused timer type.
MFC after:		1 week
2020-02-20 15:37:44 +00:00
Michael Tuexen
868b51f234 Epochify SCTP. 2020-02-18 21:25:17 +00:00
Michael Tuexen
ba0d525006 Remove unused function. 2020-02-18 19:41:55 +00:00
Michael Tuexen
a610bb2120 Fix the non-default stream schedulers such that do not interleave
user messages when it is now allowed.

Thanks to Christian Wright for reporting the issue for the userland
stack and providing a fix for the priority scheduler.

MFC after:		1 week
2020-02-17 18:05:03 +00:00
Michael Tuexen
6b8fba3c5c Don't use uninitialised stack memory if the sysctl variable
net.inet.tcp.hostcache.enable is set to 0.
The bug resulted in using possibly a too small MSS value or wrong
initial retransmission timer settings. Possibly the value used
for ssthresh was also wrong.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui, rgrimes@, tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23687
2020-02-17 14:54:21 +00:00
Hans Petter Selasky
bacb11c9ed Fix kernel panic while trying to read multicast stream.
When VIMAGE is enabled make sure the "m_pkthdr.rcvif" pointer is set
for all mbufs being input by the IGMP/MLD6 code. Else there will be a
NULL-pointer dereference in the netisr code when trying to set the
VNET based on the incoming mbuf. Add an assert to catch this when
queueing mbufs on a netisr to make debugging of similar cases easier.

Found by:	Vladislav V. Prodan
PR:		244002
Reviewed by:	bz@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2020-02-17 09:46:32 +00:00
Mateusz Guzik
6b25673f3f sctp: use new capsicum helpers 2020-02-15 01:29:40 +00:00
Michael Tuexen
a357466592 sack_newdata and snd_recover hold the same value. Therefore, use only
a single instance: use snd_recover also where sack_newdata was used.

Submitted by:		Richard Scheffenegger
Differential Revision:	https://reviews.freebsd.org/D18811
2020-02-13 15:14:46 +00:00
Michael Tuexen
33f8cfdfe4 Whitespace cleanup. No functional change.
Sponsored by:		Netflix, Inc.
2020-02-13 13:58:34 +00:00
Michael Tuexen
56ccb48fd6 Don't panic under INVARIANTS when we can't allocate memory for storing
a vtag in time wait.
This issue was found by running syzkaller.

MFC after:		1 week
2020-02-12 17:05:10 +00:00
Michael Tuexen
ca3de626ec Mark the socket as disconnected when freeing the association the first
time.
This issue was found by running syzkaller.

MFC after:		1 week
2020-02-12 17:02:15 +00:00
Randall Stewart
348404bce1 Lets get the real correct version.. gessh. I need
more coffee evidently.

Sponsored by:	Netflix
2020-02-12 15:26:56 +00:00
Randall Stewart
b8f8a6b719 Opps committed the wrong ratelimit version in the
whitespace cleanup.. Restore it to the proper version.

Sponsored by:	Netfilx Inc.
2020-02-12 13:37:53 +00:00
Randall Stewart
481be5de9d White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by:	Netflix Inc.
2020-02-12 13:31:36 +00:00
Randall Stewart
df341f5986 Whitespace, remove from three files trailing white
space (leftover presents from emacs).

Sponsored by:	Netflix Inc.
2020-02-12 13:07:09 +00:00
Randall Stewart
596ae436ef This small fix makes it so we properly follow
the RFC and only enable ECN when both the
CWR and ECT bits our set within the SYN packet.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D23645
2020-02-12 13:04:19 +00:00
Randall Stewart
3fba40d9f2 Remove all trailing white space from the BBR/Rack fold. Bits
left around by emacs (thanks emacs).
2020-02-12 12:40:06 +00:00
Randall Stewart
d2517ab04b Now that all of the stats framework is
in FreeBSD the bits that disabled stats
when netflix-stats is not defined is no longer
needed. Lets remove these bits so that we
will properly use stats per its definition
in BBR and Rack.

Sponsored by:	Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D23088
2020-02-12 12:36:55 +00:00
Michael Tuexen
8803350d6d Revert https://svnweb.freebsd.org/changeset/base/357761
This was suggested by cem@
2020-02-11 20:02:20 +00:00
Michael Tuexen
9803f01cdb Don't start an SCTP timer using a net, which has been removed.
Submitted by:		Taylor Brandstetter
MFC after:		1 week
2020-02-11 18:15:57 +00:00
Michael Tuexen
95d27478d2 Use an int instead of a bool variable, since bool is not supported
on all platforms the stack is running on in userland.
2020-02-11 14:00:27 +00:00
Michael Tuexen
6a34ec63ab Stop the PMTU and HB timer when removing a net, not when freeing it.
Submitted by:		Taylor Brandstetter
MFC after:		1 week
2020-02-09 22:40:05 +00:00
Michael Tuexen
5555400aa5 Cleanup timer handling.
Submitted by:	Taylor Brandstetter
MFC after:	1 week
2020-02-09 22:05:41 +00:00
Ed Maste
5aa0576b33 Miscellaneous typo fixes
Submitted by:	Gordon Bergling <gbergling_gmail.com>
Differential Revision:	https://reviews.freebsd.org/D23453
2020-02-07 19:53:07 +00:00
Michael Tuexen
f799ff82fb Remove unused timer.
Submitted by:		Taylor Brandstetter
2020-02-04 14:01:07 +00:00
Michael Tuexen
bbf9f080e9 Improve numbering of debug information.
Submitted by:		Taylor Brandstetter
MFC after:		1 week
2020-02-04 12:34:16 +00:00
Conrad Meyer
8e6b06be14 netinet/libalias: Fix typo in debug message
No functional change.

PR:		243831
Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D23365
2020-02-03 05:19:44 +00:00
Gleb Smirnoff
42ce79378d Fix missing NET_EPOCH_ENTER() when compiled with TCP_OFFLOAD.
Reported by:	Coverity
CID:		1413162
2020-01-29 22:48:18 +00:00
Michael Tuexen
dc13edbc7d Fix build issues for the userland stack on 32-bit platforms.
Reported by:		Felix Weinrank
MFC after:		1 week
2020-01-28 10:09:05 +00:00
Alexander V. Chernikov
75831a1c95 Fix NOINET6 build after r357038.
Reported by:	AN <andy at neu.net>
2020-01-26 11:54:21 +00:00
Michael Tuexen
9cc711c9ff Sending CWR after an RTO is according to RFC 3168 generally required
and not only for the DCTCP congestion control.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes, tuexen@, Cheng Cui
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23119
2020-01-25 13:45:10 +00:00
Michael Tuexen
47e2c17c12 Don't set the ECT codepoint on retransmitted packets during SACK loss
recovery. This is required by RFC 3168.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@, Cheng Cui
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23118
2020-01-25 13:34:29 +00:00
Michael Tuexen
a2d59694be As a TCP client only enable ECN when the corresponding sysctl variable
indicates that ECN should be negotiated for the client side.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23228
2020-01-25 13:11:14 +00:00
Michael Tuexen
ee97681e5c Don't delay the ACK for a TCP segment with the CWR flag set.
This allows the data sender to increase the CWND faster.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@, Cheng Cui
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D22670
2020-01-24 22:50:23 +00:00
Michael Tuexen
8f63a52bdb The server side of TCP fast open relies on the delayed ACK timer to allow
including user data in the SYN-ACK. When DSACK support was added in
r347382, an immediate ACK was sent even for the received SYN with
user data. This patch fixes that and allows again to send user data with
the SYN-ACK.

Reported by:		Jeremy Harris
Reviewed by:		Richard Scheffenegger, rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23212
2020-01-24 22:37:53 +00:00
Gleb Smirnoff
e1d2b46953 Enter the network epoch when rack_output() is called in setsockopt(2). 2020-01-24 21:56:10 +00:00
Alexander V. Chernikov
75b893375f Add support for RFC 6598/Carrier Grade NAT subnets. to libalias and ipfw.
In libalias, a new flag PKT_ALIAS_UNREGISTERED_RFC6598 is added.
 This is like PKT_ALIAS_UNREGISTERED_ONLY, but also is RFC 6598 aware.
Also, we add a new NAT option to ipfw called unreg_cgn, which is like
 unreg_only, but also is RFC 6598-aware.  The reason for the new
 flags/options is to avoid breaking existing networks, especially those
 which rely on RFC 6598 as an external address.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22877
2020-01-24 20:35:41 +00:00
Alexander V. Chernikov
ab15488f12 Bring indentation back to normal after r357038.
No functional changes.

MFC after:	3 weeks
2020-01-23 09:46:45 +00:00
Alexander V. Chernikov
5533ec4806 Fix epoch-related panic in ipdivert, ensuring in_broadcast() is called
within epoch.

Simplify gigantic div_output() by splitting it into 3 functions,
 handling preliminary setup, remote "ip[6]_output" case and
 local "netisr" case. Leave original indenting in most parts to ease
 diff comparison.  Indentation will be fixed by a followup commit.

Reported by:	Nick Hibma <nick at van-laarhoven.org>
Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D23317
2020-01-23 09:14:28 +00:00
Gleb Smirnoff
a3b0db5b0a Plug possible calls into ip6?_output() without network epoch from SCTP
bluntly adding epoch entrance into the macro that SCTP uses to call
ip6?_output().  This definitely will introduce several epoch recursions.

Reported by:	https://syzkaller.appspot.com/bug?id=79f03f574594a5be464997310896765c458ed80a
Reported by:	https://syzkaller.appspot.com/bug?id=07c6f52106cddbe356cc2b2f3664a1c51cc0dadf
2020-01-22 17:19:53 +00:00
Bjoern A. Zeeb
7754e281c0 Fix NOINET kernels after r356983.
All gotos to the label are within the #ifdef INET section, which leaves
us with an unused label.  Cover the label under #ifdef INET as well to
avoid the warning and compile time error.
2020-01-22 15:06:59 +00:00
Alexander V. Chernikov
34a5582c47 Bring back redirect route expiration.
Redirect (and temporal) route expiration was broken a while ago.
This change brings route expiration back, with unified IPv4/IPv6 handling code.

It introduces net.inet.icmp.redirtimeout sysctl, allowing to set
 an expiration time for redirected routes. It defaults to 10 minutes,
 analogues with net.inet6.icmp6.redirtimeout.

Implementation uses separate file, route_temporal.c, as route.c is already
 bloated with tons of different functions.
Internally, expiration is implemented as an per-rnh callout scheduled when
 route with non-zero rt_expire time is added or rt_expire is changed.
 It does not add any overhead when no temporal routes are present.

Callout traverses entire routing tree under wlock, scheduling expired routes
 for deletion and calculating the next time it needs to be run. The rationale
 for such implemention is the following: typically workloads requiring large
 amount of routes have redirects turned off already, while the systems with
 small amount of routes will not inhibit large overhead during tree traversal.

This changes also fixes netstat -rn display of route expiration time, which
 has been broken since the conversion from kread() to sysctl.

Reviewed by:	bz
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D23075
2020-01-22 13:53:18 +00:00
Gleb Smirnoff
c1604fe4d2 Make in_pcbladdr() require network epoch entered by its callers. Together
with this widen network epoch coverage up to tcp_connect() and udp_connect().

Revisions from r356974 and up to this revision cover D23187.

Differential Revision:	https://reviews.freebsd.org/D23187
2020-01-22 06:10:41 +00:00
Gleb Smirnoff
e2636f0a78 Remove extraneous NET_EPOCH_ASSERT - the full function is covered. 2020-01-22 06:07:27 +00:00
Gleb Smirnoff
3fed74e90f Re-absorb tcp_detach() back into tcp_usr_detach() as the comment suggests.
Not a functional change.
2020-01-22 06:06:27 +00:00
Gleb Smirnoff
5fc8df3c49 Don't enter network epoch in tcp_usr_detach. A PCB removal doesn't
require that.
2020-01-22 06:04:56 +00:00
Gleb Smirnoff
5c722e2ad3 The network epoch changes in the TCP stack combined with old r286227,
actually make removal of a PCB not needing ipi_lock in any form.  The
ipi_list_lock is sufficient.
2020-01-22 06:03:45 +00:00
Gleb Smirnoff
7669c586da tcp_usr_attach() doesn't need network epoch. in_pcbfree() and
in_pcbdetach() perform all necessary synchronization themselves.
2020-01-22 06:01:26 +00:00
Gleb Smirnoff
6a2954a17d Relax locking requirements for in_pcballoc(). All pcbinfo fields
modified by this function are protected by the PCB list lock that is
acquired inside the function.

This could have been done even before epoch changes, after r286227.
2020-01-22 05:58:29 +00:00
Gleb Smirnoff
0f6385e705 Inline tcp_attach() into tcp_usr_attach(). Not a functional change. 2020-01-22 05:54:58 +00:00
Gleb Smirnoff
109eb549e1 Make tcp_output() require network epoch.
Enter the epoch before calling into tcp_output() from those
functions, that didn't do that before.

This eliminates a bunch of epoch recursions in TCP.
2020-01-22 05:53:16 +00:00
Gleb Smirnoff
b955545386 Make ip6_output() and ip_output() require network epoch.
All callers that before may called into these functions
without network epoch now must enter it.
2020-01-22 05:51:22 +00:00
Gleb Smirnoff
0452a1f3ef Add documenting NET_EPOCH_ASSERT() to tcp_drop(). 2020-01-22 02:38:46 +00:00
Gleb Smirnoff
bab98355f9 Add some documenting NET_EPOCH_ASSERTs. 2020-01-22 02:37:47 +00:00
Michael Tuexen
6745815d25 Remove debug code not needed anymore.
Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23208
2020-01-16 17:15:06 +00:00
Gleb Smirnoff
ed0282f46a A miss from r356754. 2020-01-15 06:12:39 +00:00
Gleb Smirnoff
2a4bd982d0 Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch.   The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC
2020-01-15 06:05:20 +00:00
Gleb Smirnoff
b1328235b4 Use official macro to enter/exit the network epoch. NFC 2020-01-15 05:48:36 +00:00
Gleb Smirnoff
97168be809 Mechanically substitute assertion of in_epoch(net_epoch_preempt) to
NET_EPOCH_ASSERT(). NFC
2020-01-15 05:45:27 +00:00
Gleb Smirnoff
fae994f636 Stop header pollution and don't include if_var.h via in_pcb.h. 2020-01-15 03:41:15 +00:00
Gleb Smirnoff
8fd73e9160 Since this code dereferences struct ifnet, it must include if_var.h
explicitly, not via header pollution.  While here move TCPSTATES
declaration right above the include that is going to make use of it.
2020-01-15 03:40:32 +00:00
Gleb Smirnoff
9cdc43b16e The non-preemptible network epoch identified by net_epoch isn't used.
This code definitely meant net_epoch_preempt.
2020-01-15 03:30:33 +00:00
Gleb Smirnoff
4c69f60a8e Fix yet another regression from r354484. Error code from cr_cansee()
aliases with hard error from other operations.

Reported by:	flo
2020-01-13 21:12:10 +00:00
Michael Tuexen
fe1274ee39 Fix race when accepting TCP connections.
When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
   table.
2) The remote address was filled in and the inp was relocated in the
   hash table.
Before the epoch changes, a write lock was held when this happens and
the code looking up entries was holding a corresponding read lock.
Since the read lock is gone away after the introduction of the
epochs, the half populated inp was found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure in a way that the inp is fully
populated before inserted into the hash table.

Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22971
2020-01-12 17:52:32 +00:00
Michael Tuexen
fc0eb7637c Fix division by zero issue.
Thanks to Stas Denisov for reporting the issue for the userland stack
and providing a fix.

MFC after:		3 days
2020-01-12 15:45:27 +00:00
Mateusz Guzik
879e0604ee Add KERNEL_PANICKED macro for use in place of direct panicstr tests 2020-01-12 06:07:54 +00:00
Alexander V. Chernikov
ead85fe415 Add fibnum, family and vnet pointer to each rib head.
Having metadata such as fibnum or vnet in the struct rib_head
 is handy as it eases building functionality in the routing space.
This change is required to properly bring back route redirect support.

Reviewed by:	bz
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D23047
2020-01-09 17:21:00 +00:00
Bjoern A. Zeeb
334fc5822b vnet: virtualise more network stack sysctls.
Virtualise tcp_always_keepalive, TCP and UDP log_in_vain.  All three are
set in the netoptions startup script, which we would love to run for VNETs
as well [1].

While virtualising the log_in_vain sysctls seems pointles at first for as
long as the kernel message buffer is not virtualised, it at least allows
an administrator to debug the base system or an individual jail if needed
without turning the logging on for all jails running on a system.

PR:		243193 [1]
MFC after:	2 weeks
2020-01-08 23:30:26 +00:00
Ed Maste
ee92463aca Do not define TCPOUTFLAGS in rack_bbr_common
tcp_outflags isn't used in this source file and compilation failed with
external GCC on sparc64.  I'm not sure why only that case failed (perhaps
inconsistent -Werror config) but it is a legitimate issue to fix.

Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D23068
2020-01-07 17:57:08 +00:00
Randall Stewart
4ad2473790 This catches rack up in the recent changes to ECN and
also commonizes the functions that both the freebsd and
rack stack uses.

Sponsored by:Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D23052
2020-01-06 15:29:14 +00:00
Randall Stewart
a9a08eced6 This change adds a small feature to the tcp logging code. Basically
a connection can now have a separate tag added to the id.

Obtained from:	Lawrence Stewart
Sponsored by:	Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D22866
2020-01-06 12:48:06 +00:00
Michael Tuexen
97a8ab398e Don't make the sendall iterator as being up if it could not be started.
MFC after:		1 week
2020-01-05 14:08:01 +00:00
Michael Tuexen
4b66d476b3 Return -1 consistently if an error occurs.
MFC after:	1 week
2020-01-05 14:06:40 +00:00
Michael Tuexen
397b1c945f Ensure that we don't miss a trigger for kicking off the SCTP iterator.
Reported by:		nwhitehorn@
MFC after:		1 week
2020-01-05 13:56:32 +00:00
Michael Tuexen
ae7cc6c9f8 Make the message size limit used for SCTP_SENDALL configurable via
a sysctl variable instead of a compiled in constant.

This is based on a patch provided by nwhitehorn@.
2020-01-04 20:33:12 +00:00
Mark Johnston
31069f383a Take the ifnet's address lock in igmp_v3_cancel_link_timers().
inm_rele_locked() may remove the multicast address associated with inm.

Reported by:	syzbot+871c5d1fd5fac6c28f52@syzkaller.appspotmail.com
Reviewed by:	hselasky
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23009
2020-01-03 17:03:10 +00:00
Michael Tuexen
0eeb0d180e Remove empty line which was added in r356270 by accident.
MFC after:		1 week
2020-01-02 14:04:16 +00:00
Michael Tuexen
ac1d75d23a Improve input validation of the spp_pathmtu field in the
SCTP_PEER_ADDR_PARAMS socket option. The code in the stack assumes
sane values for the MTU.

This issue was found by running an instance of syzkaller.

MFC after:		1 week
2020-01-02 13:55:10 +00:00
Gleb Smirnoff
8d5c56dab1 In r343631 error code for a packet blocked by a firewall was
changed from EACCES to EPERM.  This change was not intentional,
so fix that.  Return EACCESS if a firewall forbids sending.

Noticed by:	ae
2020-01-01 17:31:43 +00:00
Alexander Motin
8c3fbf3c20 Relax locking of carp_forus().
This fixes deadlock between CARP and bridge.  Bridge calls this function
taking CARP lock while holding bridge lock.  Same time CARP tries to send
its announcements via the bridge while holding CARP lock.

Use of CARP_LOCK() here does not solve anything, since sc_addr is constant
while race on sc_state is harmless and use of the lock does not close it.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-12-31 18:58:29 +00:00
Michael Tuexen
e11c9783e1 Fix delayed ACK generation for DCTCP.
Submitted by:		Richard Scheffenegger
Reviewed by:		chengc@netapp.com, rgrimes@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22644
2019-12-31 16:15:47 +00:00
Michael Tuexen
493c98c6d2 Add flags for upcoming patches related to improved ECN handling.
No functional change.
Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22429
2019-12-31 14:32:48 +00:00
Michael Tuexen
83a2839fb9 Clear the flag indicating that the last received packet was marked CE also
in the case where a packet not marked was received.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D19143
2019-12-31 14:23:52 +00:00
Michael Tuexen
7d87664a04 Add curly braces missed in https://svnweb.freebsd.org/changeset/base/354773
Sponsored by:		Netflix, Inc.
CID:			1407649
2019-12-31 12:29:01 +00:00
Michael Tuexen
6088175a18 Improve input validation for some parameters having a too small
reported length.

Thanks to Natalie Silvanovich from Google for finding one of these
issues in the SCTP userland stack and reporting it.

MFC after:		1 week
2019-12-20 15:25:08 +00:00
Alexander V. Chernikov
bdb214a4a4 Remove useless code from in6_rmx.c
The code in questions walks IPv6 tree every 60 seconds and looks into
 the routes with non-zero expiration time (typically, redirected routes).
For each such route it sets RTF_PROBEMTU flag at the expiration time.
No other part of the kernel checks for RTF_PROBEMTU flag.

RTF_PROBEMTU was defined 21 years ago, 30 Jun 1999, as RTF_PROTO1.
RTF_PROTO1 is a de-facto standard indication of a route installed
 by a routing daemon for a last decade.

Reviewed by:	bz, ae
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22865
2019-12-18 22:10:56 +00:00
Hans Petter Selasky
a4c5668d12 Leave multicast group before reaping and committing state for both
IPv4 and IPv6.

This fixes a regression issue after r349369. When trying to exit a
multicast group before closing the socket, a multicast leave packet
should be sent.

Differential Revision:	https://reviews.freebsd.org/D22848
PR: 242677
Reviewed by:	bz (network)
Tested by:	Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-12-18 12:06:34 +00:00
Randall Stewart
1cf55767b8 This commit is a bit of a re-arrange of deck chairs. It
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now if you don't have both NF_stats and
stats on it disables them. As soon as the rest of the stats framework
lands we can remove that restriction and then just uses stats when
defined.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D22479
2019-12-17 16:08:07 +00:00
John Baldwin
5773ac113c Use callout_func_t instead of the deprecated timeout_t.
Reviewed by:	kib, imp
Differential Revision:	https://reviews.freebsd.org/D22752
2019-12-10 22:06:53 +00:00
Bjoern A. Zeeb
646e34881c Remove the extra epoch tracker change sneaked into r355449 and was not part
of the originally reviewed or described change.

Pointyhat to:   bz
Reported by:    glebius
2019-12-06 22:20:26 +00:00
Bjoern A. Zeeb
dad68fc301 carp: replace caddr_t with char *
Change the remaining caddr_t usages to char * following the removal
of the KAME macros

No functional change.

Requested by:	glebius
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Netflix (originally)
Differential Revision:	https://reviews.freebsd.org/D22399
2019-12-06 16:35:48 +00:00
Gleb Smirnoff
e5a084d020 Fix regression from r354484. Don't leak pcb lock if cr_canseeinpcb()
returns non-zero.

PR:		242415
2019-12-04 22:41:52 +00:00
Bjoern A. Zeeb
0700d2c3f0 Make icmp6_reflect() static.
icmp6_reflect() is not used anywhere outside icmp6.c, no reason to
export it.

Sponsored by:	Netflix
2019-12-03 14:46:38 +00:00
Hans Petter Selasky
5b64b824b9 Use refcount from "in_joingroup_locked()" when joining multicast
groups. Do not acquire additional references. This makes the IPv4 IGMP
code in line with the IPv6 MLD code.

Background:
The IPv4 multicast code puts an extra reference on the in_multi struct
when joining groups.  This becomes visible when using daemons like
igmpproxy from ports, that multicast entries do not disappear from the
output of ifmcstat(8) when multicast streams are disconnected.

This fixes a regression issue after r349762.

While at it factor the ip_mfilter_insert() and ip6_mfilter_insert() calls
to avoid repeated "is_new" check.

Differential Revision:	https://reviews.freebsd.org/D22595
Tested by:	Guido van Rooij <guido@gvr.org>
Reviewed by:	rgrimes (network)
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-12-03 08:46:59 +00:00
Edward Tomasz Napierala
adc56f5a38 Make use of the stats(3) framework in the TCP stack.
This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time.  stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by:	thj
Tested by:	thj
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20655
2019-12-02 20:58:04 +00:00
Michael Tuexen
3cf38784e2 Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by:		Richard Scheffenegger
Reviewed by:		rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22497
2019-12-01 21:01:33 +00:00
Michael Tuexen
77aabfd94f Make the TF_* flags easier readable by humans by adding leading zeroes
to make them aligned.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22428
2019-12-01 20:45:48 +00:00
Michael Tuexen
b72e56e758 This is an initial step in implementing the new congestion window
validation as specified in RFC 7661.

Submitted by:		Richard Scheffenegger
Reviewed by:		rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D21798
2019-12-01 20:35:41 +00:00
Michael Tuexen
8df12ffcc2 Make the IPTOS value available to all substate handlers. This will allow
to add support for L4S or SCE, which require processing of the IP TOS
field.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22426
2019-12-01 18:47:53 +00:00
Michael Tuexen
fa49a96419 In order for the TCP Handshake to support ECN++, and further ECN-related
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22436
2019-12-01 18:05:02 +00:00
Michael Tuexen
669a285ffb When changing the MTU of an SCTP path, not only cancel all ongoing
RTT measurements, but also scheldule new ones for the future.

Submitted by:		Julius Flohr
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D22547
2019-12-01 17:35:36 +00:00
Bjoern A. Zeeb
a4adf6cc65 Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros.
r354748-354750 replaced the KAME macros with m_pulldown() calls.
Contrary to the rest of the network stack m_len checks before m_pulldown()
were not put in placed (see r354748).
Put these m_len checks in place for now (to go along with the style of the
network stack since the initial commits).  These are not put in for
performance but to avoid an error scenario (even though it also will help
performance at the moment as it avoid allocating an extra mbuf; not because
of the unconditional function call).

The observed error case went like this:
(1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it.
(2) m_pullup() will call m_get() unless the requested length is larger than
MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the
requested length of data and pkthdr into the new mbuf.
(3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail.
This was observed with failing auto-configuration as an RA packet of
200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input()
dropped the mbuf.
(Re-)adding the m_len checks before m_pullup() calls avoids this problems
with mbufs using external storage for now.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-12-01 00:22:04 +00:00
Michael Tuexen
f727fee546 Really ignore the SCTP association identifier on 1-to-1 style sockets
as requiresd by the socket API specification.
Thanks to Inaki Baz Castillo, who found this bug running the userland
stack with valgrind and reported the issue in
https://github.com/sctplab/usrsctp/issues/408

MFC after:		1 week
2019-11-28 12:50:25 +00:00
Michael Tuexen
645f3a1cd1 Plug two mbuf leaks during INIT-ACK handling.
One leak happens when there is not enough memory to allocate the
the resources for streams. The other leak happens if the are
unknown parameters in the received INIT-ACK chunk which require
reporting and the INIT-ACK requires sending an ABORT due to illegal
parameter combinations.
Hopefully this fixes
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19083

MFC after:		1 week
2019-11-27 19:32:29 +00:00
Ryan Libby
200f3ac6f7 in_mcast.c: need if_addr_lock around inm_release_deferred
Apply a similar fix as for in6_mcast.c.

Reviewed by:	hselasky
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20740
2019-11-25 22:25:34 +00:00
Bjoern A. Zeeb
1a117215c7 Reduce the vnet_set module size of ip_mroute to allow loading as a module.
With VIMAGE kernels modules get special treatment as they need
to also keep the original values and make copies for each instance.
For that a few pages of vnet modspace are provided and the
kernel-linker and the VNET framework know how to deal with things.
When the modspace is (almost) full, other modules which would
overflow the modspace cannot be loaded and kldload will fail.

ip_mroute uses a lot of variable space, mostly be four big arrays:
set_vnet 0000000000000510 vnet_entry_multicast_register_if
set_vnet 0000000000000700 vnet_entry_viftable
set_vnet 0000000000002000 vnet_entry_bw_meter_timers
set_vnet 0000000000002800 vnet_entry_bw_upcalls

Dynamically malloc the three big ones for each instance we need
and free them again on vnet teardown (the 4th is an ifnet).
That way they only need module space for a single pointer and
allow a lot more modules using virtualized variables to be loaded
on a VNET kernel.

PR:		206583
Reviewed by:	hselasky, kp
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D22443
2019-11-19 15:38:55 +00:00
Michael Tuexen
c968c769af Add boundary and overflow checks to the formulas used in the TCP CUBIC
congestion control module.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@
Differential Revision:	https://reviews.freebsd.org/D19118
2019-11-16 12:00:22 +00:00
Michael Tuexen
b0c1a13e4e Improve TCP CUBIC specific after idle reaction.
The adjustments are inspired by the Linux stack, which has had a
functionally equivalent implementation for more than a decade now.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18982
2019-11-16 11:57:12 +00:00
Michael Tuexen
35cd141b4b Implement a tCP CUBIC-specific after idle reaction.
This patch addresses a very common case of frequent application stalls,
where TCP runs idle and looses the state of the network.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18954
2019-11-16 11:37:26 +00:00
Michael Tuexen
453e633384 Revert https://svnweb.freebsd.org/changeset/base/354708
I used the wrong Differential Revision, so back it out and do it right
in a follow-up commit.
2019-11-16 11:10:09 +00:00
Bjoern A. Zeeb
b141dd5ddf Remove now unused IPv6 macros and update docs.
After r354748-354750 all uses of the IP6_EXTHDR_CHECK() and
IP6_EXTHDR_GET() macros are gone from the kernel.  IP6_EXTHDR_GET0()
was unused.  Remove the macros and update the documentation.

Sponsored by:	Netflix
2019-11-15 21:55:41 +00:00
Bjoern A. Zeeb
4e619b17c5 IP6_EXTHDR_CHECK(): remove the last instances
While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these
are not part of the PULLDOWN_TESTS.
Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove
the extra check and m_pullup() in tcp_input() under isipv6 given
tcp6_input() has done exactly that pullup already.

MFC after:	8 weeks
Sponsored by:	Netflix
2019-11-15 21:51:43 +00:00
Bjoern A. Zeeb
63abacc204 netinet*: replace IP6_EXTHDR_GET()
In a few places we have IP6_EXTHDR_GET() left in upper layer protocols.
The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data
fragment is not contiguous.

Convert these last remaining instances into m_pullup()s instead.
In CARP, for example, we will a few lines later call m_pullup() anyway,
the IPsec code coming from OpenBSD would otherwise have done the m_pullup()
and are copying the data a bit later anyway, so pulling it in seems no
better or worse.

Note: this leaves very few m_pulldown() cases behind in the tree and we
might want to consider removing them as well to make mbuf management
easier again on a path to variable size mbufs, especially given
m_pulldown() still has an issue not re-checking M_WRITEABLE().

Reviewed by:	gallatin
MFC after:	8 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22335
2019-11-15 21:44:17 +00:00
Michael Tuexen
730cbbc10d For idle TCP sessions using the CUBIC congestio control, reset ssthresh
to the higher of the previous ssthresh or 3/4 of the prior cwnd.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18982
2019-11-14 16:28:02 +00:00
Bjoern A. Zeeb
a8fe77d877 netinet*: update *mp to pass the proper value back
In ip6_[direct_]input() we are looping over the extension headers
to deal with the next header.  We pass a pointer to an mbuf pointer
to the handling functions.  In certain cases the mbuf can be updated
there and we need to pass the new one back.  That missing in
dest6_input() and route6_input().  In tcp6_input() we should also
update it before we call tcp_input().

In addition to that mark the mbuf NULL all the times when we return
that we are done with handling the packet and no next header should
be checked (IPPROTO_DONE).  This will eventually allow us to assert
proper behaviour and catch the above kind of errors more easily,
expecting *mp to always be set.

This change is extracted from a larger patch and not an exhaustive
change across the entire stack yet.

PR:			240135
Reported by:		prabhakar.lakhera gmail.com
MFC after:		3 weeks
Sponsored by:		Netflix
2019-11-12 15:46:28 +00:00
Gleb Smirnoff
273b2e4c55 Remove now unused INP_INFO_RLOCK macros. 2019-11-07 22:26:54 +00:00
Gleb Smirnoff
43e8b279b8 In TCP HPTS enter the epoch in tcp_hpts_thread() and assert it in
the leaf functions.
2019-11-07 21:30:27 +00:00
Gleb Smirnoff
aed553598d Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP
timewait manipulation leaf functions.
2019-11-07 21:29:38 +00:00
Gleb Smirnoff
a81e3ecf46 Since pfslowtimo() runs in the network epoch, tcp_slowtimo()
also does.  This allows to simplify tcp_tw_2msl_scan() and
always require the network epoch in it.
2019-11-07 21:28:46 +00:00
Gleb Smirnoff
032677ceb5 Now that there is no R/W lock on PCB list the pcblist sysctls
handlers can be greatly simplified.  All the previous double
cycling and complex locking was added to avoid these functions
holding global PCB locks for extended period of time, preventing
addition of new entries.
2019-11-07 21:27:32 +00:00
Gleb Smirnoff
d40c0d47cd Now that all of the tcp_input() and all its branches are executed
in the network epoch, we can greatly simplify synchronization.
Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro.
Remove some unneccesary assertions and convert necessary ones into the
NET_EPOCH_ASSERT macro.
2019-11-07 21:23:07 +00:00
Gleb Smirnoff
80577e5583 Remove unnecessary recursive epoch enter via INP_INFO_RLOCK
macro in udp_input().  It shall always run in the network epoch.
2019-11-07 21:08:49 +00:00
Gleb Smirnoff
5015a05f0b Remove now unused INP_HASH_RLOCK() macros. 2019-11-07 21:03:15 +00:00
Gleb Smirnoff
2435e507de Now with epoch synchronized PCB lookup tables we can greatly simplify
locking in udp_output() and udp6_output().

First, we select if we need read or write lock in PCB itself, we take
the lock and enter network epoch.  Then, we proceed for the rest of
the function.  In case if we need to modify PCB hash, we would take
write lock on it for a short piece of code.

We could exit the epoch before allocating an mbuf, but with this
patch we are keeping it all the way into ip_output()/ip6_output().
Today this creates an epoch recursion, since ip_output() enters epoch
itself.  However, once all protocols are reviewed, ip_output() and
ip6_output() would require epoch instead of entering it.

Note: I'm not 100% sure that in udp6_output() the epoch is required.
We don't do PCB hash lookup for a bound socket.  And all branches of
in6_select_src() don't require epoch, at least they lack assertions.
Today inet6 address list is protected by rmlock, although it is CKLIST.
AFAIU, the future plan is to protect it by network epoch.  That would
require epoch in in6_select_src().  Anyway, in future ip6_output()
would require epoch, udp6_output() would need to enter it.
2019-11-07 21:01:36 +00:00
Gleb Smirnoff
5a1264335d Add INP_UNLOCK() which will do whatever R/W unlock is required. 2019-11-07 20:57:51 +00:00
Gleb Smirnoff
d797164a86 Since r353292 on input path we are always in network epoch, when
we lookup PCBs.  Thus, do not enter epoch recursively in
in_pcblookup_hash() and in6_pcblookup_hash().  Same applies to
tcp_ctlinput() and tcp6_ctlinput().

This leaves several sysctl(9) handlers that return PCB credentials
unprotected.  Add epoch enter/exit to all of them.

Differential Revision:	https://reviews.freebsd.org/D22197
2019-11-07 20:49:56 +00:00
Gleb Smirnoff
de537d63c2 Remove unnecessary recursive epoch enter via INP_INFO_RLOCK
macro in divert_packet().  This function is called only from
pfil(9) filters, which in their place always run in the
network epoch.
2019-11-07 20:44:34 +00:00
Gleb Smirnoff
f42347c39a Remove unnecessary recursive epoch enter via INP_INFO_RLOCK
macro in raw input functions for IPv4 and IPv6.  They shall
always run in the network epoch.
2019-11-07 20:40:44 +00:00
Bjoern A. Zeeb
503f4e4736 netinet*: variable cleanup
In preparation for another change factor out various variable cleanups.
These mainly include:
(1) do not assign values to variables during declaration:  this makes
    the code more readable and does allow for better grouping of
    variable declarations,
(2) do not assign values to variables before need; e.g., if a variable
    is only used in the 2nd half of a function and we have multiple
    return paths before that, then do not set it before it is needed, and
(3) try to avoid assigning the same value multiple times.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-07 18:29:51 +00:00
Gleb Smirnoff
58d94bd0d9 TCP timers are executed in callout context, so they need to enter network
epoch to look into PCB lists.  Mechanically convert INP_INFO_RLOCK() to
NET_EPOCH_ENTER().  No functional change here.
2019-11-07 00:27:23 +00:00
Gleb Smirnoff
97a95ee134 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in
TCP functions that are executed in syscall context.  No
functional change here.
2019-11-07 00:10:14 +00:00
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
Bjoern A. Zeeb
6e6b5143f5 Properly set VNET when nuking recvif from fragment queues.
In theory the eventhandler invoke should be in the same VNET as
the the current interface. We however cannot guarantee that for
all cases in the future.

So before checking if the fragmentation handling for this VNET
is active, switch the VNET to the VNET of the interface to always
get the one we want.

Reviewed by:	hselasky
MFC after:	3 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22153
2019-10-25 18:54:06 +00:00
Michael Tuexen
4a91aa8fc9 Ensure that the flags indicating IPv4/IPv6 are not changed by failing
bind() calls. This would lead to inconsistent state resulting in a panic.
A fix for stable/11 was committed in
https://svnweb.freebsd.org/base?view=revision&revision=338986
An accelerated MFC is planned as discussed with emaste@.

Reported by:		syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com
Obtained from:		jtl@
MFC after:		1 day
Sponsored by:		Netflix, Inc.
2019-10-24 20:05:10 +00:00
Michael Tuexen
9f36ec8bba Store a handle for the event handler. This will be used when unloading the
SCTP as a module.

Obtained from:		markj@
2019-10-24 09:22:23 +00:00
Randall Stewart
9992c365b6 Fix a small bug in bbr when running under a VM. Basically what
happens is we are more delayed in the pacer calling in so
we remove the stack from the pacer and recalculate how
much time is left after all data has been acknowledged. However
the comparision was backwards so we end up with a negative
value in the last_pacing_delay time which causes us to
add in a huge value to the next pacing time thus stalling
the connection.

Reported by:	vm2.finance@gmail.com
2019-10-24 05:54:30 +00:00
Michael Tuexen
4ad8cb6813 Fix compile issues when building a kernel without the VIMAGE option.
Thanks to cem@ for discussing the issue which resulted in this patch.

Reviewed by:		cem@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22089
2019-10-19 20:48:53 +00:00
Conrad Meyer
e9c6962599 debugnet(4): Add optional full-duplex mode
It remains unattached to any client protocol.  Netdump is unaffected
(remaining half-duplex).  The intended consumer is NetGDB.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed with:	markj
Differential Revision:	https://reviews.freebsd.org/D21541
2019-10-17 20:25:15 +00:00
Conrad Meyer
fde2cf65ce debugnet(4): Infer non-server connection parameters
Loosen requirements for connecting to debugnet-type servers.  Only require a
destination address; the rest can theoretically be inferred from the routing
table.

Relax corresponding constraints in netdump(4) and move ifp validation to
debugnet connection time.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D21482
2019-10-17 20:10:32 +00:00
Conrad Meyer
8270d35eca Add ddb(4) 'netdump' command to netdump a core without preconfiguration
Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine
to the generic debugnet code.  The imagined use is both netdump, shown here,
and NetGDB (vaporware).  It uses the ddb(4) lexer, with some new extensions,
to parse out IPv4 addresses.

'Netdump' uses the generic debugnet routine to load a configuration and
start a dump, without any netdump configuration prior to panic.

Loosely derived from work by:	John Reimer <john.reimer AT emc.com>
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D21460
2019-10-17 19:49:20 +00:00
Gleb Smirnoff
c0ebee428e Quickly fix up r353683: enter the epoch before calling into netisr_dispatch(). 2019-10-17 17:02:50 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Gleb Smirnoff
756368b68b igmp_v1v2_queue_report() doesn't require epoch. 2019-10-17 16:02:34 +00:00
Hans Petter Selasky
a55383e720 Fix panic in network stack due to use after free when receiving
partial fragmented packets before a network interface is detached.

When sending IPv4 or IPv6 fragmented packets and a fragment is lost
before the network device is freed, the mbuf making up the fragment
will remain in the temporary hashed fragment list and cause a panic
when it times out due to accessing a freed network interface
structure.


1) Make sure the m_pkthdr.rcvif always points to a valid network
interface. Else the rcvif field should be set to NULL.

2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for
the fully defragged packet, instead of the first received fragment.

Panic backtrace for IPv6:

panic()
icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx
icmp6_error()
frag6_freef()
frag6_slowtimo()
pfslowtimo()
softclock_call_cc()
softclock()
ithread_loop()

Reviewed by:	bz
Differential Revision:	https://reviews.freebsd.org/D19622
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-10-16 09:11:49 +00:00
Michael Tuexen
776cd558f0 Separate out SCTP related dtrace code.
This is based on work done by markj@.

Discussed with:		markj@
MFC after:		3 days
2019-10-14 20:32:11 +00:00
Randall Stewart
8ee1cf039e if_hw_tsomaxsegsize needs to be initialized to zero, just
like in bbr.c and tcp_output.c
2019-10-14 13:10:29 +00:00
Michael Tuexen
fcfd8ad537 Rename sctp_dtrace_declare.h to sctp_kdtrace.h for consistentcy.
MFC after:		3 days
2019-10-14 13:02:49 +00:00
Gleb Smirnoff
5df91cbe02 Revert r353313. It is not needed with r353357 and is actually incorrect. 2019-10-14 04:10:00 +00:00
Michael Tuexen
d6e23cf0cf Use an event handler to notify the SCTP about IP address changes
instead of calling an SCTP specific function from the IP code.
This is a requirement of supporting SCTP as a kernel loadable module.
This patch was developed by markj@, I tweaked a bit the SCTP related
code.

Submitted by:		markj@
MFC after:		3 days
2019-10-13 18:17:08 +00:00
Mark Johnston
671d68fad9 Move SCTP DTrace probe definitions into a .c file.
Previously they were defined in a header which was included exactly
once.  Change this to follow the usual practice of putting definitions
in C files.  No functional change intended.

Discussed with:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-13 16:14:04 +00:00
Michael Tuexen
1b6ddd9404 Ensure that local variables are reset to their initial value when
dealing with error cases in a loop over all remote addresses.
This issue was found and reported by OSS_Fuzz in:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18080
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18086
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18121
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18163

MFC after:		3 days
2019-10-12 17:57:03 +00:00
Gleb Smirnoff
1e5db73d07 The divert(4) module must always be running in network epoch, thus
call to if_addr_rlock() isn't needed.
2019-10-10 23:48:42 +00:00
Warner Losh
b23b156e2e Fix casting error from newer gcc
Cast the pointers to (uintptr_t) before assigning to type
uint64_t. This eliminates an error from gcc when we cast the pointer
to a larger integer type.
2019-10-09 21:02:06 +00:00
Hans Petter Selasky
eabddb25a3 Factor out TCP rateset destruction code.
Ensure the epoch_call() function is not called more than one time
before the callback has been executed, by always checking the
RS_FUNERAL_SCHD flag before invoking epoch_call().

The "rs_number_dead" is balanced again after r353353.

Discussed with:	rrs@
Sponsored by:	Mellanox Technologies
2019-10-09 17:08:40 +00:00
Gleb Smirnoff
0732ac0eff Revert most of the multicast changes from r353292. This needs a more
accurate approach.
2019-10-09 17:03:20 +00:00
Hans Petter Selasky
24be13533b Fix locking order reversal in the TCP ratelimit code by moving
destructors outside the rsmtx mutex.

Witness message:
lock order reversal: (sleepable after non-sleepable)
   1st tcp_rs_mtx (rsmtx) @ sys/netinet/tcp_ratelimit.c:242
   2nd sysctl lock (sysctl lock) @ sys/kern/kern_sysctl.c:607

Backtrace:
witness_debugger
witness_checkorder
_rm_wlock_debug
sysctl_ctx_free
rs_destroy
epoch_call_task
gtaskqueue_run_locked
gtaskqueue_thread_loop

Discussed with:	rrs@
Sponsored by:	Mellanox Technologies
2019-10-09 16:48:48 +00:00
John Baldwin
9e14430d46 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
Gleb Smirnoff
e4c40a8a71 Quickly plug another regression from r353292. Again, multicast locking needs
lots of work...

Reported by:	pho
2019-10-08 16:59:17 +00:00
Michael Tuexen
953b78bed9 Validate length before use it, not vice versa.
r353060 should have contained this...
This fixes
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18070
MFC after:		3 days
2019-10-08 11:07:16 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Michael Tuexen
746c7ae563 In r343587 a simple port filter as sysctl tunable was added to siftr.
The new sysctl was not added to the siftr.4 man page at the time.
This updates the man page, and removes one left over trailing whitespace.

Submitted by:		Richard Scheffenegger
Reviewed by:		bcr@
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D21619
2019-10-07 20:35:04 +00:00
Randall Stewart
5b63b22075 Brad Davis identified a problem with the new LRO code, VLAN's
no longer worked. The problem was that the defines used the
same space as the VLAN id. This commit does three things.
1) Move the LRO used fields to the PH_per fields. This is
   safe since the entire PH_per is used for IP reassembly
   which LRO code will not hit.
2) Remove old unused pace fields that are not used in mbuf.h
3) The VLAN processing is not in the mbuf queueing code. Consequently
   if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing
   for now until rack_bbr_common is updated to handle the VLAN properly.

Reported by:	Brad Davis
2019-10-06 22:29:02 +00:00
Michael Tuexen
63fb39ba7b Plumb an mbuf leak in a code path that should not be taken. Also avoid
that this path is taken by setting the tail pointer correctly.
There is still bug related to handling unordered unfragmented messages
which were delayed in deferred handling.
This issue was found by OSS-Fuzz testing the usrsctp stack and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17794

MFC after:		3 days
2019-10-06 08:47:10 +00:00
Michael Tuexen
2560cb1eb8 Fix a use after free bug when removing remote addresses.
This bug was found by OSS-Fuzz and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18004

MFC after:		3 days
2019-10-05 13:28:01 +00:00
Michael Tuexen
0941b9dc37 Plumb an mbuf leak found by Mark Wodrich from Google by fuzz testing the
userland stack and reporting it in:
https://github.com/sctplab/usrsctp/issues/396

MFC after:		3 days
2019-10-05 12:34:50 +00:00
Michael Tuexen
44f788d793 Fix the adding of padding to COOKIE-ECHO chunks.
Thanks to Mark Wodrich who found this issue while fuzz testing the
usrsctp stack and reported the issue in
https://github.com/sctplab/usrsctp/issues/382

MFC after:		3 days
2019-10-05 09:46:11 +00:00
Michael Tuexen
aac50dab6d When skipping the address parameter, take the padding into account.
MFC after:		3 days
2019-10-03 20:47:57 +00:00
Michael Tuexen
5989470c37 Cleanup sctp_asconf_error_response() and ensure that the parameter
is padded as required. This fixes the followig bug reported by
OSS-Fuzz for the usersctp stack:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17790

MFC after:		3 days
2019-10-03 20:39:17 +00:00
Michael Tuexen
967e1a5333 Add missing input validation. This could result in reading from
uninitialized memory.
The issue was found by OSS-Fuzz for usrsctp  and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17780

MFC after:		3 days
2019-10-03 18:36:54 +00:00
Michael Tuexen
2974e263c3 Don't use stack memory which is not initialized.
Thanks to Mark Wodrich for reporting this issue for the userland stack in
https://github.com/sctplab/usrsctp/issues/380
This issue was also found for usrsctp by OSS-fuzz in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17778

MFC after:		3 days
2019-09-30 12:06:57 +00:00
Michael Tuexen
12a43d0d5d RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.

Reviewed by:		jtl@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21665
2019-09-29 10:45:13 +00:00
Michael Tuexen
71e85612a9 Replacing MD5 by SipHash improves the performance of the TCP time stamp
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.

Reviewed by:		gallatin@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21616
2019-09-28 13:13:23 +00:00
Michael Tuexen
79c2a2a07b Ensure that the INP lock is released before leaving [gs]etsockopt()
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by:		rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21825
2019-09-28 13:05:37 +00:00
Jonathan T. Looney
0b18fb0798 Add new functionality to switch to using cookies exclusively when we the
syn cache overflows. Whether this is due to an attack or due to the system
having more legitimate connections than the syn cache can hold, this
situation can quickly impact performance.

To make the system perform better during these periods, the code will now
switch to exclusively using cookies until the syn cache stops overflowing.
In order for this to occur, the system must be configured to use the syn
cache with syn cookie fallback. If syn cookies are completely disabled,
this change should have no functional impact.

When the system is exclusively using syn cookies (either due to
configuration or the overflow detection enabled by this change), the
code will now skip acquiring a lock on the syn cache bucket. Additionally,
the code will now skip lookups in several places (such as when the system
receives a RST in response to a SYN|ACK frame).

Reviewed by:	rrs, gallatin (previous version)
Discussed with:	tuexen
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:18:57 +00:00
Jonathan T. Looney
0bee4d631a Access the syncache secret directly from the V_tcp_syncache variable,
rather than indirectly through the backpointer to the tcp_syncache
structure stored in the hashtable bucket.

This also allows us to remove the requirement in syncookie_generate()
and syncookie_lookup() that the syncache hashtable bucket must be
locked.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:06:46 +00:00
Jonathan T. Looney
867e98f8ee Remove the unused sch parameter to the syncache_respond() function. The
use of this parameter was removed in r313330. This commit now removes
passing this now-unused parameter.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:02:34 +00:00
Randall Stewart
ac7bd23a7a lets put (void) in a couple of functions to keep older platforms that
are stuck with gcc happy (ppc). The changes are needed in both bbr and
rack.

Obtained from:	Michael Tuexen (mtuexen@)
2019-09-24 20:36:43 +00:00
Randall Stewart
da99b33b17 don't call in_ratelmit detach when RATELIMIT is not
compiled in the kernel.
2019-09-24 20:11:55 +00:00
Randall Stewart
2f1cc984db Fix the ifdefs in tcp_ratelimit.h. They were reversed so
that instead of functions only being inside the _KERNEL and
the absence of RATELIMIT causing us to have NULL/error returning
interfaces we ended up with non-kernel getting the error path.
opps..
2019-09-24 20:04:31 +00:00
Randall Stewart
35c7bb3407 This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on  implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D21582
2019-09-24 18:18:11 +00:00
Michael Tuexen
2b861c1538 Plumb a memory leak.
Thnanks to Felix Weinrank for finding this issue using fuzz testing
and reporting it for the userland stack:
https://github.com/sctplab/usrsctp/issues/378

MFC after:		3 days
2019-09-24 13:15:24 +00:00
Michael Tuexen
1325a0de13 Don't hold the info lock when calling sctp_select_a_tag().
This avoids a double lock bug in the NAT colliding state processing
of SCTP. Thanks to Felix Weinrank for finding and reporting this issue in
https://github.com/sctplab/usrsctp/issues/374
He found this bug using fuzz testing.

MFC after:		3 days
2019-09-22 11:11:01 +00:00
Michael Tuexen
44f2a3272e Cleanup the RTO calculation and perform some consistency checks
before computing the RTO.
This should fix an overflow issue reported by Felix Weinrank in
https://github.com/sctplab/usrsctp/issues/375
for the userland stack and found by running a fuzz tester.

MFC after:		3 days
2019-09-22 10:40:15 +00:00
Michael Tuexen
e6b3bd22d8 Fix the handling of invalid parameters in ASCONF chunks.
Thanks to Mark Wodrich from Google for reproting the issue in
https://github.com/sctplab/usrsctp/issues/376
for the userland stack.

MFC after:		3 days
2019-09-20 08:20:20 +00:00
Michael Tuexen
dd3121a895 When the RACK stack computes the space for user data in a TCP segment,
it wasn't taking the IP level options into account. This patch fixes this.
In addition, it also corrects a KASSERT and adds protection code to assure
that the IP header chain and the TCP head fit in the first fragment as
required by RFC 7112.

Reviewed by:		rrs@
MFC after:		3 days
Sponsored by:		Nertflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21666
2019-09-19 10:27:47 +00:00
Michael Tuexen
3c19311544 Only allow a SCTP-AUTH shared key to be updated by the application
if it is not deactivated and not used.
This avoids a use-after-free problem.

Reported by:		da_cheng_shao@yeah.net
MFC after:		3 days
2019-09-17 09:46:42 +00:00
Michael Tuexen
5b66b7f11b Don't write to memory outside of the allocated array for SACK blocks.
Obtained from:		rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-09-16 08:18:05 +00:00
Michael Tuexen
52f6a8090e When the IP layer calls back into the SCTP layer to perform the SCTP
checksum computation, do not assume that the IP header chain and the
SCTP common header are in contiguous memory although the SCTP lays
out the mbuf chains that way. If there are IP-level options inserted
by the IP layer, the constraint is not fulfilled anymore.

This issues was found by running syzkaller. Thanks to markj@ who is
running an instance which also provides kernel dumps. This allowed me
to find this issue.

MFC after:		3 days
2019-09-15 18:29:45 +00:00
Andrew Gallatin
d2e6258258 Avoid unneeded call to arc4random() in syncache_add()
Don't call arc4random() unconditionally to initialize sc_iss, and
then when syncookies are enabled, just overwrite it with the
return value from from syncookie_generate(). Instead, only call
arc4random() to initialize sc_iss when syncookies are not
enabled.

Note that on a system under a syn flood attack, arc4random()
becomes quite expensive, and the chacha_poly crypto that it calls
is one of the more expensive things happening on the
system. Removing this unneeded arc4random() call reduces CPU from
about 40% to about 35% in my test scenario (Broadwell Xeon, 6Mpps
syn flood attack).

Reviewed by:	rrs, tuxen, bz
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21591
2019-09-11 18:48:26 +00:00
Randall Stewart
6f32ca1936 With the recent commit of ktls, we no longer have a
sb_tls_flags, its just the sb_flags. Also the ratelimit
code, now that the defintion is in sockbuf.h, does not
need the ktls.h file (or its predecessor).

Sponsored by:	Netflix Inc
2019-09-11 15:41:36 +00:00
Michael Tuexen
6d261981ee Only update SACK/DSACK lists when a non-empty segment was received.
This fixes hitting a KASSERT with a valid packet exchange.

Reviewed by:		rrs@, Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21567
2019-09-09 16:07:47 +00:00
Conrad Meyer
373013b05f Fix build after r351934
tcp_queue_pkts() is only used if TCPHPTS is defined (and it is not by
default).

Reported by:	gcc
2019-09-06 18:33:39 +00:00
Randall Stewart
af9b9e0d9f This adds in the missing counter initialization which
I had forgotten to bring over.. opps.

Differential Revision: https://reviews.freebsd.org/D21127
2019-09-06 18:29:48 +00:00
Warner Losh
82e837f803 Initialize if_hw_tsomaxsegsize to 0 to appease gcc's flow analysis as a
fail-safe.
2019-09-06 18:25:42 +00:00
Randall Stewart
e57b2d0e51 This adds the final tweaks to LRO that will now allow me
to add BBR. These changes make it so you can get an
array of timestamps instead of a compressed ack/data segment.
BBR uses this to aid with its delivery estimates. We also
now (via Drew's suggestions) will not go to the expense of
the tcb lookup if no stack registers to want this feature. If
HPTS is not present the feature is not present either and you
just get the compressed behavior.

Sponsored by:	Netflix Inc
Differential Revision: https://reviews.freebsd.org/D21127
2019-09-06 14:25:41 +00:00
Michael Tuexen
ecc5b1d156 Fix the SACK block generation in the base TCP stack by bringing it in
sync with the RACK stack.

Reviewed by:		rrs@
MFC after:		5 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21513
2019-09-04 04:38:31 +00:00
Michael Tuexen
191ae5bfa9 Fix two TCP RACK issues:
* Convert the TCP delayed ACK timer from ms to ticks as required.
  This fixes the timer on platforms with hz != 1000.
* Don't delay acknowledgements which report duplicate data using
  DSACKs.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21512
2019-09-03 19:48:02 +00:00
Michael Tuexen
fe5dee73f7 This patch improves the DSACK handling to conform with RFC 2883.
The lowest SACK block is used when multiple Blocks would be elegible as
DSACK blocks ACK blocks get reordered - while maintaining the ordering of
SACK blocks not relevant in the DSACK context is maintained.

Reviewed by:		rrs@, tuexen@
Obtained from:		Richard Scheffenegger
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D21038
2019-09-02 19:04:02 +00:00
Michael Tuexen
b21097894e Fix initialization of top_fsn.
MFC after:		3 days
2019-09-01 10:39:16 +00:00
Michael Tuexen
6182677fb3 Improve the handling of state cookie parameters in INIT-ACK chunks.
This fixes problem with parameters indicating a zero length or partial
parameters after an unknown parameter indicating to stop processing. It
also fixes a problem with state cookie parameters after unknown
parametes indicating to stop porcessing.
Thanks to Mark Wodrich from Google for finding two of these issues
by fuzz testing the userland stack and reporting them in
https://github.com/sctplab/usrsctp/issues/355
and
https://github.com/sctplab/usrsctp/issues/352

MFC after:		3 days
2019-09-01 10:09:53 +00:00
Michael Tuexen
e30a1788fa Improve function definition.
MFC after:		3 days
2019-08-31 13:13:40 +00:00
Michael Tuexen
ec24a1b67c Improve the handling of illegal sequence number combinations in received
data chunks. Abort the association if there are data chunks with larger
fragement sequence numbers than the fragement sequence of the last
fragment.
Thanks to Mark Wodrich from Google who found this issue by fuzz testing
the userland stack and reporting this issue in
https://github.com/sctplab/usrsctp/issues/355

MFC after:		3 days
2019-08-31 08:18:49 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
Michael Tuexen
15ddc5e43f Don't hold the rs_mtx lock while calling malloc().
Reviewed by:		rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21416
2019-08-26 16:23:47 +00:00
Randall Stewart
e13ad86c67 Fix an issue when TSO and Rack play together. Basically
an retransmission of the initial SYN (with data) would
cause us to strip the SYN and decrement/increase offset/len
which then caused us a -1 offset and a panic.

Reported by:	Larry Rosenman
(Michael Tuexen helped me debug this at the IETF)
2019-08-21 10:45:28 +00:00
Mark Johnston
ccdc986607 Fix netdump buffering after r348473.
nd_buf is used to buffer headers (for both the kernel dump itself and
for EKCD) before the final call to netdump_dumper(), which flushes
residual data in nd_buf.  As a result, a small portion of the residual
data would be corrupted.  This manifests when kernel dump compression
is enabled since both zstd and zlib detect the corruption during
decompression.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D21294
2019-08-19 16:29:51 +00:00
Andrey V. Elsukov
c4556b2fbe Save ip_ttl value and restore it after checksum calculation.
Since ipvoly is used for checksum calculation, part of original IP
header is zeroed. This part includes ip_ttl field, that can be used
later in IP_MINTTL socket option handling.

PR:		239799
MFC after:	1 week
2019-08-13 12:47:53 +00:00
Randall Stewart
23fa2dbc06 Place back in the dependency on HPTS via module depends versus
a fatal error in compiling. This was taken out by mistake
when I mis-merged from the 18q22p2 sources of rack in NF. Opps.

Reported by:	sbruno
2019-08-13 12:41:15 +00:00
Tom Jones
deb2aead4a Rename IPPROTO 33 from SEP to DCCP
IPPROTO 33 is DCCP in the IANA Registry:
https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml

IPPROTO_SEP was added about 20 years ago in r33804. The entries were added
straight from RFC1700, without regard to whether they were used.

The reference in RFC1700 for SEP is '[JC120] <mystery contact>', this is an
indication that the protocol number was probably in use in a private network.

As RFC1700 is no longer the authoritative list of internet numbers and that
IANA assinged 33 to DCCP in RFC4340, change the header to the actual
authoritative source.

Reviewed by:	Richard Scheffenegger, bz
Approved by:	bz (mentor)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21178
2019-08-08 11:43:09 +00:00
Michael Tuexen
47d16bf9c1 Fix a typo.
Submitted by:		Thomas Dreibholz
MFC after:		1 week
2019-08-08 08:23:27 +00:00
Michael Tuexen
cd2de8b735 Fix a locking issue in sctp_accept.
PR:			238520
Reported by:		pho@
MFC after:		1 week
2019-08-06 10:29:19 +00:00