Commit Graph

6680 Commits

Author SHA1 Message Date
Michael Tuexen
306c2ba375 Small cleanup due to upstream ifdef cleanups.
MFC after:		1 week
2020-06-12 10:13:23 +00:00
Michael Tuexen
28397ac1ed Non-functional changes due to upstream cleanup.
MFC after:		1 week
2020-06-11 13:34:09 +00:00
Richard Scheffenegger
2fda0a6f3a Prevent TCP Cubic to abruptly increase cwnd after app-limited
Cubic calculates the new cwnd based on absolute time
elapsed since the start of an epoch. A cubic epoch is
started on congestion events, or once the congestion
avoidance phase is started, after slow-start has
completed.

When a sender is application limited for an extended
amount of time and subsequently a larger volume of data
becomes ready for sending, Cubic recalculates cwnd
with a lingering cubic epoch. This recalculation
of the cwnd can induce a massive increase in cwnd,
causing a burst of data to be sent at line rate by
the sender.

This adds a flag to reset the cubic epoch once a
session transitions from app-limited to cwnd-limited
to prevent the above effect.

Reviewed by:	chengc_netapp.com, tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25065
2020-06-10 07:32:02 +00:00
Richard Scheffenegger
6907bbae18 Prevent TCP Cubic to abruptly increase cwnd after slow-start
Introducing flags to track the initial Wmax dragging and exit
from slow-start in TCP Cubic. This prevents sudden jumps in the
caluclated cwnd by cubic, especially when the flow is application
limited during slow start (cwnd can not grow as fast as expected).
The downside is that cubic may remain slightly longer in the
concave region before starting the convex region beyond Wmax again.

Reviewed by:	chengc_netapp.com, tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor, blanket)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23655
2020-06-09 21:07:58 +00:00
Michael Tuexen
5fb132abbb Whitespace cleanups and removal of a stale comment.
MFC after:		1 week
2020-06-08 20:23:20 +00:00
Randall Stewart
e854dd38ac An important statistic in determining if a server process (or client) is being delayed
is to know the time to first byte in and time to first byte out. Currently we
have no way to know these all we have is t_starttime. That (t_starttime) tells us
what time the 3 way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor from a client perspective
do we know how long from when we sent out the first byte before the
server responded.

This small change adds the ability to track the TTFB's. This will show up in
BB logging which then can be pulled for later analysis. Note that currently
the tracking is via the ticks variable of all three variables. This provides
a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be
to change all three of these values into something with a much finer resolution
(either microseconds or nanoseconds), though we may want to make the resolution
configurable so that on lower powered machines we could still use the much
cheaper ticks variable.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24902
2020-06-08 11:48:07 +00:00
Michael Tuexen
70486b27ae Retire SCTP_SO_LOCK_TESTING.
This was intended to test the locking used in the MacOS X kernel on a
FreeBSD system, to make use of WITNESS and other debugging infrastructure.
This hasn't been used for ages, to take it out to reduce the #ifdef
complexity.

MFC after:		1 week
2020-06-07 14:39:20 +00:00
Michael Tuexen
3f53d62236 Fix typo in comment.
Submitted by Orgad Shaneh for the userland stack.
MFC after:		1 week
2020-06-06 21:26:34 +00:00
Michael Tuexen
2cf3347109 Non-functional changes due to cleanup (upstream removing of Panda support)
of the code

MFC after:		1 week
2020-06-06 18:20:09 +00:00
Randall Stewart
2cf21ae559 We should never allow either the broadcast or IN_ADDR_ANY to be
connected to or sent to. This was fond when working with Michael
Tuexen and Skyzaller. Skyzaller seems to want to use either of
these two addresses to connect to at times. And it really is
an error to do so, so lets not allow that behavior.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24852
2020-06-03 14:16:40 +00:00
Randall Stewart
f1ea4e4120 This fixes a couple of skyzaller crashes. Most
of them have to do with TFO. Even the default stack
had one of the issues:

1) We need to make sure for rack that we don't advance
   snd_nxt beyond iss when we are not doing fast open. We
   otherwise can get a bunch of SYN's sent out incorrectly
   with the seq number advancing.
2) When we complete the 3-way handshake we should not ever
   append to reassembly if the tlen is 0, if TFO is enabled
   prior to this fix we could still call the reasemmbly. Note
   this effects all three stacks.
3) Rack like its cousin BBR should track if a SYN is on a
   send map entry.
4) Both bbr and rack need to only consider len incremented on a SYN
   if the starting seq is iss, otherwise we don't increment len which
   may mean we return without adding a sendmap entry.

This work was done in collaberation with Michael Tuexen, thanks for
all the testing!
Sponsored by:	Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D25000
2020-06-03 14:07:31 +00:00
Michael Tuexen
d442a65733 Restrict enabling TCP-FASTOPEN to end-points in CLOSED or LISTEN state
Enabling TCP-FASTOPEN on an end-point which is in a state other than
CLOSED or LISTEN, is a bug in the application. So it should not work.
Also the TCP code does not (and needs not to) handle this.
While there, also simplify the setting of the TF_FASTOPEN flag.

This issue was found by running syzkaller.

Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25115
2020-06-03 13:51:53 +00:00
Alexander V. Chernikov
da187ddb3d * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D25067
2020-06-01 20:49:42 +00:00
Alexander V. Chernikov
e7403d0230 Revert r361704, it accidentally committed merged D25067 and D25070. 2020-06-01 20:40:40 +00:00
Alexander V. Chernikov
79674562b8 * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:	ae
Differential Revision: https://reviews.freebsd.org/D25067
2020-06-01 20:32:02 +00:00
Alexander V. Chernikov
a37a5246ca Use fib[46]_lookup() in mtu calculations.
fib[46]_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Differential Revision:	https://reviews.freebsd.org/D24974
2020-05-28 08:00:08 +00:00
Alexander V. Chernikov
3553b3007f Switch ip_output/icmp_reflect rt lookup calls with fib4_lookup.
fib4_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24976
2020-05-28 07:31:53 +00:00
Alexander V. Chernikov
7bfc98af12 Switch gif(4) path verification to fib[46]_check_urfp().
fibX_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Use specialized fib[46]_check_urpf() from newer KPI instead,
to allow removal of older KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24978
2020-05-28 07:26:18 +00:00
Emmanuel Vadot
77c68315f6 bbr: Use arc4random_uniform from libkern.
This unbreak LINT build

Reported by:	jenkins, melifaro
2020-05-23 19:52:20 +00:00
Alexander V. Chernikov
4d2c2509f2 Move <add|del|change>_route() functions to route_ctl.c in preparation of
multipath control plane changed described in D24141.

Currently route.c contains core routing init/teardown functions, route table
 manipulation functions and various helper functions, resulting in >2KLOC
 file in total. This change moves most of the route table manipulation parts
 to a dedicated file, simplifying planned multipath changes and making
 route.c more manageable.

Differential Revision:	https://reviews.freebsd.org/D24870
2020-05-23 19:06:57 +00:00
Richard Scheffenegger
e68cde59c3 DCTCP: update alpha only once after loss recovery.
In mixed ECN marking and loss scenarios it was found, that
the alpha value of DCTCP is updated two times. The second
update happens with freshly initialized counters indicating
to ECN loss. Overall this leads to alpha not adjusting as
quickly as expected to ECN markings, and therefore lead to
excessive loss.

Reported by:	Cheng Cui
Reviewed by:	chengc_netapp.com, rrs, tuexen (mentor)
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24817
2020-05-21 21:42:49 +00:00
Richard Scheffenegger
af2fb894c9 With RFC3168 ECN, CWR SHOULD only be sent with new data
Overly conservative data receivers may ignore the CWR flag
on other packets, and keep ECE latched. This can result in
continous reduction of the congestion window, and very poor
performance when ECN is enabled.

Reviewed by:	rgrimes (mentor), rrs
Approved by:	rgrimes (mentor), tuexen (mentor)
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23364
2020-05-21 21:33:15 +00:00
Richard Scheffenegger
8e0511652b Retain only mutually supported TCP options after simultaneous SYN
When receiving a parallel SYN in SYN-SENT state, remove all the
options only we supported locally before sending the SYN,ACK.

This addresses a consistency issue on parallel opens.

Also, on such a parallel open, the stack could be coaxed into
running with timestamps enabled, even if administratively disabled.

Reviewed by:	tuexen (mentor)
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23371
2020-05-21 21:26:21 +00:00
Richard Scheffenegger
6e16d87751 Handle ECN handshake in simultaneous open
While testing simultaneous open TCP with ECN, found that
negotiation fails to arrive at the expected final state.

Reviewed by:	tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23373
2020-05-21 21:15:25 +00:00
Mark Johnston
591b09b486 Define a module version for accept filter modules.
Otherwise accept filters compiled into the kernel do not preempt
preloaded accept filter modules.  Then, the preloaded file registers its
accept filter module before the kernel, and the kernel's attempt fails
since duplicate accept filter list entries are not permitted.  This
causes the preloaded file's module to be released, since
module_register_init() does a lookup by name, so the preloaded file is
unloaded, and the accept filter's callback points to random memory since
preload_delete_name() unmaps the file on x86 as of r336505.

Add a new ACCEPT_FILTER_DEFINE macro which wraps the accept filter and
module definitions, and ensures that a module version is defined.

PR:		245870
Reported by:	Thomas von Dein <freebsd@daemon.de>
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-05-19 18:35:08 +00:00
Michael Tuexen
999f86d67d Replace snprintf() by SCTP_SNPRINTF() and let SCTP_SNPRINTF() map
to snprintf() on FreeBSD. This allows to check for failures of snprintf()
on platforms other than FreeBSD kernel.
2020-05-19 07:23:35 +00:00
Michael Tuexen
821bae7cf3 Revert r361209:
cem noted that on FreeBSD snprintf() can not fail and code should not
check for that.

A followup commit will replace the usage of snprintf() in the SCTP
sources with a variadic macro SCTP_SNPRINTF, which will simply map to
snprintf() on FreeBSD and do a checking similar to r361209 on
other platforms.
2020-05-19 07:21:11 +00:00
Mike Karels
2fdbcbea76 Fix NULL-pointer bug from r361228.
Note that in_pcb_lport and in_pcb_lport_dest can be called with a NULL
local address for IPv6 sockets; handle it.  Found by syzkaller.

Reported by:	cem
MFC after:	1 month
2020-05-19 01:05:13 +00:00
Mike Karels
2510235150 Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by:	tuexen (rscheff, rrs previous version)
MFC after:	1 month
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D24781
2020-05-18 22:53:12 +00:00
Michael Tuexen
6863ab0b8b Remove assignment without effect.
MFC after:	3 days
2020-05-18 19:48:38 +00:00
Michael Tuexen
9e8c9c9ef6 Don't check an unsigned variable for being negative.
MFC after:		3 days.
2020-05-18 19:35:46 +00:00
Michael Tuexen
04ce5df9c9 Remove redundant assignment.
MFC after:		3 days
2020-05-18 19:23:01 +00:00
Michael Tuexen
bca1802890 Cleanup, no functional change intended.
MFC after:		3 days
2020-05-18 18:42:43 +00:00
Michael Tuexen
6395219ae3 Avoid an integer underflow.
MFC after:		3 days
2020-05-18 18:32:58 +00:00
Michael Tuexen
60017c8e88 Remove redundant check.
MFC after:		3 days
2020-05-18 18:27:10 +00:00
Michael Tuexen
88116b7e4d Fix logical condition by looking at usecs.
This issue was found by cpp-check running on the userland stack.

MFC after:		3 days
2020-05-18 15:02:15 +00:00
Michael Tuexen
00023d8a87 Whitespace change.
MFC after:		3 days
2020-05-18 15:00:18 +00:00
Michael Tuexen
e708e2a4f4 Handle failures of snprintf().
MFC after:		3 days
2020-05-18 10:07:01 +00:00
Michael Tuexen
da8c34c382 Non-functional changes, cleanups.
MFC after:		3 days
2020-05-17 22:31:38 +00:00
Alexander V. Chernikov
174fb9dbb1 Remove redundant checks for nhop validity.
Currently NH_IS_VALID() simly aliases to RT_LINK_IS_UP(), so we're
 checking the same thing twice.

In the near future the implementation of this check will be simpler,
 as there are plans to introduce control-plane interface status monitoring
 similar to ipfw interface tracker.
2020-05-17 15:32:36 +00:00
Michael Tuexen
daf143413a Ensure that an stcb is not dereferenced when it is about to be
freed.
This issue was found by SYZKALLER.

MFC after:		3 days
2020-05-16 19:26:39 +00:00
Ed Maste
65a1d63665 libalias: retire cuseeme support
The CU-SeeMe videoconferencing client and associated protocol is at this
point a historical artifact; there is no need to retain support for this
protocol today.

Reviewed by:	philip, markj, allanjude
Relnotes:	Yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24790
2020-05-16 02:29:10 +00:00
Michael Tuexen
e240ce42bf Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets.
This problem was found by looking at syzkaller reproducers for some other
problems.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24831
2020-05-15 14:06:37 +00:00
Randall Stewart
777b88d60f This fixes several skyzaller issues found with the
help of Michael Tuexen. There was some accounting
errors with TCPFO for bbr and also for both rack
and bbr there was a FO case where we should be
jumping to the just_return_nolock label to
exit instead of returning 0. This of course
caused no timer to be running and thus the
stuck sessions.

Reported by: Michael Tuexen and Skyzaller
Sponsored by: Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24852
2020-05-15 14:00:12 +00:00
Ed Maste
46701f31be libalias: fix potential memory disclosure from ftp module
admbugs:	956
Submitted by:	markj
Reported by:	Vishnu Dev TJ working with Trend Micro Zero Day Initiative
Security:	FreeBSD-SA-20:13.libalias
Security:	CVE-2020-7455
Security:	ZDI-CAN-10849
2020-05-12 16:38:28 +00:00
Ed Maste
6461c83e09 libalias: validate packet lengths before accessing headers
admbugs:	956
Submitted by:	ae
Reported by:	Lucas Leong (@_wmliang_) of Trend Micro Zero Day Initiative
Reported by:	Vishnu working with Trend Micro Zero Day Initiative
Security:	FreeBSD-SA-20:12.libalias
2020-05-12 16:33:04 +00:00
Michael Tuexen
86fd36c502 Fix a copy and paste error introduced in r360878.
Reported-by:		syzbot+a0863e972771f2f0d4b3@syzkaller.appspotmail.com
Reported-by:		syzbot+4481757e967ba83c445a@syzkaller.appspotmail.com
MFC after:		3 days
2020-05-11 22:47:20 +00:00
Alexander V. Chernikov
1d1a743e9f Fix NOINET[6] build by using af-independent route lookup function.
Reported by:	rpokala
2020-05-11 20:41:03 +00:00
Andrew Gallatin
6043ac201a Ktls: never skip stamping tags for NIC TLS
The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.

When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped.  Failure
to stamp a NIC TLS packet will result in crypto
issues.

Reviewed by:	hselasky, rrs
Sponsored by:	Netflix, Mellanox
2020-05-11 19:17:33 +00:00
Michael Tuexen
83ed508055 Ensure that the SCTP iterator runs with an stcb and inp, which belong to
each other.

Reported by:	syzbot+82d39d14f2f765e38db0@syzkaller.appspotmail.com
MFC after:	3 days
2020-05-10 22:54:30 +00:00
Michael Tuexen
9d176904ae Remove trailing whitespace. 2020-05-10 17:43:42 +00:00
Michael Tuexen
efd5e69291 Ensure that we have a path when starting the T3 RXT timer.
Reported by:	syzbot+f2321629047f89486fa3@syzkaller.appspotmail.com
MFC after:	3 days
2020-05-10 17:19:19 +00:00
Michael Tuexen
8123bbf186 Only drop DATA chunk with lower priorities as specified in RFC 7496.
This issue was found by looking at a reproducer generated by syzkaller.

MFC after:		3 days
2020-05-10 10:03:10 +00:00
Randall Stewart
b1ddcbc62c When in the SYN-SENT state bbr and rack will not properly send an ACK but instead start the D-ACK timer. This
causes so_reuseport_lb_test to fail since it slows down how quickly the program runs until the timeout occurs
and fails the test

Sponsored by: Netflix inc.
Differential Revision:	https://reviews.freebsd.org/D24747
2020-05-07 20:29:38 +00:00
Randall Stewart
8717b8f1bb NF has an internal option that changes the tcp_mcopy_m routine slightly (has
a few extra arguments). Recently that changed to only have one arg extra so
that two ifdefs around the call are no longer needed. Lets take out the
extra ifdef and arg.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D24736
2020-05-07 10:46:02 +00:00
Michael Tuexen
cb9fb7b2cb Avoid underflowing a variable, which would result in taking more
data from the stream queues then needed.

Thanks to Timo Voelker for finding this bug and providing a fix.

MFC after:		3 days
2020-05-05 19:54:30 +00:00
Michael Tuexen
d3c3d6f99c Fix the computation of the numbers of entries of the mapping array to
look at when generating a SACK. This was wrong in case of sequence
numbers wrap arounds.

Thanks to Gwenael FOURRE for reporting the issue for the userland stack:
https://github.com/sctplab/usrsctp/issues/462
MFC after:		3 days
2020-05-05 17:52:44 +00:00
Michael Tuexen
51a5392297 Add net epoch support back, which was taken out by accident in
https://svnweb.freebsd.org/changeset/base/360639

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24694
2020-05-04 23:05:11 +00:00
Randall Stewart
570045a0fc This fixes two issues found by ankitraheja09@gmail.com
1) When BBR retransmits the syn it was messing up the snd_max
2) When we need to send a RST we might not send it when we should

Reported by:	ankitraheja09@gmail.com
Sponsored by:  Netflix.com
Differential Revision: https://reviews.freebsd.org/D24693
2020-05-04 23:02:58 +00:00
Michael Tuexen
7985fd7e76 Enter the net epoch before calling the output routine in TCP BBR.
This was only triggered when setting the IPPROTO_TCP level socket
option TCP_DELACK.
This issue was found by runnning an instance of SYZKALLER.
Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24690
2020-05-04 22:02:49 +00:00
Randall Stewart
963fb2ad94 This commit brings things into sync with the advancements that
have been made in rack and adds a few fixes in BBR. This also
removes any possibility of incorrectly doing OOB data the stacks
do not support it. Should fix the skyzaller crashes seen in the
past. Still to fix is the BBR issue just reported this weekend
with the SYN and on sending a RST. Note that this version of
rack can now do pacing as well.

Sponsored by:Netflix Inc
Differential Revision:https://reviews.freebsd.org/D24576
2020-05-04 20:28:53 +00:00
Randall Stewart
d3b6c96b7d Adjust the fb to have a way to ask the underlying stack
if it can support the PRUS option (OOB). And then have
the new function call that to validate and give the
correct error response if needed to the user (rack
and bbr do not support obsoleted OOB data).

Sponsoered by: Netflix Inc.
Differential Revision:	 https://reviews.freebsd.org/D24574
2020-05-04 20:19:57 +00:00
Alexander V. Chernikov
9e02229580 Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields.
After converting routing subsystem customers to use nexthop objects
 defined in r359823, some fields in struct rtentry became unused.

This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
 along with the code initializing and updating these fields.

Cleanup of the remaining fields will be addressed by D24669.

This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
 resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
 nexthop and tries to update rte nhop pointer. Only last part is done under
 WLOCK.
In the hypothetical scenarious where multiple rtsock clients
 repeatedly issue RTM_CHANGE requests for the same route, route may get
 updated between read and update operation. This is addressed by retrying
 the operation multiple (3) times before returning failure back to the
 caller.

Differential Revision:	https://reviews.freebsd.org/D24666
2020-05-04 14:31:45 +00:00
Gleb Smirnoff
61664ee700 Step 4.2: start divorce of M_EXT and M_EXTPG
They have more differencies than similarities. For now there is lots
of code that would check for M_EXT only and work correctly on M_EXTPG
buffers, so still carry M_EXT bit together with M_EXTPG. However,
prepare some code for explicit check for M_EXTPG.

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:37:16 +00:00
Gleb Smirnoff
6edfd179c8 Step 4.1: mechanically rename M_NOMAP to M_EXTPG
Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:21:11 +00:00
Gleb Smirnoff
7b6c99d08d Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:12:56 +00:00
Richard Scheffenegger
14558b9953 Introduce a lower bound of 2 MSS to TCP Cubic.
Running TCP Cubic together with ECN could end up reducing cwnd down to 1 byte, if the
receiver continously sets the ECE flag, resulting in very poor transmission speeds.

In line with RFC6582 App. B, a lower bound of 2 MSS is introduced, as well as a typecast
to prevent any potential integer overflows during intermediate calculation steps of the
adjusted cwnd.

Reported by:	Cheng Cui
Reviewed by:	tuexen (mentor)
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23353
2020-04-30 11:11:28 +00:00
Richard Scheffenegger
9028b6e0d9 Prevent premature shrinking of the scaled receive window
which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets.

Packets with old sequence numbers are ignored and not used to update the send window size.
This might cause the TCP session to hang indefinitely under some circumstances.

Reported by:	Cui Cheng
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24515
2020-04-29 22:01:33 +00:00
Richard Scheffenegger
b2ade6b166 Correctly set up the initial TCP congestion window in all cases,
by not including the SYN bit sequence space in cwnd related calculations.
Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead.

This fixes an off-by-one conformance issue with regular TCP sessions not
using Appropriate Byte Counting (RFC3465), sending one more packet during
the initial window than expected.

PR:		235256
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D19000
2020-04-29 21:48:52 +00:00
Alexander V. Chernikov
e7d8af4f65 Move route_temporal.c and route_var.h to net/route.
Nexthop objects implementation, defined in r359823,
 introduced sys/net/route directory intended to hold all
 routing-related code. Move recently-introduced route_temporal.c and
 private route_var.h header there.

Differential Revision:	https://reviews.freebsd.org/D24597
2020-04-28 19:14:09 +00:00
Alexander V. Chernikov
4043ee3cd7 Convert rtalloc_mpath_fib() users to the new KPI.
New fib[46]_lookup() functions support multipath transparently.
Given that, switch the last rtalloc_mpath_fib() calls to
 dib4_lookup() and eliminate the function itself.

Note: proper flowid generation (especially for the outbound traffic) is a
 bigger topic and will be handled in a separate review.
This change leaves flowid generation intact.

Differential Revision:	https://reviews.freebsd.org/D24595
2020-04-28 08:06:56 +00:00
Alexander V. Chernikov
1b0051bada Eliminate now-unused parts of old routing KPI.
r360292 switched most of the remaining routing customers to a new KPI,
 leaving a bunch of wrappers for old routing lookup functions unused.

Remove them from the tree as a part of routing cleanup.

Differential Revision:	https://reviews.freebsd.org/D24569
2020-04-28 07:25:34 +00:00
John Baldwin
f1f9347546 Initial support for kernel offload of TLS receive.
- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
  authentication algorithms and keys as well as the initial sequence
  number.

- When reading from a socket using KTLS receive, applications must use
  recvmsg().  Each successful call to recvmsg() will return a single
  TLS record.  A new TCP control message, TLS_GET_RECORD, will contain
  the TLS record header of the decrypted record.  The regular message
  buffer passed to recvmsg() will receive the decrypted payload.  This
  is similar to the interface used by Linux's KTLS RX except that
  Linux does not return the full TLS header in the control message.

- Add plumbing to the TOE KTLS interface to request either transmit
  or receive KTLS sessions.

- When a socket is using receive KTLS, redirect reads from
  soreceive_stream() into soreceive_generic().

- Note that this interface is currently only defined for TLS 1.1 and
  1.2, though I believe we will be able to reuse the same interface
  and structures for 1.3.
2020-04-27 23:17:19 +00:00
John Baldwin
ec1db6e13d Add the initial sequence number to the TLS enable socket option.
This will be needed for KTLS RX.

Reviewed by:	gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D24451
2020-04-27 22:31:42 +00:00
Randall Stewart
e570d231f4 This change does a small prepratory step in getting the
latest rack and bbr in from the NF repo. When those come
in the OOB data handling will be fixed where Skyzaller crashes.

Differential Revision:	https://reviews.freebsd.org/D24575
2020-04-27 16:30:29 +00:00
Alexander V. Chernikov
55f57ca9ac Convert debugnet to the new routing KPI.
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.

Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D24555
2020-04-26 18:42:38 +00:00
Alexander V. Chernikov
17cb6ddba8 Fix order of arguments in fib[46]_lookup calls in SCTP.
r360292 introduced the wrong order, resulting in returned
 nhops not being referenced, despite the fact that references
 were requested. That lead to random GPF after using SCTP sockets.

Special defined macro like IPV[46]_SCOPE_GLOBAL will be introduced
 soon to reduce the chance of putting arguments in wrong order.

Reported-by: syzbot+5c813c01096363174684@syzkaller.appspotmail.com
2020-04-26 13:02:42 +00:00
Alexander V. Chernikov
454d389645 Fix LINT build #2 after r360292.
Pointyhat to: melifaro
2020-04-25 11:35:38 +00:00
Alexander V. Chernikov
ac99fd86d4 Fix LINT build broken by r360292. 2020-04-25 10:31:56 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Alexander V. Chernikov
9e88f47c8f Unbreak LINT-NOINET[6] builds broken in r360191.
Reported by:	np
2020-04-23 06:55:33 +00:00
Michael Tuexen
8262311cbe Improve input validation when processing AUTH chunks.
Thanks to Natalie Silvanovich from Google for finding and reporting the
issue found by her in the SCTP userland stack.

MFC after:		3 days
X-MFC with:		https://svnweb.freebsd.org/changeset/base/360193
2020-04-22 21:22:33 +00:00
Michael Tuexen
97feba891d Improve input validation when processing AUTH chunks.
Thanks to Natalie Silvanovich from Google for finding and reporting the
issue found by her in the SCTP userland stack.

MFC after:		3 days
2020-04-22 12:47:46 +00:00
Alexander V. Chernikov
8d6708ba80 Convert TOE routing lookups to the new routing KPI.
Reviewed by:	np
Differential Revision:	https://reviews.freebsd.org/D24388
2020-04-22 07:53:43 +00:00
Richard Scheffenegger
bb410f9ff2 revert rS360143 - Correctly set up initial cwnd
due to syzkaller panics found

Reported by:	tuexen
Approved by:	tuexen (mentor)
Sponsored by:	NetApp, Inc.
2020-04-22 00:16:42 +00:00
Richard Scheffenegger
73b7696693 Correctly set up the initial TCP congestion window
in all cases, by adjust snd_una right after the
connection initialization, to include the one byte
in sequence space occupied by the SYN bit.

This does not change the regular ACK processing,
while making the BYTES_THIS_ACK macro to work properly.

PR:		235256
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D19000
2020-04-21 13:05:44 +00:00
Jonathan T. Looney
5d6e356cb0 Avoid calling protocol drain routines more than once per reclamation event.
mb_reclaim() calls the protocol drain routines for each protocol in each
domain. Some protocols exist in more than one domain and share drain
routines. In the case of SCTP, it also uses the same drain routine for
its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain.

On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls
sctp_drain() four times. On systems with INET and INET6 defined,
mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree
caller of the pr_drain protocol entry.

Eliminate this duplication by ensuring that each pr_drain routine is only
specified for one protocol entry in one domain.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24418
2020-04-16 20:17:24 +00:00
Alexander V. Chernikov
539642a29d Add nhop parameter to rti_filter callback.
One of the goals of the new routing KPI defined in r359823 is to
 entirely hide`struct rtentry` from the consumers. It will allow to
 improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
 struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
 to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
 routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24440
2020-04-16 17:20:18 +00:00
Richard Scheffenegger
d7ca3f780d Reduce default TCP delayed ACK timeout to 40ms.
Reviewed by:	kbowling, tuexen
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23281
2020-04-16 15:59:23 +00:00
Alexander V. Chernikov
9ac7c6cfed Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to
the new routing KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24245
2020-04-14 23:06:25 +00:00
Michael Tuexen
b89af8e16d Improve the TCP blackhole detection. The principle is to reduce the
MSS in two steps and try each candidate two times. However, if two
candidates are the same (which is the case in TCP/IPv6), this candidate
was tested four times. This patch ensures that each candidate actually
reduced the MSS and is only tested 2 times. This reduces the time window
of missclassifying a temporary outage as an MTU issue.

Reviewed by:		jtl
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24308
2020-04-14 16:35:05 +00:00
Andrew Gallatin
23feb56348 KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213
2020-04-14 14:46:06 +00:00
Alexander V. Chernikov
6722086045 Plug netmask NULL check during route addition causing kernel panic.
This bug was introduced by the r359823.

Reported by:	hselasky
2020-04-14 13:12:22 +00:00
Kristof Provost
1d126e9b94 carp: Widen epoch coverage
Fix panics related to calling code which expects to be running inside
the NET_EPOCH from outside that epoch.

This leads to panics (with INVARIANTS) such as this one:

    panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/if_ether.c:373
    cpuid = 7
    time = 1586095719
    KDB: stack backtrace:
    db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0090819700
    vpanic() at vpanic+0x182/frame 0xfffffe0090819750
    panic() at panic+0x43/frame 0xfffffe00908197b0
    arprequest_internal() at arprequest_internal+0x59e/frame 0xfffffe00908198c0
    arp_announce_ifaddr() at arp_announce_ifaddr+0x20/frame 0xfffffe00908198e0
    carp_master_down_locked() at carp_master_down_locked+0x10d/frame 0xfffffe0090819910
    carp_master_down() at carp_master_down+0x79/frame 0xfffffe0090819940
    softclock_call_cc() at softclock_call_cc+0x13f/frame 0xfffffe00908199f0
    softclock() at softclock+0x7c/frame 0xfffffe0090819a20
    ithread_loop() at ithread_loop+0x279/frame 0xfffffe0090819ab0
    fork_exit() at fork_exit+0x80/frame 0xfffffe0090819af0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0090819af0
    --- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Widen the NET_EPOCH to cover the relevant (callback / task) code.

Differential Revision:	https://reviews.freebsd.org/D24302
2020-04-12 16:09:21 +00:00
Alexander V. Chernikov
a666325282 Introduce nexthop objects and new routing KPI.
This is the foundational change for the routing subsytem rearchitecture.
 More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces concept of nexthop objects and new nexthop-based
 routing KPI.

Nexthops are objects, containing all necessary information for performing
 the packet output decision. Output interface, mtu, flags, gw address goes
 there. For most of the cases, these objects will serve the same role as
 the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
 multiple BGP full-views, as these objects will be shared between routing
 entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 function are intended to replace all all flavours of
 <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions  and the previous
 fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
 exist within current NET_EPOCH. If longer lifetime is desired, one can
 specify NHR_REF as a flag and get a referenced version of the nexthop.
 Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
 inside variety of our firewalls. Their primary goal is to hide the multipath
 implementation details inside the routing subsystem, greatly simplifying
 firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
 embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
 * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
 * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
 decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
 kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by:	ae,glebius(initial version)
Differential Revision:	https://reviews.freebsd.org/D24232
2020-04-12 14:30:00 +00:00
Michael Tuexen
07ddae2822 Revert https://svnweb.freebsd.org/changeset/base/359809
The intended change was
	sp->next.tqe_next = NULL;
	sp->next.tqe_prev = NULL;
which doesn't fix the issue I'm seeing and the committed fix is
not the intended fix due to copy-and-paste.

Thanks a lot to Conrad Meyer for making me aware of the problem.

Reported by:		cem
2020-04-12 09:31:36 +00:00
Michael Tuexen
9803dbb3ea Zero out pointers for consistency.
This was found by running syzkaller on an INVARIANTS kernel.

MFC after:		3 days
2020-04-11 20:36:54 +00:00
Alexander V. Chernikov
4684d3cbcb Remove per-AF radix_mpath initializtion functions.
Split their functionality by moving random seed allocation
 to SYSINIT and calling (new) generic multipath function from
 standard IPv4/IPv5 RIB init handlers.

Differential Revision:	https://reviews.freebsd.org/D24356
2020-04-11 07:37:08 +00:00
Warner Losh
28540ab153 Fix copyright year and eliminate the obsolete all rights reserved line.
Reviewed by: rrs@
2020-04-08 17:55:45 +00:00
Michael Tuexen
f4cb790a35 Do more argument validation under INVARIANTS when starting/stopping
an SCTP timer.

MFC after:		1 week
2020-04-06 13:58:13 +00:00