Commit Graph

7010 Commits

Author SHA1 Message Date
Michael Tuexen
47384244f9 Fix an issue I introuced in r367530: tcp_twcheck() can be called
with to == NULL for SYN segments. So don't assume tp != NULL.
Thanks to jhb@ for reporting and suggesting a fix.

PR:			250499
MFC after:		1 week
XMFC-with:		r367530
Sponsored by:		Netflix, Inc.
2020-11-20 13:00:28 +00:00
Ed Maste
360d1232ab ip_fastfwd: style(9) tidy for r367628
Discussed with:	gnn
MFC with:	r367628
2020-11-13 18:25:07 +00:00
George V. Neville-Neil
d65d6d5aa9 Followup pointed out by ae@ 2020-11-13 13:07:44 +00:00
George V. Neville-Neil
8ad114c082 An earlier commit effectively turned out the fast forwading path
due to its lack of support for ICMP redirects. The following commit
adds redirects to the fastforward path, again allowing for decent
forwarding performance in the kernel.

Reviewed by: ae, melifaro
Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")
2020-11-12 21:58:47 +00:00
Michael Tuexen
283c76c7c3 RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
  the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
  for the timestamp option has not been negotiated.
This patch enforces the above.

PR:			250499
Reviewed by:		gnn, rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc
Differential Revision:	https://reviews.freebsd.org/D27148
2020-11-09 21:49:40 +00:00
Michael Tuexen
e597bae4ee Fix a potential use-after-free bug introduced in
https://svnweb.freebsd.org/changeset/base/363046

Thanks to Taylor Brandstetter for finding this issue using fuzz testing
and reporting it in https://github.com/sctplab/usrsctp/issues/547
2020-11-09 13:12:07 +00:00
Mitchell Horne
b02c4e5c78 igmp: convert igmpstat to use PCPU counters
Currently there is no locking done to protect this structure. It is
likely okay due to the low-volume nature of IGMP, but allows for
the possibility of underflow. This appears to be one of the only
holdouts of the conversion to counter(9) which was done for most
protocol stat structures around 2013.

This also updates the visibility of this stats structure so that it can
be consumed from elsewhere in the kernel, consistent with the vast
majority of VNET_PCPUSTAT structures.

Reviewed by:	kp
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27023
2020-11-08 18:49:23 +00:00
Richard Scheffenegger
4d0770f172 Prevent premature SACK block transmission during loss recovery
Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.

Reported by:	chengc_netapp.com
Reviewed by:	rrs, jtl
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24237
2020-11-08 18:47:05 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
John Baldwin
98d7a8d9cd Call m_snd_tag_rele() to free send tags.
Send tags are refcounted and if_snd_tag_free() is called by
m_snd_tag_rele() when the last reference is dropped on a send tag.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26995
2020-10-29 22:18:56 +00:00
John Baldwin
7552deb2a0 Remove an extra if_ref().
In r348254, if_snd_tag_alloc() routines were changed to bump the ifp
refcount via m_snd_tag_init().  This function wasn't in the tree at
the time and wasn't updated for the new semantics, so was still doing
a separate bump after if_snd_tag_alloc() returned.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26999
2020-10-29 22:16:59 +00:00
John Baldwin
aebfdc1fec Store the new send tag in the right place.
r350501 added the 'st' parameter, but did not pass it down to
if_snd_tag_alloc().

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26997
2020-10-29 22:14:34 +00:00
John Baldwin
521eac97f3 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2020-10-29 00:23:16 +00:00
John Baldwin
ce39811544 Save the current TCP pacing rate in t_pacing_rate.
Reviewed by:	gallatin, gnn
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26875
2020-10-29 00:03:19 +00:00
Richard Scheffenegger
3767427354 TCP Cubic: improve reaction to (and rollback from) RTO
1. fix compliancy issue of CUBIC RTO handling according to RFC8312 section 4.7
2. add CUBIC CC_RTO_ERR handling

Submitted by:	chengc_netapp.com
Reviewed by:	rrs, tuexen, rscheff
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26808
2020-10-24 16:11:46 +00:00
Richard Scheffenegger
39a12f0178 tcp: move cwnd and ssthresh updates into cc modules
This will pave the way of setting ssthresh differently in TCP CUBIC, according
to RFC8312 section 4.7.

No functional change, only code movement.

Submitted by:	chengc_netapp.com
Reviewed by:	rrs, tuexen, rscheff
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26807
2020-10-24 16:09:18 +00:00
Mark Johnston
4caea9b169 icmp6: Count packets dropped due to an invalid hop limit
Pad the icmp6stat structure so that we can add more counters in the
future without breaking compatibility again, last done in r358620.
Annotate the rarely executed error paths with __predict_false while
here.

Reviewed by:	bz, melifaro
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26578
2020-10-19 17:07:19 +00:00
Alexander V. Chernikov
0c325f53f1 Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
 and outbound traffic. Current code fills mbuf flowid from inp_flowid
 for connection-oriented sockets. However, inp_flowid is currently
 not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
 for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
 system.

Reviewed by:	glebius (previous version),ae
Differential Revision:	https://reviews.freebsd.org/D26523
2020-10-18 17:15:47 +00:00
Alexander V. Chernikov
fa8b3fcb4c Simplify NET_EPOCH_EXIT in inp_join_group().
Suggested by:	kib
2020-10-18 12:03:36 +00:00
Alexander V. Chernikov
337418adf1 Fix sleepq_add panic happening with too wide net epoch in mcast control.
PR:		250413
Reported by:	Christopher Hall <hsw at bitmark.com>
Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D26827
2020-10-17 20:33:09 +00:00
Michael Tuexen
a92d501617 Improve the handling of cookie life times.
The staleness reported in an error cause is in us, not ms.
Enforce limits on the life time via sysct; and socket options
consistently. Update the description of the sysctl variable to
use the right unit. Also do some minor cleanups.
This also fixes an interger overflow issue if the peer can
modify the cookie. This was reported by Felix Weinrank by fuzz testing
the userland stack and in
https://oss-fuzz.com/testcase-detail/4800394024452096

MFC after:		3 days
2020-10-16 10:44:48 +00:00
Andrey V. Elsukov
6952c3e1ac Implement SIOCGIFALIAS.
It is lightweight way to check if an IPv4 address exists.

Submitted by:	Roy Marples
Reviewed by:	gnn, melifaro
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26636
2020-10-14 09:22:54 +00:00
Andrey V. Elsukov
3f740d4393 Join to AllHosts multicast group again when adding an existing IPv4 address.
When SIOCAIFADDR ioctl configures an IPv4 address that is already exist,
it removes old ifaddr. When this IPv4 address is only one configured on
the interface, this also leads to leaving from AllHosts multicast group.
Then an address is added again, but due to the bug, this doesn't lead
to joining to AllHosts multicast group.

Submitted by:	yannis.planus_alstomgroup.com
Reviewed by:	gnn
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D26757
2020-10-13 19:34:36 +00:00
Bjoern A. Zeeb
506512b170 ip_mroute: fix the viftable export sysctl
It seems that in r354857 I got more than one thing wrong.
Convert the SYSCTL_OPAQUE to a SYSCTL_PROC to properly export the these
days allocated and not longer static per-vnet viftable array.
This fixes a problem with netstat -g which would show bogus information
for the IPv4 Virtual Interface Table.

PR:		246626
Reported by:	Ozkan KIRIK (ozkan.kirik gmail.com)
MFC after:	3 days
2020-10-11 00:01:00 +00:00
Richard Scheffenegger
4b72ae16ed Stop sending tiny new data segments during SACK recovery
Consider the currently in-use TCP options when
calculating the amount of new data to be injected during
SACK loss recovery. That addresses the effect that very small
(new) segments could be injected on partial ACKs while
still performing a SACK loss recovery.

Reported by:	Liang Tian
Reviewed by:	tuexen, chengc_netapp.com
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26446
2020-10-09 12:44:56 +00:00
Richard Scheffenegger
868aabb470 Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409
2020-10-09 12:06:43 +00:00
Richard Scheffenegger
5432120028 Extend netstat to display TCP stack and detailed congestion state (2)
Extend netstat to display TCP stack and detailed congestion state

Adding the "-c" option used to show detailed per-connection
congestion control state for TCP sessions.

This is one summary patch, which adds the relevant variables into
xtcpcb. As previous "spare" space is used, these changes are ABI
compatible.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26518
2020-10-09 10:55:19 +00:00
Michael Tuexen
e7a39b856a Minor cleanups.
MFC after:		3 days
2020-10-07 15:22:48 +00:00
John Baldwin
9aed26b906 Check if_capenable, not if_capabilities when enabling rate limiting.
if_capabilities is a read-only mask of supported capabilities.
if_capenable is a mask under administrative control via ifconfig(8).

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26690
2020-10-06 18:02:33 +00:00
Michael Tuexen
6f155d690b Reset delayed SACK state when restarting an SCTP association.
MFC after:		3 days
2020-10-06 14:26:05 +00:00
Michael Tuexen
b954d81662 Ensure variables are initialized before used.
MFC after:		3 days
2020-10-06 11:29:08 +00:00
Michael Tuexen
6176f9d6df Remove dead stores reported by clang static code analysis
MFC after:		3 days
2020-10-06 11:08:52 +00:00
Michael Tuexen
11daa73adc Cleanup, no functional change intended.
MFC after:		3 days
2020-10-06 10:41:04 +00:00
Michael Tuexen
c8e55b3c0c Whitespace changes.
MFC after:		3 days
2020-10-06 09:51:40 +00:00
Michael Tuexen
9f2d6263bb Use __func__ instead of __FUNCTION__ for consistency.
MFC after:		3 days
2020-10-04 15:37:34 +00:00
Michael Tuexen
d0ed75b3b1 Cleanup, no functional change intended.
MFC after:		3 days
2020-10-04 15:22:14 +00:00
Alexander V. Chernikov
fedeb08b6a Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
 relative weights and a dataplane-optimized structure to enable
 efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
 gets compiled during group creation and is basically an array of
 nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
 nexthop or nexthop group. They are distinguished by the presense of
 NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
 should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
 This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx  NhIdx     Weight   Slots                                 Gateway     Netif  Refcnt
1         ------- ------- ------- --------------------------------------- ---------       1
              13      10       1                             2001:db8::2     vlan2
              14      20       2                             2001:db8::3     vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by:	olivier
Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26449
2020-10-03 10:47:17 +00:00
Michael Tuexen
b15f541113 Improve the input validation and processing of cookies.
This avoids setting the association in an inconsistent
state, which could result in a use-after-free situation.
This can be triggered by a malicious peer, if the peer
can modify the cookie without the local endpoint recognizing
it.
Thanks to Ned Williamson for reporting the issue.

MFC after:		3 days
2020-09-29 09:36:06 +00:00
Michael Tuexen
fbc6840bae Minor cleanup.
MFC after:		3 days
2020-09-28 14:11:53 +00:00
Michael Tuexen
1d1b4bce53 Cleanup, no functional change intended.
MFC after:		3 days
2020-09-27 13:32:02 +00:00
Michael Tuexen
8f269b8242 Improve the handling of receiving unordered and unreliable user
messages using DATA chunks. Don't use fsn_included when not being
sure that it is set to an appropriate value. If the default is
used, which is -1, this can result in SCTP associaitons not
making any user visible progress.

Thanks to Yutaka Takeda for reporting this issue for the the
userland stack in https://github.com/pion/sctp/issues/138.

MFC after:		3 days
2020-09-27 13:24:01 +00:00
Richard Scheffenegger
e399566123 TCP: send full initial window when timestamps are in use
The fastpath in tcp_output tries to send out
full segments, and avoid sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider tcp options like timestamps,
while the initial window calculation is using
the correct dynamic tcp_maxseg() function.

Due to this interaction, the last, full size segment
is considered too short and not sent out immediately.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26478
2020-09-25 10:38:19 +00:00
Richard Scheffenegger
1567c937e2 TCP newreno: improve after_idle ssthresh
Adjust ssthresh in after_idle to the maximum of
the prior ssthresh, or 3/4 of the prior cwnd. See
RFC2861 section 2 for an in depth explanation for
the rationale around this.

As newreno is the default "fall-through" reaction,
most tcp variants will benefit from this.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D22438
2020-09-25 10:23:14 +00:00
Michael Tuexen
b6db274d1e Whitespace changes.
MFC after:		3 days
2020-09-24 12:26:06 +00:00
Alexander V. Chernikov
2259a03020 Rework part of routing code to reduce difference to D26449.
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
 as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
 that rtentry can contain multiple nexthops.

Differential Revision:	https://reviews.freebsd.org/D26497
2020-09-21 20:02:26 +00:00
Alexander V. Chernikov
1440f62266 Remove unused nhop_ref_any() function.
Remove "opt_mpath.h" header where not needed.

No functional changes.
2020-09-20 21:32:52 +00:00
Mitchell Horne
374ce2488a Initialize some local variables earlier
Move the initialization of these variables to the beginning of their
respective functions.

On our end this creates a small amount of unneeded churn, as these
variables are properly initialized before their first use in all cases.
However, changing this benefits at least one downstream consumer
(NetApp) by allowing local and future modifications to these functions
to be made without worrying about where the initialization occurs.

Reviewed by:	melifaro, rscheff
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26454
2020-09-18 14:01:10 +00:00
Navdeep Parhar
b092fd6c97 if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by:	kib@
Relnotes:	Yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:37:57 +00:00
Navdeep Parhar
72cc43df17 Add a knob to allow zero UDP checksums for UDP/IPv6 traffic on the given UDP port.
This will be used by some upcoming changes to if_vxlan(4).  RFC 7348 (VXLAN)
says that the UDP checksum "SHOULD be transmitted as zero.  When a packet is
received with a UDP checksum of zero, it MUST be accepted for decapsulation."
But the original IPv6 RFCs did not allow zero UDP checksum.  RFC 6935 attempts
to resolve this.

Reviewed by:	kib@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:21:15 +00:00
Michael Tuexen
42d7560796 Export the name of the congestion control. This will be used by sockstat
and netstat.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26412
2020-09-13 09:06:50 +00:00
Richard Scheffenegger
e74e64a191 cc_mod: remove unused CCF_DELACK definition
During the DCTCP improvements, use of CCF_DELACK was
removed. This change is just to rename the unused flag
bit to prevent use of it, without also re-implementing
the tcp_input and tcp_output interfaces.

No functional change.

Reviewed by:	chengc_netapp.com, tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26181
2020-09-10 00:46:38 +00:00
Randall Stewart
285385ba56 So it turns out that syzkaller hit another crash. It has to do with switching
stacks with a SENT_FIN outstanding. Both rack and bbr will only send a
FIN if all data is ack'd so this must be enforced. Also if the previous stack
sent the FIN we need to make sure in rack that when we manufacture the
"unknown" sends that we include the proper HAS_FIN bits.

Note for BBR we take a simpler approach and just refuse to switch.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D26269
2020-09-09 11:11:50 +00:00
Bjoern A. Zeeb
67d224ef43 bbr: remove unused static function
bbr_log_type_hrdwtso() is a file local static unused function.
Remove it to avoid warnings on kernel compiles.

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D26331
2020-09-05 00:20:32 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Alexander V. Chernikov
a624ca3dff Move net/route/shared.h definitions to net/route/route_var.h.
No functional changes.

net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
 of functions and structures between the routing subsystem components. At
 that time route_var.h was included by many files external to the routing
 subsystem, which largerly defeats its purpose.

As currently this is not the case anymore and amount of route_var.h includes
 is roughly the same as shared.h, retire the latter in favour of the former.
2020-08-28 22:50:20 +00:00
Michael Tuexen
404ff76bda Fix a regression with the explicit EOR mode I introduced in r364268.
A short MFC time as discussed with the secteam.

Reported by:		Taylor Brandstetter
MFC after:		1 day
2020-08-28 20:05:18 +00:00
Michael Tuexen
1951fa791e RFC 3465 defines a limit L used in TCP slow start for limiting the number
of acked bytes as described in Section 2.2 of that document.
This patch ensures that this limit is not also applied in congestion
avoidance. Applying this limit also in congestion avoidance can result in
using less bandwidth than allowed.

Reported by:		l.tian.email@gmail.com
Reviewed by:		rrs, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26120
2020-08-25 09:42:03 +00:00
Warner Losh
773e541e8d Use devctl.h instead of bus.h to reduce newbus pollution.
There's no need for these parts of the kernel to know about newbus,
so narrow what is included to devctl.h for device_notify_*.

Suggested by: kib@
2020-08-21 00:03:24 +00:00
Andrew Gallatin
b99781834f TCP: remove special treatment for hardware (ifnet) TLS
Remove most special treatment for ifnet TLS in the TCP stack, except
for code to avoid mixing handshakes and bulk data.

This code made heroic efforts to send down entire TLS records to
NICs. It was added to improve the PCIe bus efficiency of older TLS
offload NICs which did not keep state per-session, and so would need
to re-DMA the first part(s) of a TLS record if a TLS record was sent
in multiple TCP packets or TSOs. Newer TLS offload NICs do not need
this feature.

At Netflix, we've run extensive QoE tests which show that this feature
reduces client quality metrics, presumably because the effort to send
TLS records atomically causes the server to both wait too long to send
data (leading to buffers running dry), and to send too much data at
once (leading to packet loss).

Reviewed by:	hselasky,  jhb, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26103
2020-08-19 17:59:06 +00:00
Richard Scheffenegger
ad7a0eb189 TCP Cubic: recalculate cwnd for every ACK.
Since cubic calculates cwnd based on absolute
time, retaining RFC3465 (ABC) once-per-window updates
can lead to dramatic changes of cwnd in the convex
region. Updating cwnd for each incoming ack minimizes
this delta, preventing unintentional line-rate bursts.

Reviewed by:	chengc_netapp.com, tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26060
2020-08-18 19:34:31 +00:00
Michael Tuexen
d7351394da Fix two bugs I introduced in r362563.
Found by running syzkaller.

MFC after:	3 days
2020-08-18 19:25:03 +00:00
Michael Tuexen
d59f3890c3 Remove a line which is needed and was added in
https://svnweb.freebsd.org/changeset/base/364268

MFC after:		3 days
2020-08-16 13:31:14 +00:00
Michael Tuexen
f5d30f7f76 Improve the handling of concurrent send() calls for SCTP sockets,
especially when having the explicit EOR mode enabled.

Reported by:		Megan2013678@protonmail.com
Reported by:		syzbot+bc02585076c3cc977f9b@syzkaller.appspotmail.com
MFC after:		3 days
2020-08-16 11:50:37 +00:00
Michael Tuexen
04996cb74b Enter epoch earlier. This is needed because we are exiting it also
in error cases.

MFC after:	1 week
2020-08-15 11:22:07 +00:00
Alexander V. Chernikov
2f23f45b20 Simplify dom_<rtattach|rtdetach>.
Remove unused arguments from dom_rtattach/dom_rtdetach functions and make
  them return/accept 'struct rib_head' instead of 'void **'.
Declare inet/inet6 implementations in the relevant _var.h headers similar
  to domifattach / domifdetach.
Add rib_subscribe_internal() function to accept subscriptions to the rnh
  directly.

Differential Revision:	https://reviews.freebsd.org/D26053
2020-08-14 21:29:56 +00:00
Richard Scheffenegger
a459638fc4 TCP Cubic: Have Fast Convergence Heuristic work for ECN, and align concave region
The Cubic concave region was not aligned nicely for the very first exit from
slow start, where a 50% cwnd reduction is done instead of the normal 30%.

This addresses an issue, where a short line-rate burst could result from that
sudden jump of cwnd.

In addition, the Fast Convergence Heuristic has been expanded to work also
with ECN induced congestion response.

Submitted by:	chengc_netapp.com
Reported by:	chengc_netapp.com
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25976
2020-08-13 16:45:55 +00:00
Richard Scheffenegger
2bb6dfabbe TCP Cubic: After leaving slowstart fix unintended cwnd jump.
Initializing K to zero in D23655 introduced a miscalculation,
where cwnd would suddenly jump to cwnd_max instead of gradually
increasing, after leaving slow-start.

Properly calculating K instead of resetting it to zero resolves
this issue. Also making sure, that cwnd is recalculated at the
earliest opportunity once slow-start is over.

Reported by:	chengc_netapp.com
Reviewed by:	chengc_netapp.com, tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25746
2020-08-13 16:38:51 +00:00
Richard Scheffenegger
f359d6ebbc Improve SACK support code for RFC6675 and PRR
Adding proper accounting of sacked_bytes and (per-ACK)
delivered data to the SACK scoreboard. This will
allow more aspects of RFC6675 to be implemented as well
as Proportional Rate Reduction (RFC6937).

Prior to this change, the pipe calculation controlled with
net.inet.tcp.rfc6675_pipe was also susceptible to incorrect
results when more than 3 (or 4) holes in the sequence space
were present, which can no longer all fit into a single
ACK's SACK option.

Reviewed by:	kbowling, rgrimes (mentor)
Approved by:	rgrimes (mentor, blanket)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D18624
2020-08-13 16:30:09 +00:00
Hans Petter Selasky
b453d3d239 Use a static initializer for the multicast free tasks.
This makes the SYSINIT() function updated in r364072 superfluous.

Suggested by:	glebius@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2020-08-11 08:31:40 +00:00
Michael Tuexen
cf8a49ab6e Fix the following issues related to the TCP SYN-cache:
* Let the accepted TCP/IPv4 socket inherit the configured TTL and
  TOS value.
* Let the accepted TCP/IPv6 socket inherit the configured Hop Limit.
* Use the configured Hop Limit and Traffic Class when sending
  IPv6 packets.

Reviewed by:		rrs, lutz_donnerhacke.de
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25909
2020-08-10 20:24:48 +00:00
Bjoern A. Zeeb
f9461246a2 MC: add a note with reference to the discussion and history as-to why we
are where we are now.  The main thing is to try to get rid of the delayed
freeing to avoid blocking on the taskq when shutting down vnets.

X-Timeout:	if you still see this before 14-RELEASE remove it.
2020-08-10 10:58:43 +00:00
Hans Petter Selasky
3689652c65 Make sure the multicast release tasks are properly drained when
destroying a VNET or a network interface.

Else the inm release tasks, both IPv4 and IPv6 may cause a panic
accessing a freed VNET or network interface.

Reviewed by:		jmg@
Discussed with:		bz@
Differential Revision:	https://reviews.freebsd.org/D24914
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2020-08-10 10:46:08 +00:00
Hans Petter Selasky
a95ef9d38d Use proper prototype for SYSINIT() functions.
Mark the unused argument using the __unused macro.

Discussed with:		kib@
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2020-08-10 10:40:19 +00:00
Michael Tuexen
1bea15e601 Improve the ECN negotiation when the TCP SYN-cache is used by making
sure that
* ECN is disabled if the client sends an non-ECN-setup SYN segment.
* ECN is disabled is the ECN-setup SYN-ACK segment is retransmitted more
  than net.inet.tcp.ecn.maxretries times.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26008
2020-08-08 19:39:38 +00:00
Bjoern A. Zeeb
a9839c4aee IPV6_PKTINFO support for v4-mapped IPv6 sockets
When using v4-mapped IPv6 sockets with IPV6_PKTINFO we do not
respect the given v4-mapped src address on the IPv4 socket.
Implement the needed functionality. This allows single-socket
UDP applications (such as OpenVPN) to work better on FreeBSD.

Requested by:	Gert Doering (gert greenie.net), pfsense
Tested by:	Gert Doering (gert greenie.net)
Reviewed by:	melifaro
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24135
2020-08-07 15:13:53 +00:00
Randall Stewart
8315f1ea26 The recent changes to move the ref count increment
back from the end of the function created an issue.
If one of the routines returns NULL during setup
we have inp's with extra references (which is why
the increment was at the end).

Also the stack switch return code was being ignored
and actually has meaning if the stack cannot take over
it should return NULL.

Fix both of these situation by being sure to test the
return code and of course in any case of return NULL (there
are 3) make sure we properly reduce the ref count.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D25903
2020-07-31 10:03:32 +00:00
Michael Tuexen
205f3e1597 Clear the pointer to the socket when closing it also in case of
an ungraceful operation.
This fixes a use-after-free bug found and reported by Taylor
Brandstetter of Google by testing the userland stack.

MFC after:		1 week
2020-07-23 19:43:49 +00:00
Michael Tuexen
91e04f9e7a Detect and handle an invalid reassembly constellation, which results in
a memory leak.

Thanks to Felix Weinrank for finding this issue using fuzz testing the
userland stack.

MFC after:		1 week
2020-07-23 01:35:24 +00:00
Richard Scheffenegger
cce999b38f Fix style and comment around concave/convex regions in TCP cubic.
In cubic, the concave region is when snd_cwnd starts growing slower
towards max_cwnd (cwnd at the time of the congestion event), and
the convex region is when snd_cwnd starts to grow faster and
eventually appearing like slow-start like growth.

PR:		238478
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24657
2020-07-21 16:21:52 +00:00
Richard Scheffenegger
66ba9aafcf Add MODULE_VERSION to TCP loadable congestion control modules.
Without versioning information, using preexisting loader /
linker code is not easily possible when another module may
have dependencies on pre-loaded modules, and also doesn't
allow the automatic loading of dependent modules.

No functional change of the actual modules.

Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25744
2020-07-20 23:47:27 +00:00
Michael Tuexen
8745f898c4 Add reference counts for inp/stcb/net when timers are running.
This avoids a use-after-free reported for the userland stack.
Thanks to Taylor Brandstetter for suggesting a patch for
the userland stack.

MFC after:		1 week
2020-07-19 12:34:19 +00:00
Michael Tuexen
05bceec68e Remove code which is not needed.
MFC after:		1 week
2020-07-18 13:10:02 +00:00
Michael Tuexen
7f0ad2274b Improve the locking of address lists by adding some asserts and
rearranging the addition of address such that the lock is not
given up during checking and adding.

MFC after:		1 week
2020-07-17 15:09:49 +00:00
Michael Tuexen
f903a308a1 (Re)-allow 0.0.0.0 to be used as an address in connect() for TCP
In r361752 an error handling was introduced for using 0.0.0.0 or
255.255.255.255 as the address in connect() for TCP, since both
addresses can't be used. However, the stack maps 0.0.0.0 implicitly
to a local address and at least two regressions were reported.
Therefore, re-allow the usage of 0.0.0.0.
While there, change the error indicated when using 255.255.255.255
from EAFNOSUPPORT to EACCES as mentioned in the man-page of connect().

Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25401
2020-07-16 16:46:24 +00:00
Michael Tuexen
504ee6a001 Improve the error handling in generating ASCONF chunks.
In case of errors, the cleanup was not consistent.
Thanks to Felix Weinrank for fuzzing the userland stack and making
me aware of the issue.

MFC after:		1 week
2020-07-14 20:32:50 +00:00
Michael Tuexen
cd7518203d Cleanup, no functional change intended.
This file is only compiled if INET or INET6 is defined. So there
is no need for checking that.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D25635
2020-07-12 18:34:09 +00:00
Michael Tuexen
83b8204f61 (Re)activate SCTP system calls when compiling SCTP support into the kernel
r363079 introduced the possibility of loading the SCTP stack as a module in
addition to compiling it into the kernel. As part of this, the registration
of the system calls was removed and put into the loading of the module.
Therefore, the system calls are not registered anymore when compiling the
SCTP into the kernel. This patch addresses that.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D25632
2020-07-12 14:50:12 +00:00
Michael Tuexen
4ef7c2f28f Whitespace changes due to upstreaming r363079. 2020-07-10 16:59:06 +00:00
Mark Johnston
052c5ec4d0 Provide support for building SCTP as a loadable module.
With this change, a kernel compiled with "options SCTP_SUPPORT" and
without "options SCTP" supports dynamic loading of the SCTP stack.

Currently sctp.ko cannot be unloaded since some prerequisite teardown
logic is not yet implemented.  Attempts to unload the module will return
EOPNOTSUPP.

Discussed with:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21997
2020-07-10 14:56:05 +00:00
Michael Tuexen
6ddc843832 Fix a use-after-free bug for the userland stack. The kernel
stack is not affected.
Thanks to Mark Wodrich from Google for finding and reporting the
bug.

MFC after:		1 week
2020-07-10 11:15:10 +00:00
Michael Tuexen
b6734d8f4a Optimize flushing of receive queues.
This addresses an issue found and reported for the userland stack in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=21243

MFC after:		1 week
2020-07-09 16:18:42 +00:00
Michael Tuexen
fcbfdc0ab6 Improve consistency.
MFC after:		1 week
2020-07-08 16:23:40 +00:00
Michael Tuexen
ef9095c72a Fix error description.
MFC after:		1 week
2020-07-08 16:04:06 +00:00
Michael Tuexen
c96d7c373e Don't accept FORWARD-TSN chunks when I-FORWARD-TSN was negotiated
and vice versa.

MFC after:		1 week
2020-07-08 15:49:30 +00:00
Michael Tuexen
32df1c9ebb Improve handling of PKTDROP chunks. This includes the input validation
to address two issues found by ossfuzz testing the userland stack:
* https://oss-fuzz.com/testcase-detail/5387560242380800
* https://oss-fuzz.com/testcase-detail/4887954068865024
and adding support for I-DATA chunks in addition to DATA chunks.
2020-07-08 12:25:19 +00:00
Richard Scheffenegger
c201ce0b4a Fix KASSERT during tcp_newtcpcb when low on memory
While testing with system default cc set to cubic, and
running a memory exhaustion validation, FreeBSD panics for a
missing inpcb reference / lock.

Reviewed by:	rgrimes (mentor), tuexen (mentor)
Approved by:	rgrimes (mentor), tuexen (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25583
2020-07-07 12:10:59 +00:00
Alexander V. Chernikov
6ad7446c6f Complete conversions from fib<4|6>_lookup_nh_<basic|ext> to fib<4|6>_lookup().
fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees
 over pointer validness and requiring on-stack data copying.

With no callers remaining, remove fib[46]_lookup_nh_ functions.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25445
2020-07-02 21:04:08 +00:00
Michael Tuexen
e54b7cd007 Fix the cleanup handling in a error path for TCP BBR.
Reported by:		syzbot+df7899c55c4cc52f5447@syzkaller.appspotmail.com
Reviewed by:		rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25486
2020-07-01 17:17:06 +00:00
Mark Johnston
d16a2e4784 Fix a possible next-hop refcount leak when handling IPSec traffic.
It may be possible to fix this by deferring the lookup, but let's
keep the initial change simple to make MFCs easier.

PR:		246951
Reviewed by:	melifaro
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25519
2020-07-01 15:42:48 +00:00
Michael Tuexen
7a3f60e7f5 Fix a bug introduced in https://svnweb.freebsd.org/changeset/base/362173
Reported by:		syzbot+f3a6fccfa6ae9d3ded29@syzkaller.appspotmail.com
MFC after:		1 week
2020-06-30 21:50:05 +00:00
Michael Tuexen
e99ce3eac5 Don't send packets containing ERROR chunks in response to unknown
chunks when being in a state where the verification tag to be used
is not known yet.

MFC after:		1 week
2020-06-28 14:11:36 +00:00
Michael Tuexen
f2f66ef6d2 Don't check ch for not being NULL, since that is true.
MFC after:		1 week
2020-06-28 11:12:03 +00:00
John Baldwin
4a711b8d04 Use zfree() instead of explicit_bzero() and free().
In addition to reducing lines of code, this also ensures that the full
allocation is always zeroed avoiding possible bugs with incorrect
lengths passed to explicit_bzero().

Suggested by:	cem
Reviewed by:	cem, delphij
Approved by:	csprng (cem)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25435
2020-06-25 20:17:34 +00:00
Michael Tuexen
132c073866 Fix the acconting for fragmented unordered messages when using
interleaving.
This was reported for the userland stack in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19321

MFC after:		1 week
2020-06-24 14:47:51 +00:00
Richard Scheffenegger
6e26dd0dbe TCP: fix cubic RTO reaction.
Proper TCP Cubic operation requires the knowledge
of the maximum congestion window prior to the
last congestion event.

This restores and improves a bugfix previously added
by jtl@ but subsequently removed due to a revert.

Reported by:	chengc_netapp.com
Reviewed by:	chengc_netapp.com, tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25133
2020-06-24 13:52:53 +00:00
Richard Scheffenegger
9dc7d8a246 TCP: make after-idle work for transactional sessions.
The use of t_rcvtime as proxy for the last transmission
fails for transactional IO, where the client requests
data before the server can respond with a bulk transfer.

Set aside a dedicated variable to actually track the last
locally sent segment going forward.

Reported by:	rrs
Reviewed by:	rrs, tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25016
2020-06-24 13:42:42 +00:00
Michael Tuexen
87c0bf77d9 Fix alignment issue manifesting in the userland stack.
MFC after:		1 wwek
2020-06-23 23:05:05 +00:00
Michael Tuexen
b88082dd39 No need to include netinet/sctp_crc32.h twice. 2020-06-22 14:36:14 +00:00
Mark Johnston
e6db509d10 Move the definition of SCTP's system_base_info into sctp_crc32.c.
This file is the only SCTP source file compiled into the kernel when
SCTP_SUPPORT is configured.  sctp_delayed_checksum() references a couple
of counters defined in system_base_info, so the change allows these
counters to be referenced in a kernel compiled without "options SCTP".

Submitted by:	tuexen
MFC with:	r362338
2020-06-22 14:01:31 +00:00
Michael Tuexen
c5d9e5c99e Cleanup the defintion of struct sctp_getaddresses. This stucture
is used by the IPPROTO_SCTP level socket options SCTP_GET_PEER_ADDRESSES
and SCTP_GET_LOCAL_ADDRESSES, which are used by libc to implement
sctp_getladdrs() and sctp_getpaddrs().
These changes allow an old libc to work on a newer kernel.
2020-06-21 23:12:56 +00:00
Bjoern A. Zeeb
e387af1fa8 Rather than zeroing MAXVIFS times size of pointer [r362289] (still better than
sizeof pointer before [r354857]), we need to zero MAXVIFS times the size of
the struct.  All good things come in threes; I hope this is it on this one.

PR:		246629, 206583
Reported by:	kib
MFC after:	ASAP
2020-06-21 22:09:30 +00:00
Michael Tuexen
171edd2110 Fix the build for an INET6 only configuration.
The fix from the last commit is actually needed twice...

MFC after:		1 week
2020-06-21 09:56:09 +00:00
Michael Tuexen
5087b6e732 Set a variable also in the case of an INET6 only kernel
MFC after:		1 week
2020-06-20 23:48:57 +00:00
Michael Tuexen
ed82c2edd6 Use a struct sockaddr_in pr struct sockaddr_in6 as the option value
for the IPPROTO_SCTP level socket options SCTP_BINDX_ADD_ADDR and
SCTP_BINDX_REM_ADDR. These socket option are intended for internal
use only to implement sctp_bindx().
This is one user of struct sctp_getaddresses less.
struct sctp_getaddresses is strange and will be changed shortly.
2020-06-20 21:06:02 +00:00
Michael Tuexen
7621bd5ead Cleanup the adding and deleting of addresses via sctp_bindx().
There is no need to use the association identifier, so remove it.
While there, cleanup the code a bit.

MFC after:		1 week
2020-06-20 20:20:16 +00:00
Michael Tuexen
7a9dbc33f9 Remove last argument of sctp_addr_mgmt_ep_sa(), since it is not used.
MFC after:		1 week
2020-06-19 12:35:29 +00:00
Mark Johnston
95033af923 Add the SCTP_SUPPORT kernel option.
This is in preparation for enabling a loadable SCTP stack.  Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.

Discussed with:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-18 19:32:34 +00:00
Bjoern A. Zeeb
ce19cceb8d When converting the static arrays to mallocarray() in r356621 I missed
one place where we now need to multiply the size of the struct with the
number of entries.  This lead to problems when restarting user space
daemons, as the cleanup was never properly done, resulting in MRT_ADD_VIF
EADDRINUSE.
Properly zero all array elements to avoid this problem.

PR:		246629, 206583
Reported by:	(many)
MFC after:	4 days
Sponsored by:	Rubicon Communications, LLC (d/b/a "Netgate")
2020-06-17 21:04:38 +00:00
Bjoern A. Zeeb
b7b3d237e7 The call into ifa_ifwithaddr() needs to be epoch protected; ortherwise
we'll panic on an assertion.
While here, leave a comment that the ifp was never protected and stable
(as glebius pointed out) and this needs to be fixed properly.

Discovered while working on:	PR 246629
Reviewed by:	glebius
MFC after:	4 days
Sponsored by:	Rubicon Communications, LLC (d/b/a "Netgate")
2020-06-17 20:58:37 +00:00
Michael Tuexen
2d87bacde4 Allow the self reference to be NULL in case the timer was stopped.
Submitted by:		Timo Voelker
MFC after:		1 week
2020-06-17 15:27:45 +00:00
Tom Jones
d88fe3d964 Add header definition for RFC4340, Datagram Congestion Control Protocol
Add a header definition for DCCP as defined in RFC4340. This header definition
is required to perform validation when receiving and forwarding DCCP packets.
We do not currently support DCCP.

Reviewed by:	gallatin, bz
Approved by:	bz (co-mentor)
MFC after:	1 week
MFC with:	350749
Differential Revision:	https://reviews.freebsd.org/D21179
2020-06-17 13:27:13 +00:00
Randall Stewart
95ef69c63c iSo in doing final checks on OCA firmware with all the latest tweaks the dup-ack checking
packet drill script was failing with a number of unexpected acks. So it turns
out if you have the default recvwin set up to 1Meg (like OCA's do) and you
have no window scaling (like the dupack checking code) then we have another
case where we are always trying to update the rwnd and sending an
ack when we should not.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D25298
2020-06-16 18:16:45 +00:00
Randall Stewart
4d418f8da8 So it turns out rack has a shortcoming in dup-ack counting. It counts the dupacks but
then does not properly respond to them. This is because a few missing bits are not present.
BBR actually does properly respond (though it also sends a TLP which is interesting and
maybe something to fix)..

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D25294
2020-06-16 12:26:23 +00:00
Michael Tuexen
b231bff8b2 Allocate the mbuf for the signature in the COOKIE or the correct size.
While there, do also do some cleanups.

MFC after:		1 week
2020-06-14 16:05:08 +00:00
Michael Tuexen
4471043177 Cleanups, no functional change.
MFC after:		1 week
2020-06-14 09:50:00 +00:00
Michael Tuexen
d60bdf8569 Remove usage of empty macro.
MFC after:		1 week
2020-06-13 21:23:26 +00:00
Michael Tuexen
64c8fc5de8 Simpify a condition, no functional change.
MFC after:		1 week
2020-06-13 18:38:59 +00:00
Randall Stewart
f092a3c71c So it turns out with the right window scaling you can get the code in all stacks to
always want to do a window update, even when no data can be sent. Now in
cases where you are not pacing thats probably ok, you just send an extra
window update or two. However with bbr (and rack if its paced) every time
the pacer goes off its going to send a "window update".

Also in testing bbr I have found that if we are not responding to
data right away we end up staying in startup but incorrectly holding
a pacing gain of 192 (a loss). This is because the idle window code
does not restict itself to only work with PROBE_BW. In all other
states you dont want it doing a PROBE_BW state change.

Sponsored by:	Netflix Inc.
Differential Revision: 	https://reviews.freebsd.org/D25247
2020-06-12 19:56:19 +00:00
Michael Tuexen
3ee11586b2 Whitespace change due to upstream cleanup.
MFC after:		1 week
2020-06-12 16:40:10 +00:00
Michael Tuexen
2f9e6db0be More cleanups due to ifdef cleanup done upstream
MFC after:		1 week
2020-06-12 16:31:13 +00:00
Michael Tuexen
306c2ba375 Small cleanup due to upstream ifdef cleanups.
MFC after:		1 week
2020-06-12 10:13:23 +00:00
Michael Tuexen
28397ac1ed Non-functional changes due to upstream cleanup.
MFC after:		1 week
2020-06-11 13:34:09 +00:00
Richard Scheffenegger
2fda0a6f3a Prevent TCP Cubic to abruptly increase cwnd after app-limited
Cubic calculates the new cwnd based on absolute time
elapsed since the start of an epoch. A cubic epoch is
started on congestion events, or once the congestion
avoidance phase is started, after slow-start has
completed.

When a sender is application limited for an extended
amount of time and subsequently a larger volume of data
becomes ready for sending, Cubic recalculates cwnd
with a lingering cubic epoch. This recalculation
of the cwnd can induce a massive increase in cwnd,
causing a burst of data to be sent at line rate by
the sender.

This adds a flag to reset the cubic epoch once a
session transitions from app-limited to cwnd-limited
to prevent the above effect.

Reviewed by:	chengc_netapp.com, tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D25065
2020-06-10 07:32:02 +00:00
Richard Scheffenegger
6907bbae18 Prevent TCP Cubic to abruptly increase cwnd after slow-start
Introducing flags to track the initial Wmax dragging and exit
from slow-start in TCP Cubic. This prevents sudden jumps in the
caluclated cwnd by cubic, especially when the flow is application
limited during slow start (cwnd can not grow as fast as expected).
The downside is that cubic may remain slightly longer in the
concave region before starting the convex region beyond Wmax again.

Reviewed by:	chengc_netapp.com, tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor, blanket)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23655
2020-06-09 21:07:58 +00:00
Michael Tuexen
5fb132abbb Whitespace cleanups and removal of a stale comment.
MFC after:		1 week
2020-06-08 20:23:20 +00:00
Randall Stewart
e854dd38ac An important statistic in determining if a server process (or client) is being delayed
is to know the time to first byte in and time to first byte out. Currently we
have no way to know these all we have is t_starttime. That (t_starttime) tells us
what time the 3 way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor from a client perspective
do we know how long from when we sent out the first byte before the
server responded.

This small change adds the ability to track the TTFB's. This will show up in
BB logging which then can be pulled for later analysis. Note that currently
the tracking is via the ticks variable of all three variables. This provides
a very rough estimate (hz=1000 its 1ms). A follow-on set of work will be
to change all three of these values into something with a much finer resolution
(either microseconds or nanoseconds), though we may want to make the resolution
configurable so that on lower powered machines we could still use the much
cheaper ticks variable.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24902
2020-06-08 11:48:07 +00:00
Michael Tuexen
70486b27ae Retire SCTP_SO_LOCK_TESTING.
This was intended to test the locking used in the MacOS X kernel on a
FreeBSD system, to make use of WITNESS and other debugging infrastructure.
This hasn't been used for ages, to take it out to reduce the #ifdef
complexity.

MFC after:		1 week
2020-06-07 14:39:20 +00:00
Michael Tuexen
3f53d62236 Fix typo in comment.
Submitted by Orgad Shaneh for the userland stack.
MFC after:		1 week
2020-06-06 21:26:34 +00:00
Michael Tuexen
2cf3347109 Non-functional changes due to cleanup (upstream removing of Panda support)
of the code

MFC after:		1 week
2020-06-06 18:20:09 +00:00
Randall Stewart
2cf21ae559 We should never allow either the broadcast or IN_ADDR_ANY to be
connected to or sent to. This was fond when working with Michael
Tuexen and Skyzaller. Skyzaller seems to want to use either of
these two addresses to connect to at times. And it really is
an error to do so, so lets not allow that behavior.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24852
2020-06-03 14:16:40 +00:00
Randall Stewart
f1ea4e4120 This fixes a couple of skyzaller crashes. Most
of them have to do with TFO. Even the default stack
had one of the issues:

1) We need to make sure for rack that we don't advance
   snd_nxt beyond iss when we are not doing fast open. We
   otherwise can get a bunch of SYN's sent out incorrectly
   with the seq number advancing.
2) When we complete the 3-way handshake we should not ever
   append to reassembly if the tlen is 0, if TFO is enabled
   prior to this fix we could still call the reasemmbly. Note
   this effects all three stacks.
3) Rack like its cousin BBR should track if a SYN is on a
   send map entry.
4) Both bbr and rack need to only consider len incremented on a SYN
   if the starting seq is iss, otherwise we don't increment len which
   may mean we return without adding a sendmap entry.

This work was done in collaberation with Michael Tuexen, thanks for
all the testing!
Sponsored by:	Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D25000
2020-06-03 14:07:31 +00:00
Michael Tuexen
d442a65733 Restrict enabling TCP-FASTOPEN to end-points in CLOSED or LISTEN state
Enabling TCP-FASTOPEN on an end-point which is in a state other than
CLOSED or LISTEN, is a bug in the application. So it should not work.
Also the TCP code does not (and needs not to) handle this.
While there, also simplify the setting of the TF_FASTOPEN flag.

This issue was found by running syzkaller.

Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25115
2020-06-03 13:51:53 +00:00
Alexander V. Chernikov
da187ddb3d * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D25067
2020-06-01 20:49:42 +00:00
Alexander V. Chernikov
e7403d0230 Revert r361704, it accidentally committed merged D25067 and D25070. 2020-06-01 20:40:40 +00:00
Alexander V. Chernikov
79674562b8 * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:	ae
Differential Revision: https://reviews.freebsd.org/D25067
2020-06-01 20:32:02 +00:00
Alexander V. Chernikov
a37a5246ca Use fib[46]_lookup() in mtu calculations.
fib[46]_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Differential Revision:	https://reviews.freebsd.org/D24974
2020-05-28 08:00:08 +00:00
Alexander V. Chernikov
3553b3007f Switch ip_output/icmp_reflect rt lookup calls with fib4_lookup.
fib4_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24976
2020-05-28 07:31:53 +00:00
Alexander V. Chernikov
7bfc98af12 Switch gif(4) path verification to fib[46]_check_urfp().
fibX_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Use specialized fib[46]_check_urpf() from newer KPI instead,
to allow removal of older KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24978
2020-05-28 07:26:18 +00:00
Emmanuel Vadot
77c68315f6 bbr: Use arc4random_uniform from libkern.
This unbreak LINT build

Reported by:	jenkins, melifaro
2020-05-23 19:52:20 +00:00
Alexander V. Chernikov
4d2c2509f2 Move <add|del|change>_route() functions to route_ctl.c in preparation of
multipath control plane changed described in D24141.

Currently route.c contains core routing init/teardown functions, route table
 manipulation functions and various helper functions, resulting in >2KLOC
 file in total. This change moves most of the route table manipulation parts
 to a dedicated file, simplifying planned multipath changes and making
 route.c more manageable.

Differential Revision:	https://reviews.freebsd.org/D24870
2020-05-23 19:06:57 +00:00
Richard Scheffenegger
e68cde59c3 DCTCP: update alpha only once after loss recovery.
In mixed ECN marking and loss scenarios it was found, that
the alpha value of DCTCP is updated two times. The second
update happens with freshly initialized counters indicating
to ECN loss. Overall this leads to alpha not adjusting as
quickly as expected to ECN markings, and therefore lead to
excessive loss.

Reported by:	Cheng Cui
Reviewed by:	chengc_netapp.com, rrs, tuexen (mentor)
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24817
2020-05-21 21:42:49 +00:00
Richard Scheffenegger
af2fb894c9 With RFC3168 ECN, CWR SHOULD only be sent with new data
Overly conservative data receivers may ignore the CWR flag
on other packets, and keep ECE latched. This can result in
continous reduction of the congestion window, and very poor
performance when ECN is enabled.

Reviewed by:	rgrimes (mentor), rrs
Approved by:	rgrimes (mentor), tuexen (mentor)
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23364
2020-05-21 21:33:15 +00:00
Richard Scheffenegger
8e0511652b Retain only mutually supported TCP options after simultaneous SYN
When receiving a parallel SYN in SYN-SENT state, remove all the
options only we supported locally before sending the SYN,ACK.

This addresses a consistency issue on parallel opens.

Also, on such a parallel open, the stack could be coaxed into
running with timestamps enabled, even if administratively disabled.

Reviewed by:	tuexen (mentor)
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23371
2020-05-21 21:26:21 +00:00
Richard Scheffenegger
6e16d87751 Handle ECN handshake in simultaneous open
While testing simultaneous open TCP with ECN, found that
negotiation fails to arrive at the expected final state.

Reviewed by:	tuexen (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23373
2020-05-21 21:15:25 +00:00
Mark Johnston
591b09b486 Define a module version for accept filter modules.
Otherwise accept filters compiled into the kernel do not preempt
preloaded accept filter modules.  Then, the preloaded file registers its
accept filter module before the kernel, and the kernel's attempt fails
since duplicate accept filter list entries are not permitted.  This
causes the preloaded file's module to be released, since
module_register_init() does a lookup by name, so the preloaded file is
unloaded, and the accept filter's callback points to random memory since
preload_delete_name() unmaps the file on x86 as of r336505.

Add a new ACCEPT_FILTER_DEFINE macro which wraps the accept filter and
module definitions, and ensures that a module version is defined.

PR:		245870
Reported by:	Thomas von Dein <freebsd@daemon.de>
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-05-19 18:35:08 +00:00
Michael Tuexen
999f86d67d Replace snprintf() by SCTP_SNPRINTF() and let SCTP_SNPRINTF() map
to snprintf() on FreeBSD. This allows to check for failures of snprintf()
on platforms other than FreeBSD kernel.
2020-05-19 07:23:35 +00:00
Michael Tuexen
821bae7cf3 Revert r361209:
cem noted that on FreeBSD snprintf() can not fail and code should not
check for that.

A followup commit will replace the usage of snprintf() in the SCTP
sources with a variadic macro SCTP_SNPRINTF, which will simply map to
snprintf() on FreeBSD and do a checking similar to r361209 on
other platforms.
2020-05-19 07:21:11 +00:00
Mike Karels
2fdbcbea76 Fix NULL-pointer bug from r361228.
Note that in_pcb_lport and in_pcb_lport_dest can be called with a NULL
local address for IPv6 sockets; handle it.  Found by syzkaller.

Reported by:	cem
MFC after:	1 month
2020-05-19 01:05:13 +00:00
Mike Karels
2510235150 Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by:	tuexen (rscheff, rrs previous version)
MFC after:	1 month
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D24781
2020-05-18 22:53:12 +00:00
Michael Tuexen
6863ab0b8b Remove assignment without effect.
MFC after:	3 days
2020-05-18 19:48:38 +00:00
Michael Tuexen
9e8c9c9ef6 Don't check an unsigned variable for being negative.
MFC after:		3 days.
2020-05-18 19:35:46 +00:00
Michael Tuexen
04ce5df9c9 Remove redundant assignment.
MFC after:		3 days
2020-05-18 19:23:01 +00:00
Michael Tuexen
bca1802890 Cleanup, no functional change intended.
MFC after:		3 days
2020-05-18 18:42:43 +00:00
Michael Tuexen
6395219ae3 Avoid an integer underflow.
MFC after:		3 days
2020-05-18 18:32:58 +00:00
Michael Tuexen
60017c8e88 Remove redundant check.
MFC after:		3 days
2020-05-18 18:27:10 +00:00
Michael Tuexen
88116b7e4d Fix logical condition by looking at usecs.
This issue was found by cpp-check running on the userland stack.

MFC after:		3 days
2020-05-18 15:02:15 +00:00
Michael Tuexen
00023d8a87 Whitespace change.
MFC after:		3 days
2020-05-18 15:00:18 +00:00
Michael Tuexen
e708e2a4f4 Handle failures of snprintf().
MFC after:		3 days
2020-05-18 10:07:01 +00:00
Michael Tuexen
da8c34c382 Non-functional changes, cleanups.
MFC after:		3 days
2020-05-17 22:31:38 +00:00
Alexander V. Chernikov
174fb9dbb1 Remove redundant checks for nhop validity.
Currently NH_IS_VALID() simly aliases to RT_LINK_IS_UP(), so we're
 checking the same thing twice.

In the near future the implementation of this check will be simpler,
 as there are plans to introduce control-plane interface status monitoring
 similar to ipfw interface tracker.
2020-05-17 15:32:36 +00:00
Michael Tuexen
daf143413a Ensure that an stcb is not dereferenced when it is about to be
freed.
This issue was found by SYZKALLER.

MFC after:		3 days
2020-05-16 19:26:39 +00:00
Ed Maste
65a1d63665 libalias: retire cuseeme support
The CU-SeeMe videoconferencing client and associated protocol is at this
point a historical artifact; there is no need to retain support for this
protocol today.

Reviewed by:	philip, markj, allanjude
Relnotes:	Yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24790
2020-05-16 02:29:10 +00:00
Michael Tuexen
e240ce42bf Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets.
This problem was found by looking at syzkaller reproducers for some other
problems.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24831
2020-05-15 14:06:37 +00:00
Randall Stewart
777b88d60f This fixes several skyzaller issues found with the
help of Michael Tuexen. There was some accounting
errors with TCPFO for bbr and also for both rack
and bbr there was a FO case where we should be
jumping to the just_return_nolock label to
exit instead of returning 0. This of course
caused no timer to be running and thus the
stuck sessions.

Reported by: Michael Tuexen and Skyzaller
Sponsored by: Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24852
2020-05-15 14:00:12 +00:00
Ed Maste
46701f31be libalias: fix potential memory disclosure from ftp module
admbugs:	956
Submitted by:	markj
Reported by:	Vishnu Dev TJ working with Trend Micro Zero Day Initiative
Security:	FreeBSD-SA-20:13.libalias
Security:	CVE-2020-7455
Security:	ZDI-CAN-10849
2020-05-12 16:38:28 +00:00
Ed Maste
6461c83e09 libalias: validate packet lengths before accessing headers
admbugs:	956
Submitted by:	ae
Reported by:	Lucas Leong (@_wmliang_) of Trend Micro Zero Day Initiative
Reported by:	Vishnu working with Trend Micro Zero Day Initiative
Security:	FreeBSD-SA-20:12.libalias
2020-05-12 16:33:04 +00:00
Michael Tuexen
86fd36c502 Fix a copy and paste error introduced in r360878.
Reported-by:		syzbot+a0863e972771f2f0d4b3@syzkaller.appspotmail.com
Reported-by:		syzbot+4481757e967ba83c445a@syzkaller.appspotmail.com
MFC after:		3 days
2020-05-11 22:47:20 +00:00
Alexander V. Chernikov
1d1a743e9f Fix NOINET[6] build by using af-independent route lookup function.
Reported by:	rpokala
2020-05-11 20:41:03 +00:00
Andrew Gallatin
6043ac201a Ktls: never skip stamping tags for NIC TLS
The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.

When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped.  Failure
to stamp a NIC TLS packet will result in crypto
issues.

Reviewed by:	hselasky, rrs
Sponsored by:	Netflix, Mellanox
2020-05-11 19:17:33 +00:00
Michael Tuexen
83ed508055 Ensure that the SCTP iterator runs with an stcb and inp, which belong to
each other.

Reported by:	syzbot+82d39d14f2f765e38db0@syzkaller.appspotmail.com
MFC after:	3 days
2020-05-10 22:54:30 +00:00
Michael Tuexen
9d176904ae Remove trailing whitespace. 2020-05-10 17:43:42 +00:00
Michael Tuexen
efd5e69291 Ensure that we have a path when starting the T3 RXT timer.
Reported by:	syzbot+f2321629047f89486fa3@syzkaller.appspotmail.com
MFC after:	3 days
2020-05-10 17:19:19 +00:00
Michael Tuexen
8123bbf186 Only drop DATA chunk with lower priorities as specified in RFC 7496.
This issue was found by looking at a reproducer generated by syzkaller.

MFC after:		3 days
2020-05-10 10:03:10 +00:00
Randall Stewart
b1ddcbc62c When in the SYN-SENT state bbr and rack will not properly send an ACK but instead start the D-ACK timer. This
causes so_reuseport_lb_test to fail since it slows down how quickly the program runs until the timeout occurs
and fails the test

Sponsored by: Netflix inc.
Differential Revision:	https://reviews.freebsd.org/D24747
2020-05-07 20:29:38 +00:00
Randall Stewart
8717b8f1bb NF has an internal option that changes the tcp_mcopy_m routine slightly (has
a few extra arguments). Recently that changed to only have one arg extra so
that two ifdefs around the call are no longer needed. Lets take out the
extra ifdef and arg.

Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D24736
2020-05-07 10:46:02 +00:00
Michael Tuexen
cb9fb7b2cb Avoid underflowing a variable, which would result in taking more
data from the stream queues then needed.

Thanks to Timo Voelker for finding this bug and providing a fix.

MFC after:		3 days
2020-05-05 19:54:30 +00:00
Michael Tuexen
d3c3d6f99c Fix the computation of the numbers of entries of the mapping array to
look at when generating a SACK. This was wrong in case of sequence
numbers wrap arounds.

Thanks to Gwenael FOURRE for reporting the issue for the userland stack:
https://github.com/sctplab/usrsctp/issues/462
MFC after:		3 days
2020-05-05 17:52:44 +00:00
Michael Tuexen
51a5392297 Add net epoch support back, which was taken out by accident in
https://svnweb.freebsd.org/changeset/base/360639

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24694
2020-05-04 23:05:11 +00:00
Randall Stewart
570045a0fc This fixes two issues found by ankitraheja09@gmail.com
1) When BBR retransmits the syn it was messing up the snd_max
2) When we need to send a RST we might not send it when we should

Reported by:	ankitraheja09@gmail.com
Sponsored by:  Netflix.com
Differential Revision: https://reviews.freebsd.org/D24693
2020-05-04 23:02:58 +00:00
Michael Tuexen
7985fd7e76 Enter the net epoch before calling the output routine in TCP BBR.
This was only triggered when setting the IPPROTO_TCP level socket
option TCP_DELACK.
This issue was found by runnning an instance of SYZKALLER.
Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24690
2020-05-04 22:02:49 +00:00
Randall Stewart
963fb2ad94 This commit brings things into sync with the advancements that
have been made in rack and adds a few fixes in BBR. This also
removes any possibility of incorrectly doing OOB data the stacks
do not support it. Should fix the skyzaller crashes seen in the
past. Still to fix is the BBR issue just reported this weekend
with the SYN and on sending a RST. Note that this version of
rack can now do pacing as well.

Sponsored by:Netflix Inc
Differential Revision:https://reviews.freebsd.org/D24576
2020-05-04 20:28:53 +00:00
Randall Stewart
d3b6c96b7d Adjust the fb to have a way to ask the underlying stack
if it can support the PRUS option (OOB). And then have
the new function call that to validate and give the
correct error response if needed to the user (rack
and bbr do not support obsoleted OOB data).

Sponsoered by: Netflix Inc.
Differential Revision:	 https://reviews.freebsd.org/D24574
2020-05-04 20:19:57 +00:00
Alexander V. Chernikov
9e02229580 Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields.
After converting routing subsystem customers to use nexthop objects
 defined in r359823, some fields in struct rtentry became unused.

This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
 along with the code initializing and updating these fields.

Cleanup of the remaining fields will be addressed by D24669.

This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
 resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
 nexthop and tries to update rte nhop pointer. Only last part is done under
 WLOCK.
In the hypothetical scenarious where multiple rtsock clients
 repeatedly issue RTM_CHANGE requests for the same route, route may get
 updated between read and update operation. This is addressed by retrying
 the operation multiple (3) times before returning failure back to the
 caller.

Differential Revision:	https://reviews.freebsd.org/D24666
2020-05-04 14:31:45 +00:00
Gleb Smirnoff
61664ee700 Step 4.2: start divorce of M_EXT and M_EXTPG
They have more differencies than similarities. For now there is lots
of code that would check for M_EXT only and work correctly on M_EXTPG
buffers, so still carry M_EXT bit together with M_EXTPG. However,
prepare some code for explicit check for M_EXTPG.

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:37:16 +00:00
Gleb Smirnoff
6edfd179c8 Step 4.1: mechanically rename M_NOMAP to M_EXTPG
Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:21:11 +00:00
Gleb Smirnoff
7b6c99d08d Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:12:56 +00:00
Richard Scheffenegger
14558b9953 Introduce a lower bound of 2 MSS to TCP Cubic.
Running TCP Cubic together with ECN could end up reducing cwnd down to 1 byte, if the
receiver continously sets the ECE flag, resulting in very poor transmission speeds.

In line with RFC6582 App. B, a lower bound of 2 MSS is introduced, as well as a typecast
to prevent any potential integer overflows during intermediate calculation steps of the
adjusted cwnd.

Reported by:	Cheng Cui
Reviewed by:	tuexen (mentor)
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23353
2020-04-30 11:11:28 +00:00
Richard Scheffenegger
9028b6e0d9 Prevent premature shrinking of the scaled receive window
which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets.

Packets with old sequence numbers are ignored and not used to update the send window size.
This might cause the TCP session to hang indefinitely under some circumstances.

Reported by:	Cui Cheng
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24515
2020-04-29 22:01:33 +00:00
Richard Scheffenegger
b2ade6b166 Correctly set up the initial TCP congestion window in all cases,
by not including the SYN bit sequence space in cwnd related calculations.
Snd_und is adjusted explicitly in all cases, outside the cwnd update, instead.

This fixes an off-by-one conformance issue with regular TCP sessions not
using Appropriate Byte Counting (RFC3465), sending one more packet during
the initial window than expected.

PR:		235256
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	3 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D19000
2020-04-29 21:48:52 +00:00
Alexander V. Chernikov
e7d8af4f65 Move route_temporal.c and route_var.h to net/route.
Nexthop objects implementation, defined in r359823,
 introduced sys/net/route directory intended to hold all
 routing-related code. Move recently-introduced route_temporal.c and
 private route_var.h header there.

Differential Revision:	https://reviews.freebsd.org/D24597
2020-04-28 19:14:09 +00:00
Alexander V. Chernikov
4043ee3cd7 Convert rtalloc_mpath_fib() users to the new KPI.
New fib[46]_lookup() functions support multipath transparently.
Given that, switch the last rtalloc_mpath_fib() calls to
 dib4_lookup() and eliminate the function itself.

Note: proper flowid generation (especially for the outbound traffic) is a
 bigger topic and will be handled in a separate review.
This change leaves flowid generation intact.

Differential Revision:	https://reviews.freebsd.org/D24595
2020-04-28 08:06:56 +00:00
Alexander V. Chernikov
1b0051bada Eliminate now-unused parts of old routing KPI.
r360292 switched most of the remaining routing customers to a new KPI,
 leaving a bunch of wrappers for old routing lookup functions unused.

Remove them from the tree as a part of routing cleanup.

Differential Revision:	https://reviews.freebsd.org/D24569
2020-04-28 07:25:34 +00:00
John Baldwin
f1f9347546 Initial support for kernel offload of TLS receive.
- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
  authentication algorithms and keys as well as the initial sequence
  number.

- When reading from a socket using KTLS receive, applications must use
  recvmsg().  Each successful call to recvmsg() will return a single
  TLS record.  A new TCP control message, TLS_GET_RECORD, will contain
  the TLS record header of the decrypted record.  The regular message
  buffer passed to recvmsg() will receive the decrypted payload.  This
  is similar to the interface used by Linux's KTLS RX except that
  Linux does not return the full TLS header in the control message.

- Add plumbing to the TOE KTLS interface to request either transmit
  or receive KTLS sessions.

- When a socket is using receive KTLS, redirect reads from
  soreceive_stream() into soreceive_generic().

- Note that this interface is currently only defined for TLS 1.1 and
  1.2, though I believe we will be able to reuse the same interface
  and structures for 1.3.
2020-04-27 23:17:19 +00:00
John Baldwin
ec1db6e13d Add the initial sequence number to the TLS enable socket option.
This will be needed for KTLS RX.

Reviewed by:	gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D24451
2020-04-27 22:31:42 +00:00
Randall Stewart
e570d231f4 This change does a small prepratory step in getting the
latest rack and bbr in from the NF repo. When those come
in the OOB data handling will be fixed where Skyzaller crashes.

Differential Revision:	https://reviews.freebsd.org/D24575
2020-04-27 16:30:29 +00:00
Alexander V. Chernikov
55f57ca9ac Convert debugnet to the new routing KPI.
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.

Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D24555
2020-04-26 18:42:38 +00:00
Alexander V. Chernikov
17cb6ddba8 Fix order of arguments in fib[46]_lookup calls in SCTP.
r360292 introduced the wrong order, resulting in returned
 nhops not being referenced, despite the fact that references
 were requested. That lead to random GPF after using SCTP sockets.

Special defined macro like IPV[46]_SCOPE_GLOBAL will be introduced
 soon to reduce the chance of putting arguments in wrong order.

Reported-by: syzbot+5c813c01096363174684@syzkaller.appspotmail.com
2020-04-26 13:02:42 +00:00
Alexander V. Chernikov
454d389645 Fix LINT build #2 after r360292.
Pointyhat to: melifaro
2020-04-25 11:35:38 +00:00
Alexander V. Chernikov
ac99fd86d4 Fix LINT build broken by r360292. 2020-04-25 10:31:56 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Alexander V. Chernikov
9e88f47c8f Unbreak LINT-NOINET[6] builds broken in r360191.
Reported by:	np
2020-04-23 06:55:33 +00:00
Michael Tuexen
8262311cbe Improve input validation when processing AUTH chunks.
Thanks to Natalie Silvanovich from Google for finding and reporting the
issue found by her in the SCTP userland stack.

MFC after:		3 days
X-MFC with:		https://svnweb.freebsd.org/changeset/base/360193
2020-04-22 21:22:33 +00:00
Michael Tuexen
97feba891d Improve input validation when processing AUTH chunks.
Thanks to Natalie Silvanovich from Google for finding and reporting the
issue found by her in the SCTP userland stack.

MFC after:		3 days
2020-04-22 12:47:46 +00:00
Alexander V. Chernikov
8d6708ba80 Convert TOE routing lookups to the new routing KPI.
Reviewed by:	np
Differential Revision:	https://reviews.freebsd.org/D24388
2020-04-22 07:53:43 +00:00
Richard Scheffenegger
bb410f9ff2 revert rS360143 - Correctly set up initial cwnd
due to syzkaller panics found

Reported by:	tuexen
Approved by:	tuexen (mentor)
Sponsored by:	NetApp, Inc.
2020-04-22 00:16:42 +00:00
Richard Scheffenegger
73b7696693 Correctly set up the initial TCP congestion window
in all cases, by adjust snd_una right after the
connection initialization, to include the one byte
in sequence space occupied by the SYN bit.

This does not change the regular ACK processing,
while making the BYTES_THIS_ACK macro to work properly.

PR:		235256
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D19000
2020-04-21 13:05:44 +00:00
Jonathan T. Looney
5d6e356cb0 Avoid calling protocol drain routines more than once per reclamation event.
mb_reclaim() calls the protocol drain routines for each protocol in each
domain. Some protocols exist in more than one domain and share drain
routines. In the case of SCTP, it also uses the same drain routine for
its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain.

On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls
sctp_drain() four times. On systems with INET and INET6 defined,
mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree
caller of the pr_drain protocol entry.

Eliminate this duplication by ensuring that each pr_drain routine is only
specified for one protocol entry in one domain.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24418
2020-04-16 20:17:24 +00:00
Alexander V. Chernikov
539642a29d Add nhop parameter to rti_filter callback.
One of the goals of the new routing KPI defined in r359823 is to
 entirely hide`struct rtentry` from the consumers. It will allow to
 improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
 struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
 to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
 routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24440
2020-04-16 17:20:18 +00:00
Richard Scheffenegger
d7ca3f780d Reduce default TCP delayed ACK timeout to 40ms.
Reviewed by:	kbowling, tuexen
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23281
2020-04-16 15:59:23 +00:00
Alexander V. Chernikov
9ac7c6cfed Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to
the new routing KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24245
2020-04-14 23:06:25 +00:00
Michael Tuexen
b89af8e16d Improve the TCP blackhole detection. The principle is to reduce the
MSS in two steps and try each candidate two times. However, if two
candidates are the same (which is the case in TCP/IPv6), this candidate
was tested four times. This patch ensures that each candidate actually
reduced the MSS and is only tested 2 times. This reduces the time window
of missclassifying a temporary outage as an MTU issue.

Reviewed by:		jtl
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24308
2020-04-14 16:35:05 +00:00
Andrew Gallatin
23feb56348 KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213
2020-04-14 14:46:06 +00:00
Alexander V. Chernikov
6722086045 Plug netmask NULL check during route addition causing kernel panic.
This bug was introduced by the r359823.

Reported by:	hselasky
2020-04-14 13:12:22 +00:00
Kristof Provost
1d126e9b94 carp: Widen epoch coverage
Fix panics related to calling code which expects to be running inside
the NET_EPOCH from outside that epoch.

This leads to panics (with INVARIANTS) such as this one:

    panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/if_ether.c:373
    cpuid = 7
    time = 1586095719
    KDB: stack backtrace:
    db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0090819700
    vpanic() at vpanic+0x182/frame 0xfffffe0090819750
    panic() at panic+0x43/frame 0xfffffe00908197b0
    arprequest_internal() at arprequest_internal+0x59e/frame 0xfffffe00908198c0
    arp_announce_ifaddr() at arp_announce_ifaddr+0x20/frame 0xfffffe00908198e0
    carp_master_down_locked() at carp_master_down_locked+0x10d/frame 0xfffffe0090819910
    carp_master_down() at carp_master_down+0x79/frame 0xfffffe0090819940
    softclock_call_cc() at softclock_call_cc+0x13f/frame 0xfffffe00908199f0
    softclock() at softclock+0x7c/frame 0xfffffe0090819a20
    ithread_loop() at ithread_loop+0x279/frame 0xfffffe0090819ab0
    fork_exit() at fork_exit+0x80/frame 0xfffffe0090819af0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0090819af0
    --- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Widen the NET_EPOCH to cover the relevant (callback / task) code.

Differential Revision:	https://reviews.freebsd.org/D24302
2020-04-12 16:09:21 +00:00
Alexander V. Chernikov
a666325282 Introduce nexthop objects and new routing KPI.
This is the foundational change for the routing subsytem rearchitecture.
 More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces concept of nexthop objects and new nexthop-based
 routing KPI.

Nexthops are objects, containing all necessary information for performing
 the packet output decision. Output interface, mtu, flags, gw address goes
 there. For most of the cases, these objects will serve the same role as
 the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
 multiple BGP full-views, as these objects will be shared between routing
 entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 function are intended to replace all all flavours of
 <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions  and the previous
 fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
 exist within current NET_EPOCH. If longer lifetime is desired, one can
 specify NHR_REF as a flag and get a referenced version of the nexthop.
 Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
 inside variety of our firewalls. Their primary goal is to hide the multipath
 implementation details inside the routing subsystem, greatly simplifying
 firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
 embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
 * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
 * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
 decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
 kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by:	ae,glebius(initial version)
Differential Revision:	https://reviews.freebsd.org/D24232
2020-04-12 14:30:00 +00:00
Michael Tuexen
07ddae2822 Revert https://svnweb.freebsd.org/changeset/base/359809
The intended change was
	sp->next.tqe_next = NULL;
	sp->next.tqe_prev = NULL;
which doesn't fix the issue I'm seeing and the committed fix is
not the intended fix due to copy-and-paste.

Thanks a lot to Conrad Meyer for making me aware of the problem.

Reported by:		cem
2020-04-12 09:31:36 +00:00
Michael Tuexen
9803dbb3ea Zero out pointers for consistency.
This was found by running syzkaller on an INVARIANTS kernel.

MFC after:		3 days
2020-04-11 20:36:54 +00:00
Alexander V. Chernikov
4684d3cbcb Remove per-AF radix_mpath initializtion functions.
Split their functionality by moving random seed allocation
 to SYSINIT and calling (new) generic multipath function from
 standard IPv4/IPv5 RIB init handlers.

Differential Revision:	https://reviews.freebsd.org/D24356
2020-04-11 07:37:08 +00:00
Warner Losh
28540ab153 Fix copyright year and eliminate the obsolete all rights reserved line.
Reviewed by: rrs@
2020-04-08 17:55:45 +00:00
Michael Tuexen
f4cb790a35 Do more argument validation under INVARIANTS when starting/stopping
an SCTP timer.

MFC after:		1 week
2020-04-06 13:58:13 +00:00
Alexander V. Chernikov
66bc03d415 Use interface fib for proxyarp checks.
Before the change, proxyarp checks for src and dst addresses
 were performed using default fib, breaking multi-fib scenario.

PR:		245181
Submitted by:	Scott Aitken (original version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24244
2020-04-02 20:06:37 +00:00
Michael Tuexen
413c3db101 Allow the TCP backhole detection to be disabled at all, enabled only
for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6.
The current blackhole detection might classify a temporary outage as
an MTU issue and reduces permanently the MSS. Since the consequences of
such a reduction due to a misclassification are much more drastically
for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only.

Reviewed by:		bcr@ (man page), Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24219
2020-03-31 15:54:54 +00:00
Mark Johnston
9b1d850be8 Remove the "config" taskqgroup and its KPIs.
Equivalent functionality is already provided by taskqueue(9), just use
that instead.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-30 14:24:03 +00:00
Michael Tuexen
9aca687811 Small cleanup by using a variable just assigned.
MFC after:		1 week
2020-03-28 22:35:04 +00:00
Michael Tuexen
25ec355353 Handle integer overflows correctly when converting msecs and secs to
ticks and vice versa.
These issues were caught by recently added panic() calls on INVARIANTS
systems.

Reported by:		syzbot+b44787b4be7096cd1590@syzkaller.appspotmail.com
Reported by:		syzbot+35f82d22805c1e899685@syzkaller.appspotmail.com
MFC after:		1 week
2020-03-28 20:25:45 +00:00
Ed Maste
c012cfe68a sys/netinet: remove spurious doubled ;s 2020-03-27 23:10:18 +00:00
Michael Tuexen
d5d190f2f9 Some more uint32_t cleanups, no functional change.
MFC after:		1 week
2020-03-27 21:48:52 +00:00
Michael Tuexen
239e5865df Use uint32_t where it is expected to be used. No functional change.
MFC after:		1 week
2020-03-27 11:08:11 +00:00
Michael Tuexen
7c63520c42 Remove an optimization, which was incorrect a couple of times and
therefore doesn't seem worth to be there.
In this case COOKIE where not retransmitted anymore, when the
socket was already closed.

MFC after:		1 week
2020-03-25 18:20:37 +00:00
Michael Tuexen
37686ccf08 Improve consistency in debug output.
MFC after:		1 week
2020-03-25 18:14:12 +00:00
Michael Tuexen
24187cfe72 Revert https://svnweb.freebsd.org/changeset/base/357829
This introduces a regression reported by koobs@ when running a pyhton
test suite on a loaded system.

This patch resulted in a failing accept() call, when the association
was setup and gracefully shutdown by the peer before accept was called.
So the following packetdrill script would fail:

+0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.0 bind(3, ..., ...) = 0
+0.0 listen(3, 1) = 0
+0.0 < sctp: INIT[flgs=0, tag=1, a_rwnd=15000, os=1, is=1, tsn=1]
+0.0 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=1, ...]
+0.1 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.0 > sctp: COOKIE_ACK[flgs=0]
+0.0 < sctp: DATA[flgs=BE, len=116, tsn=1, sid=0, ssn=0, ppid=0]
+0.0 > sctp: SACK[flgs=0, cum_tsn=1, a_rwnd=..., gaps=[], dups=[]]
+0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0]
+0.0 > sctp: SHUTDOWN_ACK[flgs=0]
+0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0]
+0.0 accept(3, ..., ...) = 4
+0.0 close(3) = 0
+0.0 recv(4, ..., 4096, 0) = 100
+0.0 recv(4, ..., 4096, 0) = 0
+0.0 close(4) = 0

Reported by:		koops@
2020-03-25 15:29:01 +00:00
Michael Tuexen
23e3c0880d Use consistent debug output.
MFC after:		1 week
2020-03-25 13:19:41 +00:00
Michael Tuexen
e056fafd92 Don't restore the vnet too early in error cases.
MFC after:		1 week
2020-03-25 13:18:37 +00:00
Michael Tuexen
7522682e5e Only call panic when building with INVARIANTS.
MFC after:		1 week
2020-03-24 23:04:07 +00:00
Michael Tuexen
a412576e36 Another cleanup of the timer code. Also be more pedantic about the
parameters of the timer start and stop routines. Several inconsistencies
have been fixed in earlier commits. Now they will be catched when running
an INVARIANTS system.

MFC after:	1 week
2020-03-24 22:44:36 +00:00
Michael Tuexen
d084818d9d Cleanup the file and add two ASSERT variants for locks, which will be
used shortly.

MFC after:		1 week
2020-03-23 12:17:13 +00:00
Michael Tuexen
a57fb68b92 More timer cleanups, no functional change.
MFC after:		1 week
2020-03-21 16:12:19 +00:00
Michael Tuexen
fa8ceba9ca Remove a set, but unused variable.
MFC after:		1 week
2020-03-20 14:49:44 +00:00
Michael Tuexen
2bdebd0ce3 A a missing NET_EPOCH_ENTER/NET_EPOCH_EXIT pair. This was affecting
implicit connection setups via sendmsg().

Reported by:		syzbot+febbe3383a0e9b700c1b@syzkaller.appspotmail.com
Reported by:		syzbot+dca98631455d790223ca@syzkaller.appspotmail.com
Reported by:		syzbot+5a71a7760d6bcf11b8cd@syzkaller.appspotmail.com
Reported by:		syzbot+da64217e140444c49f00@syzkaller.appspotmail.com
2020-03-19 23:07:52 +00:00
Michael Tuexen
6fb7b4fbdb Consistently provide arguments for timer start and stop routines.
This is another step in cleaning up timer handling.
MFC after:		1 week
2020-03-19 21:01:16 +00:00
Michael Tuexen
e95b3d7faf Cleanup the stream reset and asconf timer.
MFC after:		1 week
2020-03-19 18:55:54 +00:00
Michael Tuexen
42078d5ada The MTU candidates MUST be a multiple of 4, so make them so.
MFC after:		1 week
2020-03-19 14:37:28 +00:00
Michael Tuexen
0554e01d8b Handle the timers in a consistent sequence according to the definition
of the timer type.
Just a cleanup, no functional change intended.

MFC after:		1 week
2020-03-17 19:20:12 +00:00
Andrew Gallatin
ee7a9e506e Avoid a cache miss accessing an mbuf ext_pgs pointer when doing SW kTLS.
For a Netflix 90Gb/s 100% TLS software kTLS workload, this reduces
the CPI of tcp_m_copym() from ~3.5 to ~2.5 as reported by vtune.

Reviewed by:	jtl, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23998
2020-03-16 14:03:27 +00:00
Michael Tuexen
7ca6e2963f Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since
these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that
instead of TCPSTAT_ADD.

Reviewed by:		jtl@, rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23904
2020-03-12 15:37:41 +00:00
Andrew Gallatin
98085bae8c make lacp's use_numa hashing aware of send tags
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
     lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
     always called by lacp_select_tx_port()

-   pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

-   adds a numa_domain field to if_snd_tag_alloc_params

-   plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by:	rrs, hselasky, jhb
Sponsored by:	Netflix
2020-03-09 13:44:51 +00:00
Hiroki Sato
d726e6331b Fix an issue of net.inet.igmp.stats handler.
The header of (struct igmpstat) could be cleared by sysctl(3).
This can be reproduced by "netstat -s -z -p igmp".

PR:	244584
MFC after:	1 week
2020-03-07 08:41:10 +00:00
Michael Tuexen
9c04fdfd34 When using automatically generated flow labels and using TCP SYN
cookies, use the same flow label for the segments sent during the
handshake and after the handshake.
This fixes a bug by making sure that sc_flowlabel is always stored in
network byte order.

Reviewed by:		bz@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23957
2020-03-04 16:41:25 +00:00
Bjoern A. Zeeb
d2b8fd0da1 Add new ICMPv6 counters for Anti-DoS limits.
Add four new counters for ND6 related Anti-DoS measures.
We split these out into a separate upfront commit so that we only
change the struct size one time.  Implementations using them will
follow.

PR:		157410
Reviewed by:	melifaro
MFC after:	2 weeks
X-MFC:		cannot really MFC this without breaking netstat
Sponsored by:	Netflix (initially)
Differential Revision:	https://reviews.freebsd.org/D22711
2020-03-04 16:20:59 +00:00
Michael Tuexen
6605e5791f Don't send an uninitilised traffic class in the IPv6 header, when
sending a TCP segment from the TCP SYN cache (like a SYN-ACK).
This fix initialises it to zero. This is correct for the ECN bits,
but is does not honor the DSCP what an application might have set via
the IPPROTO_IPV6 level socket options IPV6_TCLASS. That will be
fixed separately.

Reviewed by:		Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D23900
2020-03-04 12:22:53 +00:00
Bjoern A. Zeeb
4e1a3ff884 tcp_hpts: make RSS kernel compile again.
Add proper #includes, and #ifdefs and some style fixes to make RSS
kernels compile again.  There are still possible issues with uin16_t
vs. uint_t cpuid which I am not going near.

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D23726
2020-03-03 14:15:30 +00:00
Michael Tuexen
7e1e491f60 Remove stale definitions. The removed definitions are not used right
now and are incompatible with the correct ones in RFC 3168.

Submitted by:		Richard Scheffenegger
Differential Revision:	https://reviews.freebsd.org/D23903
2020-03-01 12:34:27 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Randall Stewart
d7313dc6f5 This commit expands tcp_ratelimit to be able to handle cards
like the mlx-c5 and c6 that require a "setup" routine before
the tcp_ratelimit code can declare and use a rate. I add the
setup routine to if_var as well as fix tcp_ratelimit to call it.
I also revisit the rates so that in the case of a mlx card
of type c5/6 we will use about 100 rates concentrated in the range
where the most gain can be had (1-200Mbps). Note that I have
tested these on a c5 and they work and perform well. In fact
in an unloaded system they pace right to the correct rate (great
job mlx!). There will be a further commit here from Hans that
will add the respective changes to the mlx driver to support this
work (which I was testing with).

Sponsored by:	Netflix Inc.
Differential Revision:	ttps://reviews.freebsd.org/D23647
2020-02-26 13:48:33 +00:00
Pawel Biernacki
295a18d184 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (14 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Approved by:	kib (mentor, blanket)
Differential Revision:	https://reviews.freebsd.org/D23639
2020-02-24 10:47:18 +00:00
Pawel Biernacki
10b49b2302 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (6 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

Mark all nodes in pf, pfsync and carp as MPSAFE.

Reviewed by:	kp
Approved by:	kib (mentor, blanket)
Differential Revision:	https://reviews.freebsd.org/D23634
2020-02-21 16:23:00 +00:00
Michael Tuexen
64f29eb1df Remove an unused timer type.
MFC after:		1 week
2020-02-20 15:37:44 +00:00
Michael Tuexen
868b51f234 Epochify SCTP. 2020-02-18 21:25:17 +00:00
Michael Tuexen
ba0d525006 Remove unused function. 2020-02-18 19:41:55 +00:00
Michael Tuexen
a610bb2120 Fix the non-default stream schedulers such that do not interleave
user messages when it is now allowed.

Thanks to Christian Wright for reporting the issue for the userland
stack and providing a fix for the priority scheduler.

MFC after:		1 week
2020-02-17 18:05:03 +00:00
Michael Tuexen
6b8fba3c5c Don't use uninitialised stack memory if the sysctl variable
net.inet.tcp.hostcache.enable is set to 0.
The bug resulted in using possibly a too small MSS value or wrong
initial retransmission timer settings. Possibly the value used
for ssthresh was also wrong.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui, rgrimes@, tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23687
2020-02-17 14:54:21 +00:00
Hans Petter Selasky
bacb11c9ed Fix kernel panic while trying to read multicast stream.
When VIMAGE is enabled make sure the "m_pkthdr.rcvif" pointer is set
for all mbufs being input by the IGMP/MLD6 code. Else there will be a
NULL-pointer dereference in the netisr code when trying to set the
VNET based on the incoming mbuf. Add an assert to catch this when
queueing mbufs on a netisr to make debugging of similar cases easier.

Found by:	Vladislav V. Prodan
PR:		244002
Reviewed by:	bz@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2020-02-17 09:46:32 +00:00
Mateusz Guzik
6b25673f3f sctp: use new capsicum helpers 2020-02-15 01:29:40 +00:00
Michael Tuexen
a357466592 sack_newdata and snd_recover hold the same value. Therefore, use only
a single instance: use snd_recover also where sack_newdata was used.

Submitted by:		Richard Scheffenegger
Differential Revision:	https://reviews.freebsd.org/D18811
2020-02-13 15:14:46 +00:00
Michael Tuexen
33f8cfdfe4 Whitespace cleanup. No functional change.
Sponsored by:		Netflix, Inc.
2020-02-13 13:58:34 +00:00
Michael Tuexen
56ccb48fd6 Don't panic under INVARIANTS when we can't allocate memory for storing
a vtag in time wait.
This issue was found by running syzkaller.

MFC after:		1 week
2020-02-12 17:05:10 +00:00
Michael Tuexen
ca3de626ec Mark the socket as disconnected when freeing the association the first
time.
This issue was found by running syzkaller.

MFC after:		1 week
2020-02-12 17:02:15 +00:00
Randall Stewart
348404bce1 Lets get the real correct version.. gessh. I need
more coffee evidently.

Sponsored by:	Netflix
2020-02-12 15:26:56 +00:00
Randall Stewart
b8f8a6b719 Opps committed the wrong ratelimit version in the
whitespace cleanup.. Restore it to the proper version.

Sponsored by:	Netfilx Inc.
2020-02-12 13:37:53 +00:00
Randall Stewart
481be5de9d White space cleanup -- remove trailing tab's or spaces
from any line.

Sponsored by:	Netflix Inc.
2020-02-12 13:31:36 +00:00
Randall Stewart
df341f5986 Whitespace, remove from three files trailing white
space (leftover presents from emacs).

Sponsored by:	Netflix Inc.
2020-02-12 13:07:09 +00:00
Randall Stewart
596ae436ef This small fix makes it so we properly follow
the RFC and only enable ECN when both the
CWR and ECT bits our set within the SYN packet.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D23645
2020-02-12 13:04:19 +00:00
Randall Stewart
3fba40d9f2 Remove all trailing white space from the BBR/Rack fold. Bits
left around by emacs (thanks emacs).
2020-02-12 12:40:06 +00:00
Randall Stewart
d2517ab04b Now that all of the stats framework is
in FreeBSD the bits that disabled stats
when netflix-stats is not defined is no longer
needed. Lets remove these bits so that we
will properly use stats per its definition
in BBR and Rack.

Sponsored by:	Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D23088
2020-02-12 12:36:55 +00:00
Michael Tuexen
8803350d6d Revert https://svnweb.freebsd.org/changeset/base/357761
This was suggested by cem@
2020-02-11 20:02:20 +00:00
Michael Tuexen
9803f01cdb Don't start an SCTP timer using a net, which has been removed.
Submitted by:		Taylor Brandstetter
MFC after:		1 week
2020-02-11 18:15:57 +00:00
Michael Tuexen
95d27478d2 Use an int instead of a bool variable, since bool is not supported
on all platforms the stack is running on in userland.
2020-02-11 14:00:27 +00:00
Michael Tuexen
6a34ec63ab Stop the PMTU and HB timer when removing a net, not when freeing it.
Submitted by:		Taylor Brandstetter
MFC after:		1 week
2020-02-09 22:40:05 +00:00
Michael Tuexen
5555400aa5 Cleanup timer handling.
Submitted by:	Taylor Brandstetter
MFC after:	1 week
2020-02-09 22:05:41 +00:00
Ed Maste
5aa0576b33 Miscellaneous typo fixes
Submitted by:	Gordon Bergling <gbergling_gmail.com>
Differential Revision:	https://reviews.freebsd.org/D23453
2020-02-07 19:53:07 +00:00
Michael Tuexen
f799ff82fb Remove unused timer.
Submitted by:		Taylor Brandstetter
2020-02-04 14:01:07 +00:00
Michael Tuexen
bbf9f080e9 Improve numbering of debug information.
Submitted by:		Taylor Brandstetter
MFC after:		1 week
2020-02-04 12:34:16 +00:00
Conrad Meyer
8e6b06be14 netinet/libalias: Fix typo in debug message
No functional change.

PR:		243831
Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D23365
2020-02-03 05:19:44 +00:00
Gleb Smirnoff
42ce79378d Fix missing NET_EPOCH_ENTER() when compiled with TCP_OFFLOAD.
Reported by:	Coverity
CID:		1413162
2020-01-29 22:48:18 +00:00
Michael Tuexen
dc13edbc7d Fix build issues for the userland stack on 32-bit platforms.
Reported by:		Felix Weinrank
MFC after:		1 week
2020-01-28 10:09:05 +00:00
Alexander V. Chernikov
75831a1c95 Fix NOINET6 build after r357038.
Reported by:	AN <andy at neu.net>
2020-01-26 11:54:21 +00:00
Michael Tuexen
9cc711c9ff Sending CWR after an RTO is according to RFC 3168 generally required
and not only for the DCTCP congestion control.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes, tuexen@, Cheng Cui
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23119
2020-01-25 13:45:10 +00:00
Michael Tuexen
47e2c17c12 Don't set the ECT codepoint on retransmitted packets during SACK loss
recovery. This is required by RFC 3168.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@, Cheng Cui
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23118
2020-01-25 13:34:29 +00:00
Michael Tuexen
a2d59694be As a TCP client only enable ECN when the corresponding sysctl variable
indicates that ECN should be negotiated for the client side.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D23228
2020-01-25 13:11:14 +00:00
Michael Tuexen
ee97681e5c Don't delay the ACK for a TCP segment with the CWR flag set.
This allows the data sender to increase the CWND faster.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@, Cheng Cui
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D22670
2020-01-24 22:50:23 +00:00