Commit Graph

6756 Commits

Author SHA1 Message Date
Michael Tuexen
283c76c7c3 RFC 7323 specifies that:
* TCP segments without timestamps should be dropped when support for
  the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
  for the timestamp option has not been negotiated.
This patch enforces the above.

PR:			250499
Reviewed by:		gnn, rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc
Differential Revision:	https://reviews.freebsd.org/D27148
2020-11-09 21:49:40 +00:00
Michael Tuexen
e597bae4ee Fix a potential use-after-free bug introduced in
https://svnweb.freebsd.org/changeset/base/363046

Thanks to Taylor Brandstetter for finding this issue using fuzz testing
and reporting it in https://github.com/sctplab/usrsctp/issues/547
2020-11-09 13:12:07 +00:00
Mitchell Horne
b02c4e5c78 igmp: convert igmpstat to use PCPU counters
Currently there is no locking done to protect this structure. It is
likely okay due to the low-volume nature of IGMP, but allows for
the possibility of underflow. This appears to be one of the only
holdouts of the conversion to counter(9) which was done for most
protocol stat structures around 2013.

This also updates the visibility of this stats structure so that it can
be consumed from elsewhere in the kernel, consistent with the vast
majority of VNET_PCPUSTAT structures.

Reviewed by:	kp
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27023
2020-11-08 18:49:23 +00:00
Richard Scheffenegger
4d0770f172 Prevent premature SACK block transmission during loss recovery
Under specific conditions, a window update can be sent with
outdated SACK information. Some clients react to this by
subsequently delaying loss recovery, making TCP perform very
poorly.

Reported by:	chengc_netapp.com
Reviewed by:	rrs, jtl
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D24237
2020-11-08 18:47:05 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
John Baldwin
98d7a8d9cd Call m_snd_tag_rele() to free send tags.
Send tags are refcounted and if_snd_tag_free() is called by
m_snd_tag_rele() when the last reference is dropped on a send tag.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26995
2020-10-29 22:18:56 +00:00
John Baldwin
7552deb2a0 Remove an extra if_ref().
In r348254, if_snd_tag_alloc() routines were changed to bump the ifp
refcount via m_snd_tag_init().  This function wasn't in the tree at
the time and wasn't updated for the new semantics, so was still doing
a separate bump after if_snd_tag_alloc() returned.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26999
2020-10-29 22:16:59 +00:00
John Baldwin
aebfdc1fec Store the new send tag in the right place.
r350501 added the 'st' parameter, but did not pass it down to
if_snd_tag_alloc().

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26997
2020-10-29 22:14:34 +00:00
John Baldwin
521eac97f3 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2020-10-29 00:23:16 +00:00
John Baldwin
ce39811544 Save the current TCP pacing rate in t_pacing_rate.
Reviewed by:	gallatin, gnn
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26875
2020-10-29 00:03:19 +00:00
Richard Scheffenegger
3767427354 TCP Cubic: improve reaction to (and rollback from) RTO
1. fix compliancy issue of CUBIC RTO handling according to RFC8312 section 4.7
2. add CUBIC CC_RTO_ERR handling

Submitted by:	chengc_netapp.com
Reviewed by:	rrs, tuexen, rscheff
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26808
2020-10-24 16:11:46 +00:00
Richard Scheffenegger
39a12f0178 tcp: move cwnd and ssthresh updates into cc modules
This will pave the way of setting ssthresh differently in TCP CUBIC, according
to RFC8312 section 4.7.

No functional change, only code movement.

Submitted by:	chengc_netapp.com
Reviewed by:	rrs, tuexen, rscheff
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26807
2020-10-24 16:09:18 +00:00
Mark Johnston
4caea9b169 icmp6: Count packets dropped due to an invalid hop limit
Pad the icmp6stat structure so that we can add more counters in the
future without breaking compatibility again, last done in r358620.
Annotate the rarely executed error paths with __predict_false while
here.

Reviewed by:	bz, melifaro
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26578
2020-10-19 17:07:19 +00:00
Alexander V. Chernikov
0c325f53f1 Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
 and outbound traffic. Current code fills mbuf flowid from inp_flowid
 for connection-oriented sockets. However, inp_flowid is currently
 not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
 for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
 system.

Reviewed by:	glebius (previous version),ae
Differential Revision:	https://reviews.freebsd.org/D26523
2020-10-18 17:15:47 +00:00
Alexander V. Chernikov
fa8b3fcb4c Simplify NET_EPOCH_EXIT in inp_join_group().
Suggested by:	kib
2020-10-18 12:03:36 +00:00
Alexander V. Chernikov
337418adf1 Fix sleepq_add panic happening with too wide net epoch in mcast control.
PR:		250413
Reported by:	Christopher Hall <hsw at bitmark.com>
Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D26827
2020-10-17 20:33:09 +00:00
Michael Tuexen
a92d501617 Improve the handling of cookie life times.
The staleness reported in an error cause is in us, not ms.
Enforce limits on the life time via sysct; and socket options
consistently. Update the description of the sysctl variable to
use the right unit. Also do some minor cleanups.
This also fixes an interger overflow issue if the peer can
modify the cookie. This was reported by Felix Weinrank by fuzz testing
the userland stack and in
https://oss-fuzz.com/testcase-detail/4800394024452096

MFC after:		3 days
2020-10-16 10:44:48 +00:00
Andrey V. Elsukov
6952c3e1ac Implement SIOCGIFALIAS.
It is lightweight way to check if an IPv4 address exists.

Submitted by:	Roy Marples
Reviewed by:	gnn, melifaro
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26636
2020-10-14 09:22:54 +00:00
Andrey V. Elsukov
3f740d4393 Join to AllHosts multicast group again when adding an existing IPv4 address.
When SIOCAIFADDR ioctl configures an IPv4 address that is already exist,
it removes old ifaddr. When this IPv4 address is only one configured on
the interface, this also leads to leaving from AllHosts multicast group.
Then an address is added again, but due to the bug, this doesn't lead
to joining to AllHosts multicast group.

Submitted by:	yannis.planus_alstomgroup.com
Reviewed by:	gnn
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D26757
2020-10-13 19:34:36 +00:00
Bjoern A. Zeeb
506512b170 ip_mroute: fix the viftable export sysctl
It seems that in r354857 I got more than one thing wrong.
Convert the SYSCTL_OPAQUE to a SYSCTL_PROC to properly export the these
days allocated and not longer static per-vnet viftable array.
This fixes a problem with netstat -g which would show bogus information
for the IPv4 Virtual Interface Table.

PR:		246626
Reported by:	Ozkan KIRIK (ozkan.kirik gmail.com)
MFC after:	3 days
2020-10-11 00:01:00 +00:00
Richard Scheffenegger
4b72ae16ed Stop sending tiny new data segments during SACK recovery
Consider the currently in-use TCP options when
calculating the amount of new data to be injected during
SACK loss recovery. That addresses the effect that very small
(new) segments could be injected on partial ACKs while
still performing a SACK loss recovery.

Reported by:	Liang Tian
Reviewed by:	tuexen, chengc_netapp.com
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26446
2020-10-09 12:44:56 +00:00
Richard Scheffenegger
868aabb470 Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409
2020-10-09 12:06:43 +00:00
Richard Scheffenegger
5432120028 Extend netstat to display TCP stack and detailed congestion state (2)
Extend netstat to display TCP stack and detailed congestion state

Adding the "-c" option used to show detailed per-connection
congestion control state for TCP sessions.

This is one summary patch, which adds the relevant variables into
xtcpcb. As previous "spare" space is used, these changes are ABI
compatible.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26518
2020-10-09 10:55:19 +00:00
Michael Tuexen
e7a39b856a Minor cleanups.
MFC after:		3 days
2020-10-07 15:22:48 +00:00
John Baldwin
9aed26b906 Check if_capenable, not if_capabilities when enabling rate limiting.
if_capabilities is a read-only mask of supported capabilities.
if_capenable is a mask under administrative control via ifconfig(8).

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26690
2020-10-06 18:02:33 +00:00
Michael Tuexen
6f155d690b Reset delayed SACK state when restarting an SCTP association.
MFC after:		3 days
2020-10-06 14:26:05 +00:00
Michael Tuexen
b954d81662 Ensure variables are initialized before used.
MFC after:		3 days
2020-10-06 11:29:08 +00:00
Michael Tuexen
6176f9d6df Remove dead stores reported by clang static code analysis
MFC after:		3 days
2020-10-06 11:08:52 +00:00
Michael Tuexen
11daa73adc Cleanup, no functional change intended.
MFC after:		3 days
2020-10-06 10:41:04 +00:00
Michael Tuexen
c8e55b3c0c Whitespace changes.
MFC after:		3 days
2020-10-06 09:51:40 +00:00
Michael Tuexen
9f2d6263bb Use __func__ instead of __FUNCTION__ for consistency.
MFC after:		3 days
2020-10-04 15:37:34 +00:00
Michael Tuexen
d0ed75b3b1 Cleanup, no functional change intended.
MFC after:		3 days
2020-10-04 15:22:14 +00:00
Alexander V. Chernikov
fedeb08b6a Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
 relative weights and a dataplane-optimized structure to enable
 efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
 gets compiled during group creation and is basically an array of
 nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
 nexthop or nexthop group. They are distinguished by the presense of
 NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
 should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
 This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx  NhIdx     Weight   Slots                                 Gateway     Netif  Refcnt
1         ------- ------- ------- --------------------------------------- ---------       1
              13      10       1                             2001:db8::2     vlan2
              14      20       2                             2001:db8::3     vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by:	olivier
Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26449
2020-10-03 10:47:17 +00:00
Michael Tuexen
b15f541113 Improve the input validation and processing of cookies.
This avoids setting the association in an inconsistent
state, which could result in a use-after-free situation.
This can be triggered by a malicious peer, if the peer
can modify the cookie without the local endpoint recognizing
it.
Thanks to Ned Williamson for reporting the issue.

MFC after:		3 days
2020-09-29 09:36:06 +00:00
Michael Tuexen
fbc6840bae Minor cleanup.
MFC after:		3 days
2020-09-28 14:11:53 +00:00
Michael Tuexen
1d1b4bce53 Cleanup, no functional change intended.
MFC after:		3 days
2020-09-27 13:32:02 +00:00
Michael Tuexen
8f269b8242 Improve the handling of receiving unordered and unreliable user
messages using DATA chunks. Don't use fsn_included when not being
sure that it is set to an appropriate value. If the default is
used, which is -1, this can result in SCTP associaitons not
making any user visible progress.

Thanks to Yutaka Takeda for reporting this issue for the the
userland stack in https://github.com/pion/sctp/issues/138.

MFC after:		3 days
2020-09-27 13:24:01 +00:00
Richard Scheffenegger
e399566123 TCP: send full initial window when timestamps are in use
The fastpath in tcp_output tries to send out
full segments, and avoid sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider tcp options like timestamps,
while the initial window calculation is using
the correct dynamic tcp_maxseg() function.

Due to this interaction, the last, full size segment
is considered too short and not sent out immediately.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26478
2020-09-25 10:38:19 +00:00
Richard Scheffenegger
1567c937e2 TCP newreno: improve after_idle ssthresh
Adjust ssthresh in after_idle to the maximum of
the prior ssthresh, or 3/4 of the prior cwnd. See
RFC2861 section 2 for an in depth explanation for
the rationale around this.

As newreno is the default "fall-through" reaction,
most tcp variants will benefit from this.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D22438
2020-09-25 10:23:14 +00:00
Michael Tuexen
b6db274d1e Whitespace changes.
MFC after:		3 days
2020-09-24 12:26:06 +00:00
Alexander V. Chernikov
2259a03020 Rework part of routing code to reduce difference to D26449.
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
 as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
 that rtentry can contain multiple nexthops.

Differential Revision:	https://reviews.freebsd.org/D26497
2020-09-21 20:02:26 +00:00
Alexander V. Chernikov
1440f62266 Remove unused nhop_ref_any() function.
Remove "opt_mpath.h" header where not needed.

No functional changes.
2020-09-20 21:32:52 +00:00
Mitchell Horne
374ce2488a Initialize some local variables earlier
Move the initialization of these variables to the beginning of their
respective functions.

On our end this creates a small amount of unneeded churn, as these
variables are properly initialized before their first use in all cases.
However, changing this benefits at least one downstream consumer
(NetApp) by allowing local and future modifications to these functions
to be made without worrying about where the initialization occurs.

Reviewed by:	melifaro, rscheff
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26454
2020-09-18 14:01:10 +00:00
Navdeep Parhar
b092fd6c97 if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by:	kib@
Relnotes:	Yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:37:57 +00:00
Navdeep Parhar
72cc43df17 Add a knob to allow zero UDP checksums for UDP/IPv6 traffic on the given UDP port.
This will be used by some upcoming changes to if_vxlan(4).  RFC 7348 (VXLAN)
says that the UDP checksum "SHOULD be transmitted as zero.  When a packet is
received with a UDP checksum of zero, it MUST be accepted for decapsulation."
But the original IPv6 RFCs did not allow zero UDP checksum.  RFC 6935 attempts
to resolve this.

Reviewed by:	kib@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:21:15 +00:00
Michael Tuexen
42d7560796 Export the name of the congestion control. This will be used by sockstat
and netstat.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26412
2020-09-13 09:06:50 +00:00
Richard Scheffenegger
e74e64a191 cc_mod: remove unused CCF_DELACK definition
During the DCTCP improvements, use of CCF_DELACK was
removed. This change is just to rename the unused flag
bit to prevent use of it, without also re-implementing
the tcp_input and tcp_output interfaces.

No functional change.

Reviewed by:	chengc_netapp.com, tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26181
2020-09-10 00:46:38 +00:00
Randall Stewart
285385ba56 So it turns out that syzkaller hit another crash. It has to do with switching
stacks with a SENT_FIN outstanding. Both rack and bbr will only send a
FIN if all data is ack'd so this must be enforced. Also if the previous stack
sent the FIN we need to make sure in rack that when we manufacture the
"unknown" sends that we include the proper HAS_FIN bits.

Note for BBR we take a simpler approach and just refuse to switch.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D26269
2020-09-09 11:11:50 +00:00
Bjoern A. Zeeb
67d224ef43 bbr: remove unused static function
bbr_log_type_hrdwtso() is a file local static unused function.
Remove it to avoid warnings on kernel compiles.

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D26331
2020-09-05 00:20:32 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00