The staleness reported in an error cause is in us, not ms.
Enforce limits on the life time via sysct; and socket options
consistently. Update the description of the sysctl variable to
use the right unit. Also do some minor cleanups.
This also fixes an interger overflow issue if the peer can
modify the cookie. This was reported by Felix Weinrank by fuzz testing
the userland stack and in
https://oss-fuzz.com/testcase-detail/4800394024452096
MFC after: 3 days
It is lightweight way to check if an IPv4 address exists.
Submitted by: Roy Marples
Reviewed by: gnn, melifaro
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26636
When SIOCAIFADDR ioctl configures an IPv4 address that is already exist,
it removes old ifaddr. When this IPv4 address is only one configured on
the interface, this also leads to leaving from AllHosts multicast group.
Then an address is added again, but due to the bug, this doesn't lead
to joining to AllHosts multicast group.
Submitted by: yannis.planus_alstomgroup.com
Reviewed by: gnn
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D26757
It seems that in r354857 I got more than one thing wrong.
Convert the SYSCTL_OPAQUE to a SYSCTL_PROC to properly export the these
days allocated and not longer static per-vnet viftable array.
This fixes a problem with netstat -g which would show bogus information
for the IPv4 Virtual Interface Table.
PR: 246626
Reported by: Ozkan KIRIK (ozkan.kirik gmail.com)
MFC after: 3 days
Consider the currently in-use TCP options when
calculating the amount of new data to be injected during
SACK loss recovery. That addresses the effect that very small
(new) segments could be injected on partial ACKs while
still performing a SACK loss recovery.
Reported by: Liang Tian
Reviewed by: tuexen, chengc_netapp.com
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26446
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.
Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting
Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26409
Extend netstat to display TCP stack and detailed congestion state
Adding the "-c" option used to show detailed per-connection
congestion control state for TCP sessions.
This is one summary patch, which adds the relevant variables into
xtcpcb. As previous "spare" space is used, these changes are ABI
compatible.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26518
if_capabilities is a read-only mask of supported capabilities.
if_capenable is a mask under administrative control via ifconfig(8).
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26690
This change is based on the nexthop objects landed in D24232.
The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.
Simular to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.
With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presense of
NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.
User-visible changes:
The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.
All routes now comes with weight, default weight is 1, maximum is 2^24-1.
Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.
Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1
route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20
netstat -6On
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
1 ------- ------- ------- --------------------------------------- --------- 1
13 10 1 2001:db8::2 vlan2
14 20 2 2001:db8::3 vlan2
Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default
Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449
This avoids setting the association in an inconsistent
state, which could result in a use-after-free situation.
This can be triggered by a malicious peer, if the peer
can modify the cookie without the local endpoint recognizing
it.
Thanks to Ned Williamson for reporting the issue.
MFC after: 3 days
messages using DATA chunks. Don't use fsn_included when not being
sure that it is set to an appropriate value. If the default is
used, which is -1, this can result in SCTP associaitons not
making any user visible progress.
Thanks to Yutaka Takeda for reporting this issue for the the
userland stack in https://github.com/pion/sctp/issues/138.
MFC after: 3 days
The fastpath in tcp_output tries to send out
full segments, and avoid sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider tcp options like timestamps,
while the initial window calculation is using
the correct dynamic tcp_maxseg() function.
Due to this interaction, the last, full size segment
is considered too short and not sent out immediately.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26478
Adjust ssthresh in after_idle to the maximum of
the prior ssthresh, or 3/4 of the prior cwnd. See
RFC2861 section 2 for an in depth explanation for
the rationale around this.
As newreno is the default "fall-through" reaction,
most tcp variants will benefit from this.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D22438
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
that rtentry can contain multiple nexthops.
Differential Revision: https://reviews.freebsd.org/D26497
Move the initialization of these variables to the beginning of their
respective functions.
On our end this creates a small amount of unneeded churn, as these
variables are properly initialized before their first use in all cases.
However, changing this benefits at least one downstream consumer
(NetApp) by allowing local and future modifications to these functions
to be made without worrying about where the initialization occurs.
Reviewed by: melifaro, rscheff
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26454
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.
A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.
On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.
An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.
On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.
Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.
Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.
Reviewed by: kib@
Relnotes: Yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25873
This will be used by some upcoming changes to if_vxlan(4). RFC 7348 (VXLAN)
says that the UDP checksum "SHOULD be transmitted as zero. When a packet is
received with a UDP checksum of zero, it MUST be accepted for decapsulation."
But the original IPv6 RFCs did not allow zero UDP checksum. RFC 6935 attempts
to resolve this.
Reviewed by: kib@
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25873
During the DCTCP improvements, use of CCF_DELACK was
removed. This change is just to rename the unused flag
bit to prevent use of it, without also re-implementing
the tcp_input and tcp_output interfaces.
No functional change.
Reviewed by: chengc_netapp.com, tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26181
stacks with a SENT_FIN outstanding. Both rack and bbr will only send a
FIN if all data is ack'd so this must be enforced. Also if the previous stack
sent the FIN we need to make sure in rack that when we manufacture the
"unknown" sends that we include the proper HAS_FIN bits.
Note for BBR we take a simpler approach and just refuse to switch.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D26269
bbr_log_type_hrdwtso() is a file local static unused function.
Remove it to avoid warnings on kernel compiles.
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D26331
No functional changes.
net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
of functions and structures between the routing subsystem components. At
that time route_var.h was included by many files external to the routing
subsystem, which largerly defeats its purpose.
As currently this is not the case anymore and amount of route_var.h includes
is roughly the same as shared.h, retire the latter in favour of the former.
of acked bytes as described in Section 2.2 of that document.
This patch ensures that this limit is not also applied in congestion
avoidance. Applying this limit also in congestion avoidance can result in
using less bandwidth than allowed.
Reported by: l.tian.email@gmail.com
Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26120
Remove most special treatment for ifnet TLS in the TCP stack, except
for code to avoid mixing handshakes and bulk data.
This code made heroic efforts to send down entire TLS records to
NICs. It was added to improve the PCIe bus efficiency of older TLS
offload NICs which did not keep state per-session, and so would need
to re-DMA the first part(s) of a TLS record if a TLS record was sent
in multiple TCP packets or TSOs. Newer TLS offload NICs do not need
this feature.
At Netflix, we've run extensive QoE tests which show that this feature
reduces client quality metrics, presumably because the effort to send
TLS records atomically causes the server to both wait too long to send
data (leading to buffers running dry), and to send too much data at
once (leading to packet loss).
Reviewed by: hselasky, jhb, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26103
Since cubic calculates cwnd based on absolute
time, retaining RFC3465 (ABC) once-per-window updates
can lead to dramatic changes of cwnd in the convex
region. Updating cwnd for each incoming ack minimizes
this delta, preventing unintentional line-rate bursts.
Reviewed by: chengc_netapp.com, tuexen (mentor)
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26060
Remove unused arguments from dom_rtattach/dom_rtdetach functions and make
them return/accept 'struct rib_head' instead of 'void **'.
Declare inet/inet6 implementations in the relevant _var.h headers similar
to domifattach / domifdetach.
Add rib_subscribe_internal() function to accept subscriptions to the rnh
directly.
Differential Revision: https://reviews.freebsd.org/D26053
The Cubic concave region was not aligned nicely for the very first exit from
slow start, where a 50% cwnd reduction is done instead of the normal 30%.
This addresses an issue, where a short line-rate burst could result from that
sudden jump of cwnd.
In addition, the Fast Convergence Heuristic has been expanded to work also
with ECN induced congestion response.
Submitted by: chengc_netapp.com
Reported by: chengc_netapp.com
Reviewed by: tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D25976
Initializing K to zero in D23655 introduced a miscalculation,
where cwnd would suddenly jump to cwnd_max instead of gradually
increasing, after leaving slow-start.
Properly calculating K instead of resetting it to zero resolves
this issue. Also making sure, that cwnd is recalculated at the
earliest opportunity once slow-start is over.
Reported by: chengc_netapp.com
Reviewed by: chengc_netapp.com, tuexen (mentor), rgrimes (mentor)
Approved by: tuexen (mentor), rgrimes (mentor)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D25746
Adding proper accounting of sacked_bytes and (per-ACK)
delivered data to the SACK scoreboard. This will
allow more aspects of RFC6675 to be implemented as well
as Proportional Rate Reduction (RFC6937).
Prior to this change, the pipe calculation controlled with
net.inet.tcp.rfc6675_pipe was also susceptible to incorrect
results when more than 3 (or 4) holes in the sequence space
were present, which can no longer all fit into a single
ACK's SACK option.
Reviewed by: kbowling, rgrimes (mentor)
Approved by: rgrimes (mentor, blanket)
MFC after: 3 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D18624
* Let the accepted TCP/IPv4 socket inherit the configured TTL and
TOS value.
* Let the accepted TCP/IPv6 socket inherit the configured Hop Limit.
* Use the configured Hop Limit and Traffic Class when sending
IPv6 packets.
Reviewed by: rrs, lutz_donnerhacke.de
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D25909