There have been many changes to rack over the last couple of years, including:
a) Ability when switching stacks to have one stack query another.
b) Internal use of micro-second timers instead of ticks.
c) Many changes to pacing in forms of
1) Improvements to Dynamic Goodput Pacing (DGP)
2) Improvements to fixed rate paciing
3) A new feature called hybrid pacing where the requestor can
get a combination of DGP and fixed rate pacing with deadlines
for delivery that can dynamically speed things up.
d) All kinds of bugs found during extensive testing and use of the
rack stack for streaming video and in fact all data transferred
by NF
Reviewed by: glebius, gallatin, tuexen
Sponsored By: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D39402
So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).
Once all this lands I will then update rack to begin using all these new features.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
Get/set commands can now choose to provide the interface name rather
than the interface index. This allows userspace to avoid a call to
if_nametoindex().
Suggested by: melifaro
Reviewed by: melifaro
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39359
Previously, SACK rescue retransmissions would only happen
on a loss recovery at the tail end of the send buffer.
This extends the mechanism such that partial ACKs without SACK
mid-stream also trigger a rescue retransmission to try avoid
an otherwise unavoidable retransmission timeout.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39274
Use the same counter that ip_input()/ip6_input() use for bad destination
address. For IPv6 this is already heavily abused ip6s_badscope, which
needs to be split into several separate error counters.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D39234
When we're not in unicast mode we need to change the source MAC address.
The check for this was wrong, because IN_MULTICAST() assumes host
endianness and the address in sc_carpaddr is in network endianness.
Sponsored by: Rubicon Communications, LLC ("Netgate")
This change does the following:
Base Netlink KPIs (ability to register the family, parse and/or
write a Netlink message) are always present in the kernel. Specifically,
* Implementation of genetlink family/group registration/removal,
some base accessors (netlink_generic_kpi.c, 260 LoC) are compiled in
unconditionally.
* Basic TLV parser functions (netlink_message_parser.c, 507 LoC) are
compiled in unconditionally.
* Glue functions (netlink<>rtsock), malloc/core sysctl definitions
(netlink_glue.c, 259 LoC) are compiled in unconditionally.
* The rest of the KPI _functions_ are defined in the netlink_glue.c,
but their implementation calls a pointer to either the stub function
or the actual function, depending on whether the module is loaded or not.
This approach allows to have only 1k LoC out of ~3.7k LoC (current
sys/netlink implementation) in the kernel, which will not grow further.
It also allows for the generic netlink kernel customers to load
successfully without requiring Netlink module and operate correctly
once Netlink module is loaded.
Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D39269
LRO bypasses normal ip_input()/tcp_input() and lacks several checks
that are present in the normal path. Without these checks, it
is possible to trigger assertions added in b0ccf53f24
Reviewed by: glebius, rrs
Sponsored by: Netflix
Allow users to configure the address to send carp messages to. This
allows carp to be used in unicast mode, which is useful in certain
virtual configurations (e.g. AWS, VMWare ESXi, ...)
Reviewed by: melifaro
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D38940
Allow carp configuration information to be supplied and retrieved via
netlink.
Reviewed by: melifaro
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39048
The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.
Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831
When receiving a cookie, the receiver does not know whether the
peer retransmitted the COOKIE-ECHO chunk or not. Therefore, don't
do an RTT measurement. It might be much too long.
To overcome this limitation, one could do at least two things:
1. Bundle the INIT-ACK chunk with a HEARTBEAT chunk for doing the
RTT measurement. But this is not allowed.
2. Add a flag to the COOKIE-ECHO chunk, which indicates that it
is the initial transmission, and not a retransmission. But
this requires an RFC.
MFC after: 1 week
Enforce consistency between announcing 0-cksum support and actually
using it in the association. The value from the inp when the
INIT ACK is sent must be used, not the one from the inp when the
cookie is received.
these functions exclusively return (0) and (1), so convert them to bool
We also convert some networking related jail functions from int to bool
some of which were returning an error that was never used.
Differential Revision: https://reviews.freebsd.org/D29659
Reviewed by: imp, jamie (earlier version)
Pull Request: https://github.com/freebsd/freebsd-src/pull/663
The assertions added in commit b0ccf53f24 ("inpcb: Assert against
wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol
layers may pass the unspecified address to in_pcblookup().
Add some checks to filter out such packets before we attempt an inpcb
lookup:
- Disallow the use of an unspecified source address in in_pcbladdr() and
in6_pcbladdr().
- Disallow IP packets with an unspecified destination address.
- Disallow TCP packets with an unspecified source address, and add an
assertion to verify the comment claiming that the case of an
unspecified destination address is handled by the IP layer.
Reported by: syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com
Reported by: syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com
Reported by: syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38570
If there is no source filter entry => block if that's SSM ("exclude"
mode per RFC 3678 clause 3). If there is an entry => check its action &
block if the action is "exclude".
It would be nice if the test case in this PR were converted into an ATF
test case, but not blocking on that.
Reviewed by: imp, melifaro
Pull Request: https://github.com/freebsd/freebsd-src/pull/601
It has no effect, and an exp-run revealed that it is not in use.
PR: 261398 (exp-run)
Reviewed by: mjg, glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38822
This option was added in commit 0a100a6f1e but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.
The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree. Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient. So, let's remove it.
PR: 261398 (exp-run)
Reviewed by: glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38574
There is rampant inconsistent formatting all around, make it mostly
style(9)-conformant.
While here:
- drop malloc casts
- rename a rw lock from mroute_mtx to mroute_lock
- replace NOTREACHED comment with __assert_unreachable
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D38652
tcp_trace was implemented in tcp_debug.c, which was removed recently.
Reviewed by: rscheff@, zlei@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38712
Extend the BBLog RTO event to deal with all timers of the base
stack. Also provide information about starting, stopping, and
running off. The expiration of the retransmission timer is
reported as it was done before.
Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38710
Rearrange the enum tt_which such that TT_REXMIT is 0. This allows
an extension of the BBLog event RTO in a backwards compatible way.
Remove tcptimers, which was only used in trpt, a utility removed
from the source tree recently.
Reviewed by: glebius@, guest-ccui@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38547
This information was available in trpt and is useful. So provide
a way to get this information via TCP BBLog.
Reviewed by: rscheff@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38701
This allows the addition of entries to tcp_log_events without
causing conflicts in the Netflix tree.
rrs@ will upstream the related functional changes eventually.
Reviewed by: guest-ccui@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38646
When the checks for INP_TIMEWAIT were removed, tcp_usr_close() and
tcp_usr_disconnect() were no longer prevented from calling
tcp_disconnect() on a socket that was already disconnected. This
triggered a panic in cxgbe(4) for TOE where the tcp_disconnect() on an
already-disconnected socket invoked tcp_output() on a socket that was
already in time-wait.
Reviewed by: rrs, np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37112
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_bind method and from there on go down the call
stack with family specific argument.
Reviewed by: zlei, melifaro, markj
Differential Revision: https://reviews.freebsd.org/D38601
The 0b70e3e78b changed the original design of a single entry point
into pfil(9) chains providing separate functions for the filtering
points that always provide mbufs and know the direction of a flow.
The motivation was to reduce branching. The logical continuation
would be to do the same for the filtering points that always provide
a memory pointer and retire the single entry point.
o Hooks now provide two functions: one for mbufs and optional for
memory pointers.
o pfil_hook_args() has a new member and pfil_add_hook() has a
requirement to zero out uninitialized data. Bump PFIL_VERSION.
o As it was before, a hook function for a memory pointer may realloc
into an mbuf. Such mbuf would be returned via a pointer that must
be provided in argument.
o The only hook that supports memory pointers is ipfw:default-link.
It is rewritten to provide two functions.
o All remaining uses of pfil_run_hooks() are converted to
pfil_mem_in().
o Transparent union of pfil_packet_t and tricks to fix pointer
alignment are retired. Internal pfil_realloc() reduces down to
m_devget() and thus is retired, too.
Reviewed by: mjg, ocochard
Differential revision: https://reviews.freebsd.org/D37977
soconnectat() tries to ensure that one cannot connect a connected
socket. However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.
Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().
Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38507
tcp6_connect() is always called in a net_epoch read section.
Fixes: 3d76be28ec ("netinet6: require network epoch for in6_pcbconnect()")
Reviewed by: tuexen, glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38506
Split the in_pcblookup_hash_locked() function into several independent
subroutine calls, each of which does some kind of hash table lookup.
This refactoring makes it easier to introduce variants of the lookup
algorithm that behave differently depending on whether they are
synchronized by SMR or the PCB database hash lock.
While here, do some related cleanup:
- Remove an unused ifnet parameter from internal functions. Keep it in
external functions so that it can be used in the future to derive a v6
scopeid.
- Reorder the parameters to in_pcblookup_lbgroup() to be consistent with
the other lookup functions.
- Remove an always-true check from in_pcblookup_lbgroup(): we can assume
that we're performing a wildcard match.
No functional change intended.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D38364
This allows us to avoid spurious calls to ktls_disable_ifnet()
When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.
This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.
While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.
This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.
Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380