The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this appoach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.
Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.
With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.
* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this appoach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.
Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.
With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.
* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067
fib[46]_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.
Differential Revision: https://reviews.freebsd.org/D24974
fib6_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24973
fibX_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Use specialized fib[46]_check_urpf() from newer KPI instead,
to allow removal of older KPI.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24978
multipath control plane changed described in D24141.
Currently route.c contains core routing init/teardown functions, route table
manipulation functions and various helper functions, resulting in >2KLOC
file in total. This change moves most of the route table manipulation parts
to a dedicated file, simplifying planned multipath changes and making
route.c more manageable.
Differential Revision: https://reviews.freebsd.org/D24870
Currently the only reason of refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in more efficient way.
The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .
Differential Revision: https://reviews.freebsd.org/D24866
Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.
Reviewed by: tuexen (rscheff, rrs previous version)
MFC after: 1 month
Sponsored by: Forcepoint LLC
Differential Revision: https://reviews.freebsd.org/D24781
If the neighbor entry for an IPv6 TCP session using unmapped
mbufs times out, IPv6 will send an icmp6 dest. unreachable
message. In doing this, it will try to do a software checksum
on the reflected packet. If this is a TCP session using unmapped
mbufs, then there will be a kernel panic.
To fix this, just free packets with unmapped mbufs, rather
than sending the icmp.
Reviewed by: np, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24821
The IP_NO_SND_TAG_RL flag to ip{,6}_output() means that the packets
being sent should bypass hardware rate limiting. This is typically used
by modern TCP stacks for rexmits.
This support was added to IPv4 in r352657, but never added to IPv6, even
though rack and bbr call ip6_output() with this flag.
Reviewed by: rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24822
Back out the IPv6 portion of r360903, as the stamp_tag param
is apparently not supported in upstream FreeBSD.
Sponsored by: Netflix
Pointy hat to: gallatin
The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.
When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped. Failure
to stamp a NIC TLS packet will result in crypto
issues.
Reviewed by: hselasky, rrs
Sponsored by: Netflix, Mellanox
After converting routing subsystem customers to use nexthop objects
defined in r359823, some fields in struct rtentry became unused.
This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
along with the code initializing and updating these fields.
Cleanup of the remaining fields will be addressed by D24669.
This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
nexthop and tries to update rte nhop pointer. Only last part is done under
WLOCK.
In the hypothetical scenarious where multiple rtsock clients
repeatedly issue RTM_CHANGE requests for the same route, route may get
updated between read and update operation. This is addressed by retrying
the operation multiple (3) times before returning failure back to the
caller.
Differential Revision: https://reviews.freebsd.org/D24666
With the upcoming multipath changes described in D24141,
rt->rt_nhop can potentially point to a nexthop group instead of
an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
to pass changed nhop directly.
Differential Revision: https://reviews.freebsd.org/D24604
Nexthop objects implementation, defined in r359823,
introduced sys/net/route directory intended to hold all
routing-related code. Move recently-introduced route_temporal.c and
private route_var.h header there.
Differential Revision: https://reviews.freebsd.org/D24597
One of the goals of the new routing KPI defined in r359823
is to entirely hide`struct rtentry` from the consumers.
It will allow to improve routing subsystem internals and deliver
features much faster.
This is one of the last changes, effectively moving struct rtentry
definition to a net/route_var.h header, internal to the routing subsystem.
Differential Revision: https://reviews.freebsd.org/D24580
r360292 switched most of the remaining routing customers to a new KPI,
leaving a bunch of wrappers for old routing lookup functions unused.
Remove them from the tree as a part of routing cleanup.
Differential Revision: https://reviews.freebsd.org/D24569
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.
Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D24555
It was broken by r360292 as fib6_lookup() assumes de-embedded addresses
while rtalloc_mpath_fib() requires sockaddr with embedded ones.
New fib6_lookup() transparently supports multipath, hence
remove old RADIX_MPATH condition.
This change is build on top of nexthop objects introduced in r359823.
Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.
Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.
Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.
Differential Revision: https://reviews.freebsd.org/D24340
One of the goals of the new routing KPI defined in r359823 is to entirely
hide`struct rtentry` from the consumers. It will allow to improve routing
subsystem internals and deliver more features much faster.
This commit is mostly mechanical change to eliminate direct struct rtentry
field accesses.
The only notable difference is AF_LINK gateway encoding.
AF_LINK gw is used in routing stack for operations with interface routes
and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
datapath to verify address scope for link-local interfaces.
Kernel uses struct sockaddr_dl for this type of gateway. This structure
allows for specifying rich interface data, such as mac address and interface
name. However, this results in relatively large structure size - 52 bytes.
Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside
in the first 8 bytes of the structure.
In the new KPI, struct nhop_object tries to be cache-efficient, hence
embodies gateway address inside the structure. In the AF_LINK case it
stores stortened version of the structure - struct sockaddr_dl_short,
which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
gateway will not be used in the kernel at all, leaving rtsock as the only
potential concern.
The difference in rtsock reporting:
(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0
(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0
Note 40 bytes different (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.
It is worth noting that these particular messages (interface routes) are mostly
ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway
More detailed overview on how rtsock messages are used by the
routing daemons to reconstruct the kernel view, can be found in D22974.
Differential Revision: https://reviews.freebsd.org/D24519
mb_reclaim() calls the protocol drain routines for each protocol in each
domain. Some protocols exist in more than one domain and share drain
routines. In the case of SCTP, it also uses the same drain routine for
its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain.
On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls
sctp_drain() four times. On systems with INET and INET6 defined,
mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree
caller of the pr_drain protocol entry.
Eliminate this duplication by ensuring that each pr_drain routine is only
specified for one protocol entry in one domain.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24418
One of the goals of the new routing KPI defined in r359823 is to
entirely hide`struct rtentry` from the consumers. It will allow to
improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
struct rtentry field accesses.
Additionally, with the followup multipath changes, single rtentry can point
to multiple nexthops.
With that in mind, convert rti_filter callback used when traversing the
routing table to accept pair (rt, nhop) instead of nexthop.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24440
Update ip6_forward() internals to use deembedded IPv6 addresses
to simplify calls to the new KPI and prepare for the future
scope-embedding cleanup.
Add in6_get_unicast_scopeid() and in6_set_unicast_scopeid() scopeid
operation functions tailored for unicast processing.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24334
One of the goals of the new routing KPI defined in r359823 is to entirely hide
`struct rtentry` from the consumers. Doing so will allow to improve routing
subsystem internals and deliver features more easily. This change is one of
the ongoing changes to eliminate direct struct rtentry field accesses.
It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification
code to avoid accessing most of the rtentry fields.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24404
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213
This is the foundational change for the routing subsytem rearchitecture.
More details and goals are available in https://reviews.freebsd.org/D24141 .
This patch introduces concept of nexthop objects and new nexthop-based
routing KPI.
Nexthops are objects, containing all necessary information for performing
the packet output decision. Output interface, mtu, flags, gw address goes
there. For most of the cases, these objects will serve the same role as
the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
multiple BGP full-views, as these objects will be shared between routing
entries. This allows to store more information in the nexthop.
New KPI:
struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
These 2 function are intended to replace all all flavours of
<in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous
fib[46]-generation functions.
Upon successful lookup, they return nexthop object which is guaranteed to
exist within current NET_EPOCH. If longer lifetime is desired, one can
specify NHR_REF as a flag and get a referenced version of the nexthop.
Reference semantic closely resembles rtentry one, allowing sed-style conversion.
Additionally, another 2 functions are introduced to support uRPF functionality
inside variety of our firewalls. Their primary goal is to hide the multipath
implementation details inside the routing subsystem, greatly simplifying
firewalls implementation:
int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
embedding and allowing to support IPv4 link-locals in the future.
Structure changes:
* rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
* rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.
Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.
More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232
Reviewed by: ae,glebius(initial version)
Differential Revision: https://reviews.freebsd.org/D24232
Split their functionality by moving random seed allocation
to SYSINIT and calling (new) generic multipath function from
standard IPv4/IPv5 RIB init handlers.
Differential Revision: https://reviews.freebsd.org/D24356
Interfaces may be detached from a taskqueue_thread task, for example by
prison_complete(), so after r359438, when draining the queue we may end
up deadlocking.
Reported by: Jenkins via lwhsu
MFC with: r359438
The inpcb needs to be locked when we update output packet options.
Otherwise it is possible for the IPV6_2292PKTOPTIONS handler to free
packet option structures while another thread is reading or updating
them.
Note that the option handler is still kind of broken. For instance it
frees all options before performing privilege checks for individual
options. However, this can be fixed separately.
Reported by: syzbot+52eb0fd4ddc119787f9d@syzkaller.appspotmail.com
Reviewed by: bz, tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24125
In r231852 I added in6_selectroute_fib() as a compat function with the
fibnum as an extra argument compared to in6_selectroute() to keep the
KPI stable.
Way too late retire this function again and add the fib to in6_selectroute()
which also only has a single consumer now and was an orphan function before.
Implement the equivalent of r347375 (IPv4) for the IPv6 output path.
In IPv6 we get passed a cached route (and inp) by udp6_output()
depending on whether we acquired a write lock on the INP.
In case we neither bind nor connect a first UDP packet would come in
with a cached route (wlocked) and all further packets would not.
In case we bind and do not connect we never write-lock the inp.
When we do not pass in a cached route, rather than providing the
storage for a route locally and pass it over the old lookup code
and down the stack, use the new route lookup KPI and acquire all
details we need to send the packet.
Compared to the IPv4 code the IPv6 code has a couple of possible
complications: given an option with a routing hdr/caching route there,
and path mtu (ro_pmtu) case which now equally has to deal with the
possibility of having a route which is NULL passed in, and the
fwd_tag in case a firewall changes the next hop (something to
factor out in the future).
Sponsored by: Netflix
Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D23886
Like for IPv4 add nh_ia to the ext interface and return rt_ifa
in order to be used for, e.g., packet/octets accounting purposes.
Reviewed by: melifaro
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23873
In fib6_rte_to_nh_* when returning a link-local gateway address
currently we do clear the scope. That could be recovered using
the ifp returned as well, but the code in general seems to
expect a link-local address with scope embeedded as otherwise
the "dst" (gw) passed to the output routines will not include
scope and not send the packet out (the right interface).
Do not clear the scope when returning a link-local address and
allow packets to go out (the right interface).
Remove the (now) extra scope recovery in the IPv6 fast-fwd code.
Sponsored by: Netflix
Reviewed by: melifaro, ae
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23872
In certain cases (probably not during normal operation but observed in
the lab during development) ip6_ouput() could return without error
and ifpp (&oifp) not updated.
Given oifp was never initialized we would take the later branch
as oifp was not NULL, and when calling icmp6_ifstat_inc() we would
panic dereferencing a garbage pointer.
For code stability initialize oifp to NULL before first use to always
have a deterministic value and not rely on a called function to behave
and always and for ever do the work for us as we hope for.
MFC after: 3 days
Sponsored by: Netflix
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
When moving the calculations for the optlen into the if (opt) block
which deals with possible extension headers I failed to initialise
unfragpartlen to the ipv6 header length if there were no extension
headers present. Correct that mistake to make IPv6 fragment length
calculcations work again.
Reported by: hselasky, kp
OKed by: hselasky, kp
MFC after: 3 days
X-MFC with: r358167
PR: 244393
In two places in ip6_output we are doing (delayed) checksum calculations.
The initial logic came from SCTP in r205075,205104 and later I copied
and adjusted it for the TCP|UDP case in r235958.
The problem was that the original SCTP offsets were already wrong for any
case with extension headers present given IPv6 extension headers are not
part of the pseudo checksum calculations.
The later changes do not help in case there is checksum offloading as for
extension headers (incl. fragments) we do currrently never offload as we
have no infrastructure to know whether the NIC can handle these cases.
Correct the offsets for delayed checksum calculations and properly handle
mbuf flags. In addition harmonize the almost identical duplicate code.
While here eliminate the now unneeded variable hlen and add an always
missing mtod() call in the 1-b and 3 cases after the introduction of
the mb_unmapped_to_ext() calls.
Reported by: Francis Dupont (fdupont isc.org)
PR: 243675
MFC after: 6 days
Reviewed by: markj (earlier version), gallatin
Differential Revision: https://reviews.freebsd.org/D23760
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Approved by: kib (mentor, blanket)
Differential Revision: https://reviews.freebsd.org/D23639