The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this appoach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.
Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.
With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.
* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this appoach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.
Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.
With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.
* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067
multipath control plane changed described in D24141.
Currently route.c contains core routing init/teardown functions, route table
manipulation functions and various helper functions, resulting in >2KLOC
file in total. This change moves most of the route table manipulation parts
to a dedicated file, simplifying planned multipath changes and making
route.c more manageable.
Differential Revision: https://reviews.freebsd.org/D24870
After making rtentry reclamation backed by epoch(9) in r361409, there is
no reason in keeping reference counting code.
Differential Revision: https://reviews.freebsd.org/D24867
Currently the only reason of refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in more efficient way.
The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .
Differential Revision: https://reviews.freebsd.org/D24866
rnh_close callbackes was used by the in[6]_clsroute() handlers,
doing cleanup in the route cloning code. Route cloning was eliminated
somewhere around r186119. Last callback user was eliminated in r186215,
11 years ago.
Differential Revision: https://reviews.freebsd.org/D24793
Last user of rtalloc1() KPI has been eliminated in rS360631.
As kernel is now fully switched to use new routing KPI defined in
rS359823, remove old lookup functions.
Differential Revision: https://reviews.freebsd.org/D24776
Currently each rtentry has dst&gateway allocated separately from another zone,
bloating cache accesses.
Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning,
leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64).
Fields needed for the datapath are destination sockaddr and rt_nhop.
So far it doesn't look like there is other routable addressing protocol other
than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes.
With that in mind, embed dst into struct rtentry, making the first 24 bytes
of rtentry within 128 bytes. That is enough to make IPv6 address within first
128 bytes.
It is still pretty easy to add code for supporting separately-allocated dst,
however it doesn't make a lot of sense in having such code without a use case.
As rS359823 moved the gateway to the nexthop structure, the dst embedding change
removes the need for any additional allocations done by rt_setgate().
Lastly, as a part of cleanup, remove counter(9) allocation code, as this field
is not used in packet processing anymore.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24669
Create rib_lookup() wrapper around per-af dataplane lookup functions.
This will help in the cases of having control plane af-agnostic code.
Switch ifa_ifwithroute() to use this function instead of rtalloc1().
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24731
After converting routing subsystem customers to use nexthop objects
defined in r359823, some fields in struct rtentry became unused.
This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
along with the code initializing and updating these fields.
Cleanup of the remaining fields will be addressed by D24669.
This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
nexthop and tries to update rte nhop pointer. Only last part is done under
WLOCK.
In the hypothetical scenarious where multiple rtsock clients
repeatedly issue RTM_CHANGE requests for the same route, route may get
updated between read and update operation. This is addressed by retrying
the operation multiple (3) times before returning failure back to the
caller.
Differential Revision: https://reviews.freebsd.org/D24666
Continue routing subsystem conversion to nhop objects defined in r359823.
Use fields from nhop structure instead of "struct rtentry" fields.
This is one of the last changes prior to removing rt_ifp, rt_ifa,
rt_gateway and rt_mtu from struct rtentry.
Differential Revision: https://reviews.freebsd.org/D24609
With the upcoming multipath changes described in D24141,
rt->rt_nhop can potentially point to a nexthop group instead of
an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
to pass changed nhop directly.
Differential Revision: https://reviews.freebsd.org/D24604
Nexthop objects implementation, defined in r359823,
introduced sys/net/route directory intended to hold all
routing-related code. Move recently-introduced route_temporal.c and
private route_var.h header there.
Differential Revision: https://reviews.freebsd.org/D24597
r360292 switched most of the remaining routing customers to a new KPI,
leaving a bunch of wrappers for old routing lookup functions unused.
Remove them from the tree as a part of routing cleanup.
Differential Revision: https://reviews.freebsd.org/D24569
This change is build on top of nexthop objects introduced in r359823.
Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.
Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.
Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.
Differential Revision: https://reviews.freebsd.org/D24340
One of the goals of the new routing KPI defined in r359823 is to entirely
hide`struct rtentry` from the consumers. It will allow to improve routing
subsystem internals and deliver more features much faster.
This commit is mostly mechanical change to eliminate direct struct rtentry
field accesses.
The only notable difference is AF_LINK gateway encoding.
AF_LINK gw is used in routing stack for operations with interface routes
and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
datapath to verify address scope for link-local interfaces.
Kernel uses struct sockaddr_dl for this type of gateway. This structure
allows for specifying rich interface data, such as mac address and interface
name. However, this results in relatively large structure size - 52 bytes.
Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside
in the first 8 bytes of the structure.
In the new KPI, struct nhop_object tries to be cache-efficient, hence
embodies gateway address inside the structure. In the AF_LINK case it
stores stortened version of the structure - struct sockaddr_dl_short,
which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
gateway will not be used in the kernel at all, leaving rtsock as the only
potential concern.
The difference in rtsock reporting:
(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0
(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0
Note 40 bytes different (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.
It is worth noting that these particular messages (interface routes) are mostly
ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway
More detailed overview on how rtsock messages are used by the
routing daemons to reconstruct the kernel view, can be found in D22974.
Differential Revision: https://reviews.freebsd.org/D24519
One of the goals of the new routing KPI defined in r359823 is to
entirely hide`struct rtentry` from the consumers. It will allow to
improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
struct rtentry field accesses.
Additionally, with the followup multipath changes, single rtentry can point
to multiple nexthops.
With that in mind, convert rti_filter callback used when traversing the
routing table to accept pair (rt, nhop) instead of nexthop.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24440
One of the goals of the new routing KPI defined in r359823 is to entirely hide
`struct rtentry` from the consumers. Doing so will allow to improve routing
subsystem internals and deliver features more easily. This change is one of
the ongoing changes to eliminate direct struct rtentry field accesses.
It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification
code to avoid accessing most of the rtentry fields.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24404
This is the foundational change for the routing subsytem rearchitecture.
More details and goals are available in https://reviews.freebsd.org/D24141 .
This patch introduces concept of nexthop objects and new nexthop-based
routing KPI.
Nexthops are objects, containing all necessary information for performing
the packet output decision. Output interface, mtu, flags, gw address goes
there. For most of the cases, these objects will serve the same role as
the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
multiple BGP full-views, as these objects will be shared between routing
entries. This allows to store more information in the nexthop.
New KPI:
struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
uint32_t scopeid, uint32_t flags, uint32_t flowid);
These 2 function are intended to replace all all flavours of
<in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous
fib[46]-generation functions.
Upon successful lookup, they return nexthop object which is guaranteed to
exist within current NET_EPOCH. If longer lifetime is desired, one can
specify NHR_REF as a flag and get a referenced version of the nexthop.
Reference semantic closely resembles rtentry one, allowing sed-style conversion.
Additionally, another 2 functions are introduced to support uRPF functionality
inside variety of our firewalls. Their primary goal is to hide the multipath
implementation details inside the routing subsystem, greatly simplifying
firewalls implementation:
int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
uint32_t flags, const struct ifnet *src_if);
All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
embedding and allowing to support IPv4 link-locals in the future.
Structure changes:
* rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
* rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.
Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.
More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232
Reviewed by: ae,glebius(initial version)
Differential Revision: https://reviews.freebsd.org/D24232
No functional changes.
* Move route addition / route deletion code from rtrequest1_fib()
to add_route() and del_route() respectively.
* Rename rtrequest1_fib_change() to change_route() for consistency.
* Shrink the scope of ugly info #defines.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24349
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
Redirect (and temporal) route expiration was broken a while ago.
This change brings route expiration back, with unified IPv4/IPv6 handling code.
It introduces net.inet.icmp.redirtimeout sysctl, allowing to set
an expiration time for redirected routes. It defaults to 10 minutes,
analogues with net.inet6.icmp6.redirtimeout.
Implementation uses separate file, route_temporal.c, as route.c is already
bloated with tons of different functions.
Internally, expiration is implemented as an per-rnh callout scheduled when
route with non-zero rt_expire time is added or rt_expire is changed.
It does not add any overhead when no temporal routes are present.
Callout traverses entire routing tree under wlock, scheduling expired routes
for deletion and calculating the next time it needs to be run. The rationale
for such implemention is the following: typically workloads requiring large
amount of routes have redirects turned off already, while the systems with
small amount of routes will not inhibit large overhead during tree traversal.
This changes also fixes netstat -rn display of route expiration time, which
has been broken since the conversion from kread() to sysctl.
Reviewed by: bz
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D23075
Having metadata such as fibnum or vnet in the struct rib_head
is handy as it eases building functionality in the routing space.
This change is required to properly bring back route redirect support.
Reviewed by: bz
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D23047
- Only take an ifaddr ref in in rt_exportinfo() if the caller explicitly
requests it. Take care to release it in this case.
- Don't unconditionally take a ref in rtrequest1_fib(). rt_getifa_fib()
will acquire a reference, in which case we would previously acquire
two references.
- Stop taking a reference in rtinit1() before calling rtrequest1_fib().
rtrequest1_fib() will acquire a reference for the RTM_ADD case.
PR: 242746
Reviewed by: melifaro (previous version)
Tested by: ghuckriede@blackberry.com
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22912
same after the network stack was epochified. Merge the two into one function
and cleanup all uses of ifnet_byindex_locked().
While at it:
- Add branch prediction macros.
- Make sure the ifnet pointer is only deferred once,
also when code optimisation is disabled.
Sponsored by: Mellanox Technologies
instead of calling an SCTP specific function from the IP code.
This is a requirement of supporting SCTP as a kernel loadable module.
This patch was developed by markj@, I tweaked a bit the SCTP related
code.
Submitted by: markj@
MFC after: 3 days
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
Currently rinit1() and its IPv6 counterpart
nd6_prefix_onlink_rtrequest() uses dummy null_sdl gateway address
during route insertion and change it afterwards. This behaviour
brings complications to the routing stack and the users of its
upcoming notification system.
This change fixes both rinit1() and nd6_prefix_onlink_rtrequest()
by filling in proper gateway in the beginning. It does not change any
of the userland notifications as in both cases, they happen after
the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()).
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20328
Currently such routes are added with a link-level IFA, which is
plain wrong. Only after the insertion they get fixed by the special
link_rtrequest() ifa handler. This behaviour complicates routing code
and makes ifa selection more complex.
Streamline this process by explicitly moving link_rtrequest() logic
to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all
this logic in the loopback route case by explicitly specifying
proper rt_ifa inside the ifa_maintain_loopback_route().§
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20076
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.
Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.
Reviewed by: melifaro, olivier
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15789
Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just ignore drop the
flag in that place.
an rtentry. r334118 introduced a case when this was not done.
While we're here make the intent more obvious by moving the refcount
bump down to when we know we'll actually need it.
Reported by: markj
Increment the route table generation count after modifying a
route. This signals back to TCP connections that they need to
update their L2 caches as the gateway for their route may have
changed. This is a heavier hammer than is needed, strictly
speaking, but route changes will be unlikely enough that the
performance effects of invalidating all connection route caches
should be negligible.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13990
Reviewed by: karels