The pidx argument of isc_rxd_flush() indicates which is the last valid
receive descriptor to be used by the NIC. However, current code has
multiple issues:
- Intel drivers write pidx to their RDT register, which means that
NICs will only use the descriptors up to pidx-1 (modulo ring size N),
and won't actually use the one pointed by pidx. This does not break
reception, but it is anyway confusing and suboptimal (the NIC will
actually see only N-2 descriptors as available, rather than N-1).
Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic).
- The semantic used by Intel (RDT is one descriptor past the last
valid one) is used by most (if not all) NICs, and it is also used
on the TX side (also in iflib). Since iflib is not currently
using this semantic for RX, it must decrement fl->ifl_pidx
(modulo N) before calling isc_rxd_flush(), and then the
per-driver callback implementation must increment the index
again (to match the real semantic). This is confusing and suboptimal.
- The iflib refill function is also called at initialization.
However, in case the ring size is smaller than 128 (e.g. if_mgb),
the refill function will actually prepare all the receive
descriptors (N), without leaving one unused, as most of NICs assume
(e.g. to avoid RDT to overrun RDH). I can speculate that the code
looks like this right now because this issue showed up during
testing (e.g. with if_mgb), and it was easy to workaround by
decrementing pidx before isc_rxd_flush().
The goal of this change is to simplify the code (removing a bunch
of instructions from the RX fast path), and to make the semantic of
isc_rxd_flush() consistent across drivers. To achieve this, we:
- change the semantics of the pidx argument to the usual one (that
is the index one past the last valid one), so that both iflib and
drivers avoid the decrement/increment dance.
- fix the initialization code to prepare at most N-1 descriptors.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26191
No functional changes.
Initially this function was created to perform runtime flag conversions
for the previous incarnation of fib lookup functions. As these functions
got deprecated, move the function to the file with the only remaining
caller. Lastly, rename it to convert_rt_to_nh_flags() to follow the
naming notation.
No functional changes.
net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
of functions and structures between the routing subsystem components. At
that time route_var.h was included by many files external to the routing
subsystem, which largerly defeats its purpose.
As currently this is not the case anymore and amount of route_var.h includes
is roughly the same as shared.h, retire the latter in favour of the former.
As nexthops are immutable, some operations such as route attribute changes
require nexthop fetching, forking, modification and route switching.
These operations are not atomic, so they may need to be retried multiple
times in presence of multiple speakers changing the same route.
This change introduces "synchronisation" primitive: route_update_conditional(),
simplifying logic for route changes and upcoming multipath operations.
Differential Revision: https://reviews.freebsd.org/D26216
At initialization time, the netmap RX refill function used to
prepare the NIC RX ring with N-1 buffers rather than N (with
N equal to the number of descriptors in the NIC RX ring).
This is not how netmap is supposed to work, as it would keep
kring->nr_hwcur not in sync with the NIC "next index to refill"
(i.e., fl->ifl_pidx). Instead we prepare N buffers, although we
still publish (with isc_rxd_flush()) only the first N-1 buffers,
to avoid the NIC producer pointer to overrun the NIC consumer
pointer (for NICs where this is a real issue, e.g. Intel ones).
MFC after: 2 weeks
rtentry lock traditionally served 2 purposed: first was protecting refcounts,
the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
the need for the latter reduced.
To be more precise, the following rte field are mutable:
rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.
rt_nhop and rt_weight (addressed in this review) are read under rib lock and
stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.
rt_expire and rte_flags reads will be dealt in a separate reviews soon.
Differential Revision: https://reviews.freebsd.org/D26162
The semantic of the pidx argument of isc_rxd_flush() is the
last valid index of in the free list, rather than the next
index to be published. However, netmap was still using the
old convention. While there, also refactor the netmap_fl_refill()
to simplify a little bit and add an assertion.
MFC after: 2 weeks
No functional changes.
Most of the routing flags are stored in the netxtop instead of rtentry.
Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code
checking routing flags.
In the new multipath code, rt->rt_nhop may actually point to nexthop group
instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->...
accesses.
Differential Revision: https://reviews.freebsd.org/D26156
Allow to dynamically grow the amount of fibs in each vnet.
This change alters current behavior. Currently, if one defines
ROUTETABLES > 1 in the kernel config, each vnet will be created
with the number of fibs defined in the kernel config.
After this commit vnets will be created with fibs=1.
Dynamic net.fibs is not compatible with net.add_addr_allfibs.
The plan is to deprecate the latter and make
net.add_addr_allfibs=0 default behaviour.
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26062
boundry change the last two IF_Mbps(2500) and additionally one
IF_Mbps(5000) to ULL as well.
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")
Remove unused arguments from dom_rtattach/dom_rtdetach functions and make
them return/accept 'struct rib_head' instead of 'void **'.
Declare inet/inet6 implementations in the relevant _var.h headers similar
to domifattach / domifdetach.
Add rib_subscribe_internal() function to accept subscriptions to the rnh
directly.
Differential Revision: https://reviews.freebsd.org/D26053
The lagg_clone_destroy() handles detach and waiting for ifconfig callers
to drain already.
This narrows the race for 2 panics that the tests triggered. Both were a
consequence of adding a port to the lagg device after it had already detached
from all of its ports. The link state task would run after lagg_clone_destroy()
free'd the lagg softc.
kernel:trap_fatal+0xa4
kernel:trap_pfault+0x61
kernel:trap+0x316
kernel:witness_checkorder+0x6d
kernel:_sx_xlock+0x72
if_lagg.ko:lagg_port_state+0x3b
kernel:if_down+0x144
kernel:if_detach+0x659
if_tap.ko:tap_destroy+0x46
kernel:if_clone_destroyif+0x1b7
kernel:if_clone_destroy+0x8d
kernel:ifioctl+0x29c
kernel:kern_ioctl+0x2bd
kernel:sys_ioctl+0x16d
kernel:amd64_syscall+0x337
kernel:trap_fatal+0xa4
kernel:trap_pfault+0x61
kernel:trap+0x316
kernel:witness_checkorder+0x6d
kernel:_sx_xlock+0x72
if_lagg.ko:lagg_port_state+0x3b
kernel:do_link_state_change+0x9b
kernel:taskqueue_run_locked+0x10b
kernel:taskqueue_run+0x49
kernel:ithread_loop+0x19c
kernel:fork_exit+0x83
PR: 244168
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D25284
After moving the route control plane code from net/route.c,
all rtzone users ended up being in net/route_ctl.c.
Move uma(9) rtzone setup/teardown code to net/route_ctl.c as well
to have everything in a single place.
While here, remove custom initializers from the zone.
It was added originally to avoid setup/teardown of costy per-cpu couters.
With these counters removed, the only remaining job was avoiding rte mutex
setup/teardown. Mutex setup is relatively cheap. Additionally, this mutex
will soon be removed. With that in mind, there is no sense in keeping
custom zone callbacks.
Differential Revision: https://reviews.freebsd.org/D26051
It is possible for rn_delete() to return NULL. If this happens, then set
*perror to ESRCH, as is done in the rest of the function.
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D25871
For drivers with IFLIB_HAS_RXCQ set, there is a separate completion
queue. In this case, the netmap rxsync routine needs to update
rxq->ifr_cq_cidx in the same way it is updated by iflib_rxeof().
This improves the situation for vmx(4) and bnxt(4) drivers, which
use iflib and have the IFLIB_HAS_RXCQ bit set.
PR: 248494
MFC after: 3 weeks
First, fix the initialization of the fl->ifl_rxd_idxs array,
which was affected by an off-by-one bug.
Once there, refactor the function to use better names for
local variables, optimize the variable assignments, and
merge the bus_dmamap_sync() inner loop with the outer one.
PR: 248494
MFC after: 3 weeks
Netmap only uses free list 0 to keep it consistent with its
one-to-one mapping between each netmap ring and a device RX
(or TX) queue.
However, the current iflib_netmap_rxsync() routine was
mistakenly updating the ifl_cidx field of both free lists.
PR: 248494
MFC after: 2 weeks
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplyify the kernel memory allocator interfaces a bit.
Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
In case the network device has a RX or TX control queue, the correct
number of TX/RX descriptors is contained in the second entry of the
isc_ntxd (or isc_nrxd) array, rather than in the first entry.
This case is correctly handled by iflib_device_register() and
iflib_pseudo_register(), but not by iflib_netmap_attach().
If the first entry is larger than the second, this can result in a
panic. This change fixes the bug by introducing two helper functions
that also lead to some code simplification.
PR: 247647
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D25541
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
While it doesn't trigger INVARIANTS or WITNESS on head it does in stable/12.
There's also no reason for it, as we can easily report the out of memory error
to the caller (i.e. userspace). All of these can already fail.
PR: 248046
MFC after: 3 days
if_attach() -> if_attach_internal() will call if_attachdomain1(ifp) any time
an ethernet interface is setup *after*
SI_SUB_PROTO_IFATTACHDOMAIN/SI_ORDER_FIRST. This eventually leads to
nd6_ifattach() -> nd6_setmtu0() stashing off ifp->if_mtu in ndi->maxmtu
*before* ifp->if_mtu has been properly set in some scenarios, e.g., USB
ethernet adapter plugged in later on.
For interfaces that are created in early boot, we don't have this issue as
domains aren't constructed enough for them to attach and thus it gets
deferred to domainifattach at SI_SUB_PROTO_IFATTACHDOMAIN/SI_ORDER_SECOND
*after* the mtu has been set earlier in ether_ifattach().
PR: 248005
Submitted by: Mathew <mjanelle blackberry com>
MFC after: 1 week
Old subscription model allowed only single customer.
Switch inet6 to the new subscription api and eliminate the old model.
Differential Revision: https://reviews.freebsd.org/D25615
Subscriptions are planned to be used by modules such as route lookup engines.
In that case that's the module task to properly unsibscribe before detach.
However, the in-kernel customer - inet6 wants to track default route changes.
To avoid having inet6 store per-fib subscriptions, handle automatic
destruction internally.
Differential Revision: https://reviews.freebsd.org/D25614
Now nhop_ref_object() unconditionally acquires a reference, and the new
nhop_try_ref_object() uses refcount_acquire_if_not_zero() to
conditionally acquire a reference. Since the former is cheaper, use it
when we know that the initial counter value is non-zero. No functional
change intended.
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D25535
- Get rid of the ifl_vm_addrs array. It is not used by any existing
consumer, so we are just dirtying a couple of cache lines for no
reason.
- Use uma_zalloc(fl->ifl_zone) instead of m_cljget(). Otherwise
m_cljget() is doing unnecessary work to look up the correct zone, when
iflib already knows what that zone is.
- ifl_gen is only used when INVARIANTS is on, so make that more clear.
- Fix some style nits and inconsistencies.
Reviewed by: gallatin
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25490
When refilling an rx freelist, make sure we only update the hardware
producer index if at least one cluster was allocated. Otherwise the
NIC is programmed to write a previously used cluster, typically
resulting in a use-after-free when packet data is written by the
hardware.
Also make sure that we don't update the fragment index cursor if the
last allocation attempt didn't succeed. For at least Intel drivers,
iflib assumes that the consumer index and fragment index cursor stay in
lockstep, but this assumption was violated in the face of cluster
allocation failures.
Reported and tested by: pho
Reviewed by: gallatin, hselasky
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25489
fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees
over pointer validness and requiring on-stack data copying.
With no callers remaining, remove fib[46]_lookup_nh_ functions.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25445
The new function operates similarly to ifconfig_lagg_get_lagg_status and
likewise is accompanied by a function to free the bridge status data structure.
I have included in this patch the relocation of some strings describing STP
parameters and the PV2ID macro from ifconfig into net/if_bridgevar.h as they
are useful for consumers of libifconfig.
Reviewed by: kp, melifaro, mmacy
Approved by: mmacy (mentor)
MFC after: 1 week
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D25460
In the current iflib_netmap_rxsync, there is nothing that prevents
kring->nr_hwtail to overrun kring->nr_hwcur during the descriptor
import phase. This may cause errors in netmap applications, such as:
em1 RX0: fail 'head < kring->nr_hwcur || head > kring->nr_hwtail'
h 795 c 795 t 282 rh 795 rc 795 rt 282 hc 282 ht 282
Reviewed by: gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D25252
- a cloneattach failure will not currently be handled correctly,
jump to the right target
- pseudo devices are all treat as if they're ethernet devices -
this often doesn't make sense
MFC after: 1 week
Sponsored by: Netgate, Inc.
Differential Revision: https://reviews.freebsd.org/D25083
This OID was added in r17352 but the write path of IFDATA_LINKSPECIFIC
seems unused as there are no in-base writers, and as far as I can tell
we had issues with this code before, see PR 219472. Drop the write path
to make the handler read-only as described in comments and man-pages.
It can be marked as MPSAFE now.
Reviewed by: bdragon, kib, melifaro, wollman
Approved by: kib (mentor)
Sponsored by: Mysterious Code Ltd.
Differential Revision: https://reviews.freebsd.org/D25348
r286700 added the "lacp_fast_timeout" option to `ifconfig', but we forgot to
include the new option in the string used to decode the option bits. Add
"LACP_FAST_TIMO" to LAGG_OPT_BITS.
Also, s/LAGG_OPT_LACP_TIMEOUT/LAGG_OPT_LACP_FAST_TIMO/g , to be clearer that
the flag indicates "Fast Timeout" mode.
Reported by: Greg Foster <gfoster at panasas dot com>
Reviewed by: jpaetzel
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D25239
This simplifies the code and allows to further split rtentry and nexthop,
removing one of the blockers for multipath code introduction, described in
D24141.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25192
In the receive interrupt routine, always call netmap_rx_irq().
The latter function will return != NM_IRQ_PASS if netmap is not
active on that specific receive queue, so that the driver can go
on with iflib_rxeof(). Note that netmap supports partial opening,
where only a subset of the RX or TX rings can be open in netmap mode.
Checking the IFCAP_NETMAP flag is not enough to make sure that the
queue is indeed in netmap mode.
Moreover, in case netmap_rx_irq() returns NM_IRQ_RESCHED, it means
that netmap expects the driver to call netmap_rx_irq() again as soon
as possible. Currently, this may happen when the device is attached
to a VALE switch.
Reviewed by: gallatin
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25167
At this point, AES is the more common name for Rijndael128. setkey(8)
will still accept the old name, and old constants remain for
compatiblity.
Reviewed by: cem, bcr (manpages)
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24964
After r339550 tunneling interfaces have started handle appearing and
disappearing of ingress IP address on the host system.
When such interfaces are moving into VNET jail, they lose ability to
properly handle ifaddr_event_ext event. And this leads to need to
reconfigure tunnel to make it working again.
Since moving an interface into VNET jail leads to removing of all IP
addresses, it looks consistent, that tunnel configuration should also
be cleared. This is what will do if_reassing method.
Reported by: John W. O'Brien <john saltant com>
MFC after: 1 week
Currently there is no easy way of subscribing for the routing table changes.
The only existing way is to set ifa_rtrequest callback in the each protocol
ifaddr, which is not convenient or extandable.
This change provides generic notification subscription mechanism, that will
replace current ifa_rtrequest one and allow other applications such as
accelerated routing lookup modules subscribe for the changes.
In particular, this change provides 2 hooks: 1) synchronous one
(RIB_NOTIFY_IMMEDIATE), called under RIB_WLOCK, which ensures exact
ordering of the changes and 2) async one, (RIB_NOTIFY_DELAYED)
that is called after the change w/o holding locks. The latter one does not
provide any notification ordering guarantee.
Differential Revision: https://reviews.freebsd.org/D25070
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this appoach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.
Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.
With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.
* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
returned in case of successful operation result. There are two problems with
this appoach. First is that it doesn't provide enough information for the
upcoming multipath changes, where rtentry refers to a new nexthop group,
and there is no way of guessing which paths were added during the change.
Second is that some rtentry fields can change during notification and
protecting from it by requiring customers to unlock rtentry is not desired.
Additionally, as the consumers such as rtsock do know which operation they
request in advance, making explicit add/change/del versions of the functions
makes sense, especially given the functions don't share a lot of code.
With that in mind, introduce rib_cmd_info notification structure and
rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
It will be used in upcoming generalized notifications.
* Move definitions of the new functions and some other functions/structures
used for the routing table manipulation to a separate header file,
net/route/route_ctl.h. net/route.h is a frequently used file included in
~140 places in kernel, and 90% of the users don't need these definitions.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D25067
This fixes ipsec.ko to include all of IPSEC_DEBUG.
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25046
multipath control plane changed described in D24141.
Currently route.c contains core routing init/teardown functions, route table
manipulation functions and various helper functions, resulting in >2KLOC
file in total. This change moves most of the route table manipulation parts
to a dedicated file, simplifying planned multipath changes and making
route.c more manageable.
Differential Revision: https://reviews.freebsd.org/D24870
After making rtentry reclamation backed by epoch(9) in r361409, there is
no reason in keeping reference counting code.
Differential Revision: https://reviews.freebsd.org/D24867
Currently the only reason of refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in more efficient way.
The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .
Differential Revision: https://reviews.freebsd.org/D24866
Fix another collateral damage of r357614: netisr is initialised way before
malloc() is available hence it can't use sysctl_handle_string() that
allocates temporary buffer. Handle that internally in
sysctl_netisr_dispatch_policy().
PR: 246114
Reported by: delphij
Reviewed by: kib
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D24858
Right now we optionally allocate 8 counters per table entry, so in
addition to memory consumed by counters, we require 8 pointers worth of
space in each entry even when counters are not allocated (the default).
Instead, define a UMA zone that returns contiguous per-CPU counter
arrays for use in table entries. On amd64 this reduces sizeof(struct
pfr_kentry) from 216 to 160. The smaller size also results in better
slab efficiency, so memory usage for large tables is reduced by about
28%.
Reviewed by: kp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24843
__builtin_unreachable doesn't raise any compile-time warnings/errors on its
own, so problems with its usage can't be easily detected. While it would be
nice for this situation to change and compilers to at least add a warning
for trivial cases where local state means the instruction can't be reached,
this isn't the case at the moment and likely will not happen.
This commit adds an __assert_unreachable, whose intent is incredibly clear:
it asserts that this instruction is unreachable. On INVARIANTS builds, it's
a panic(), and on non-INVARIANTS it expands to __unreachable().
Existing users of __unreachable() are converted to __assert_unreachable,
to improve debuggability if this assumption is violated.
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D23793
rnh_close callbackes was used by the in[6]_clsroute() handlers,
doing cleanup in the route cloning code. Route cloning was eliminated
somewhere around r186119. Last callback user was eliminated in r186215,
11 years ago.
Differential Revision: https://reviews.freebsd.org/D24793
Last user of rtalloc1() KPI has been eliminated in rS360631.
As kernel is now fully switched to use new routing KPI defined in
rS359823, remove old lookup functions.
Differential Revision: https://reviews.freebsd.org/D24776
Currently each rtentry has dst&gateway allocated separately from another zone,
bloating cache accesses.
Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning,
leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64).
Fields needed for the datapath are destination sockaddr and rt_nhop.
So far it doesn't look like there is other routable addressing protocol other
than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes.
With that in mind, embed dst into struct rtentry, making the first 24 bytes
of rtentry within 128 bytes. That is enough to make IPv6 address within first
128 bytes.
It is still pretty easy to add code for supporting separately-allocated dst,
however it doesn't make a lot of sense in having such code without a use case.
As rS359823 moved the gateway to the nexthop structure, the dst embedding change
removes the need for any additional allocations done by rt_setgate().
Lastly, as a part of cleanup, remove counter(9) allocation code, as this field
is not used in packet processing anymore.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24669
Create rib_lookup() wrapper around per-af dataplane lookup functions.
This will help in the cases of having control plane af-agnostic code.
Switch ifa_ifwithroute() to use this function instead of rtalloc1().
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D24731
Eliminate the last rtalloc1() call to finish transition to the new routing
KPI defined in r359823.
Differential Revision: https://reviews.freebsd.org/D24663
After converting routing subsystem customers to use nexthop objects
defined in r359823, some fields in struct rtentry became unused.
This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
along with the code initializing and updating these fields.
Cleanup of the remaining fields will be addressed by D24669.
This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
nexthop and tries to update rte nhop pointer. Only last part is done under
WLOCK.
In the hypothetical scenarious where multiple rtsock clients
repeatedly issue RTM_CHANGE requests for the same route, route may get
updated between read and update operation. This is addressed by retrying
the operation multiple (3) times before returning failure back to the
caller.
Differential Revision: https://reviews.freebsd.org/D24666
With a length of 16, the name ("<if name>:TX(<qid>):callout") typically
gets truncated.
PR: 245712
Reported by: ghuckriede@blackberry.com
MFC after: 1 week
Continue routing subsystem conversion to nhop objects defined in r359823.
Use fields from nhop structure instead of "struct rtentry" fields.
This is one of the last changes prior to removing rt_ifp, rt_ifa,
rt_gateway and rt_mtu from struct rtentry.
Differential Revision: https://reviews.freebsd.org/D24609
With the upcoming multipath changes described in D24141,
rt->rt_nhop can potentially point to a nexthop group instead of
an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
to pass changed nhop directly.
Differential Revision: https://reviews.freebsd.org/D24604
Currently functionality resides in rtsock.c, which is a controlling
interface, partially external to the routing subsystem.
Additionally, DDB-supporting functionality is > 100SLOC, which deserves
a separate file.
Given that, move this functionality to a newly-created net/route/ subdir.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D24561
Nexthop objects implementation, defined in r359823,
introduced sys/net/route directory intended to hold all
routing-related code. Move recently-introduced route_temporal.c and
private route_var.h header there.
Differential Revision: https://reviews.freebsd.org/D24597
One of the goals of the new routing KPI defined in r359823
is to entirely hide`struct rtentry` from the consumers.
It will allow to improve routing subsystem internals and deliver
features much faster.
This is one of the last changes, effectively moving struct rtentry
definition to a net/route_var.h header, internal to the routing subsystem.
Differential Revision: https://reviews.freebsd.org/D24580
New fib[46]_lookup() functions support multipath transparently.
Given that, switch the last rtalloc_mpath_fib() calls to
dib4_lookup() and eliminate the function itself.
Note: proper flowid generation (especially for the outbound traffic) is a
bigger topic and will be handled in a separate review.
This change leaves flowid generation intact.
Differential Revision: https://reviews.freebsd.org/D24595
r360292 switched most of the remaining routing customers to a new KPI,
leaving a bunch of wrappers for old routing lookup functions unused.
Remove them from the tree as a part of routing cleanup.
Differential Revision: https://reviews.freebsd.org/D24569
This patch is intended to solve a specific problem that iavf(4)
encounters, but what it does can be extended to solve other issues.
To summarize the iavf(4) issue, if the PF driver configures VLAN
anti-spoof, then the VF driver needs to make sure no untagged traffic is
sent if a VLAN is configured, and vice-versa. This can be an issue when
a VLAN is being registered or unregistered, e.g. when a packet may be on
the ring with a VLAN in it, but the VLANs are being unregistered. This
can cause that tagged packet to go out and cause an MDD event.
To fix this, include a new interface-dependent function that drivers can
implement named IFDI_NEEDS_RESTART(). Right now, this function is called
in iflib_vlan_unregister/register() to determine whether the interface
needs to be stopped and started when a VLAN is registered or
unregistered. The default return value of IFDI_NEEDS_RESTART() is true,
so this fixes the MDD problem that iavf(4) encounters, since the
interface rings are flushed during a stop/init.
A future change to iavf(4) will implement that function just in case the
default value changes, and to make it explicit that this interface reset
is required when a VLAN is added or removed.
Reviewed by: gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22086
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.
Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D24555
Run the bridge datapath under epoch, rather than under the
BRIDGE_LOCK().
We still take the BRIDGE_LOCK() whenever we insert or delete items in
the relevant lists, but we use epoch callbacks to free items so that
it's safe to iterate the lists without the BRIDGE_LOCK.
Tests on mercat5/6 shows this increases bridge throughput significantly,
from 3.7Mpps to 18.6Mpps.
Reviewed by: emaste, philip, melifaro
MFC after: 2 months
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24250
This change is build on top of nexthop objects introduced in r359823.
Nexthops are separate datastructures, containing all necessary information
to perform packet forwarding such as gateway interface and mtu. Nexthops
are shared among the routes, providing more pre-computed cache-efficient
data while requiring less memory. Splitting the LPM code and the attached
data solves multiple long-standing problems in the routing layer,
drastically reduces the coupling with outher parts of the stack and allows
to transparently introduce faster lookup algorithms.
Route caching was (re)introduced to minimise (slow) routing lookups, allowing
for notably better performance for large TCP senders. Caching works by
acquiring rtentry reference, which is protected by per-rtentry mutex.
If the routing table is changed (checked by comparing the rtable generation id)
or link goes down, cache record gets withdrawn.
Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
cached object, which is mostly mechanical. Other moving parts like cache
cleanup on rtable change remains the same.
Differential Revision: https://reviews.freebsd.org/D24340
One of the goals of the new routing KPI defined in r359823 is to entirely
hide`struct rtentry` from the consumers. It will allow to improve routing
subsystem internals and deliver more features much faster.
This commit is mostly mechanical change to eliminate direct struct rtentry
field accesses.
The only notable difference is AF_LINK gateway encoding.
AF_LINK gw is used in routing stack for operations with interface routes
and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
datapath to verify address scope for link-local interfaces.
Kernel uses struct sockaddr_dl for this type of gateway. This structure
allows for specifying rich interface data, such as mac address and interface
name. However, this results in relatively large structure size - 52 bytes.
Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside
in the first 8 bytes of the structure.
In the new KPI, struct nhop_object tries to be cache-efficient, hence
embodies gateway address inside the structure. In the AF_LINK case it
stores stortened version of the structure - struct sockaddr_dl_short,
which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
gateway will not be used in the kernel at all, leaving rtsock as the only
potential concern.
The difference in rtsock reporting:
(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0
(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 link#5 255.255.255.0
Note 40 bytes different (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.
It is worth noting that these particular messages (interface routes) are mostly
ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway
More detailed overview on how rtsock messages are used by the
routing daemons to reconstruct the kernel view, can be found in D22974.
Differential Revision: https://reviews.freebsd.org/D24519
Unconditionally use ether_gen_addr() to generate bridge mac addresses. This
function is now less likely to generate duplicate mac addresses across jails.
The old hand rolled hostid based code adds no value.
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D24432
If we create two (vnet) jails and create a bridge interface in each we end up
with the same mac address on both bridge interfaces.
These very often conflicts, resulting in same mac address in both jails.
Mitigate this problem by including the jail name in the mac address.
Reviewed by: kevans, melifaro
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D24383
Generic if_output() callback signature was modified to use struct route
instead of struct rtentry in r191148, back in 2009.
Quoting commit message:
Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2
Fix bridge_output() to match this signature and update the remaining
comment in if_var.h.
Reviewed by: kp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24394