Possibly one could assert that ret should always be 0 here (that is,
that there was always an index found in the bitmask). That should be
true since a bitmask index is allocated before the nhgrp is inserted
in the ctl->gr_head list in link_nhgrp.
With IPv4 over IPv6 nexthops and IP->MPLS support, there is a need
to distingush "upper" e.g. traffic family and "neighbor" e.g. LLE/gateway
address family. Store them explicitly in the private part of the nexthop data.
While here, store nhop fibnum in nhop_prip datastructure to make it self-contained.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33663
Yet another problem created by VIMAGE/if_vmove/epair design that
relocates ifnet between vnets and changes if_index. Since if_index
changes, nhop hash values also changes, unlink_nhop() isn't able to
find entry in hash and leaks the nhop. Since nhop references ifnet,
the latter is also leaked. As result running network tests leaks
memory on every single test that creates vnet jail.
While here, rewrite whole hash_priv() to use static initializer,
per Alexander's suggestion.
Reviewed by: melifaro
Adding such nexthops breaks calc_min_mpath_slots() assumptions,
thus resulting in the incorrect nexthop group creation and
eventually leading to panic.
Reported by: avg
MFC after: 1 week
Some software references outgoing interfaces by specifying name instead of
index.
Use rti_ifp from rt_addrinfo if provided instead of always using
address interface when constructing nexthop.
PR: 255678
Reported by: martin.larsson2 at gmail.com
MFC after: 1 week
Implement kernel support for RFC 5549/8950.
* Relax control plane restrictions and allow specifying IPv6 gateways
for IPv4 routes. This behavior is controlled by the
net.route.rib_route_ipv6_nexthop sysctl (on by default).
* Always pass final destination in ro->ro_dst in ip_forward().
* Use ro->ro_dst to exract packet family inside if_output() routines.
Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.
* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
It leverages recent lltable changes committed in c541bd368f.
Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0
Differential Revision: https://reviews.freebsd.org/D30398
MFC after: 2 weeks
When a prefix gets deleted from the RIB, dpdk_lpm algo needs to know
the nexthop of the "parent" prefix to update its internal state.
The glue code, which utilises RIB as a backing route store, uses
fib[46]_lookup_rt() for the prefix destination after its deletion
to fetch the desired nexthop.
This approach does not work when deleting less-specific prefixes
with most-specific ones are still present. For example, if
10.0.0.0/24, 10.0.0.0/23 and 10.0.0.0/22 exist in RIB, deleting
10.0.0.0/23 would result in 10.0.0.0/24 being returned as a search
result instead of 10.0.0.0/22. This, in turn, results in the failed
datastructure update: part of the deleted /23 prefix will still
contain the reference to an old nexthop. This leads to the
use-after-free behaviour, ending with the eventual crashes.
Fix the logic flaw by properly fetching the prefix "parent" via
newly-created rt_get_inet[6]_parent() helpers.
Differential Revision: https://reviews.freebsd.org/D31546
PR: 256882,256833
MFC after: 1 week
Consistently use `nh` instead of always dereferencing
ro->ro_nh inside the if block.
Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses
of updating ro flags based on different nexthop.
Differential Revision: https://reviews.freebsd.org/D31451
Reviewed by: kp
MFC after: 2 weeks
When certain multipath route begins flapping really fast, it may
result in creating multiple identical nexthop groups. The code
responsible for unlinking unused nexthop groups had an implicit
assumption that there could be only one nexthop group for the
same combination of nexthops with weights. This assumption resulted
in always unlinking the first "identical" group, instead of the
desired one. Such action, in turn, produced a used-but-unlinked
nhg along with freed-and-linked nhg, ending up in random crashes.
Similarly, it is possible that multiple identical nexthops gets
created in the case of high route churn, resulting in the same
problem when deleting one of such nexthops.
Fix by matching the nexthop/nexhop group pointer when deleting the item.
Reported by: avg
MFC after: 1 week
IF non-existend gateway was specified, the code responsible for calculating
an updated nexthop group, returned the same already-used nexthop group.
After the route table update, the operation result contained the same
old & new nexthop groups. Thus, the code responsible for decomposing
the notification to the list of simple nexthop-level notifications,
was not able to find any differences. As a result, it hasn't updated any
of the "simple" notification fields, resulting in empty rtentry pointer.
This empty pointer was the direct reason of a panic.
Fix the problem by returning ESRCH when the new nexthop group is the same
as the old one after applying gateway filter.
Reported by: Michael <michael.adm at gmail.com>
PR: 255665
MFC after: 3 days
Provide wrapper for the rnh_walktree_from() rib callback.
As currently `struct rib_head` is considered internal to the
routing subsystem, this wrapper is necessary to maintain isolation
from the external code.
Differential Revision: https://reviews.freebsd.org/D29971
MFC after: 1 week
Currently, most of the rib(9) KPI does not use rnh pointers, using
fibnum and family parameters to determine the rib pointer instead.
This works well except for the case when we initialize new rib pointers
during fib growth.
In that case, there is no mapping between fib/family and the new rib,
as an entirely new rib pointer array is populated.
Address this by delaying fib algo initialization till after switching
to the new pointer array and updating the number of fibs.
Set datapath pointer to the dummy function, so the potential callers
won't crash the kernel in the brief moment when the rib exists, but
no fib algo is attached.
This change allows to avoid creating duplicates of existing rib functions,
with altered signature.
Differential Revision: https://reviews.freebsd.org/D29969
MFC after: 1 week
Modular fib lookup framework features logic that allows
route update batching for the algorithms that cannot easily
apply the routing change without rebuilding. As a result,
dataplane lookups may return old data until the the sync
takes place. With the default sync timeout of 50ms, it is
possible that new binary like ping(8) executed exactly after
route(8) will still use the old fib data.
To address some aspects of the problem, framework executes
all rtable changes without RTF_GATEWAY synchronously.
To fix the aforementioned problem, this diff extends sync
execution for all RTF_STATIC routes (e.g. ones maintained by
route(8).
This fixes a bunch of tests in the networking space.
Reported by: ci, arichardson
MFC after: 2 weeks
33cb3cb2e3 introduced an `rib_head` structure field under the
FIB_ALGO define. This may be problematic for the CTF, as some
of the files including `route_var.h` do not have `fib_algo`
defined.
Make dtrace happy by making the field unconditional.
Suggested by: markj
Currently, PCB caching mechanism relies on the rib generation
counter (rnh_gen) to invalidate cached nhops/LLE entries.
With certain fib algorithms, it is now possible that the
datapath lookup state applies RIB changes with some delay.
In that scenario, PCB cache will invalidate on the RIB change,
but the new lookup may result in the same nexthop being returned.
When fib algo finally gets in sync with the RIB changes, PCB cache
will not receive any notification and will end up caching the stale data.
To fix this, introduce additional counter, rnh_gen_rib, which is used
only when FIB_ALGO is enabled.
This counter is incremented by the control plane. Each time when fib algo
synchronises with the RIB, it updates rnh_gen to the current rnh_gen_rib value.
Differential Revision: https://reviews.freebsd.org/D29812
Reviewed by: donner
MFC after: 2 weeks
Fib algo uses a per-family array indexed by the fibnum to store
lookup function pointers and per-fib data.
Each algorithm rebuild currently requires re-allocating this array
to support atomic change of two pointers.
As in reality most of the changes actually involve changing only
data pointer, add a shortcut performing in-flight pointer update.
MFC after: 2 weeks
Some algorithms may require updating datapath and control plane
algo pointers after the (batched) updates.
Export fib_set_datapath_ptr() to allow setting the new datapath
function or data pointer from the algo.
Add fib_set_algo_ptr() to allow updating algo control plane
pointer from the algo.
Add fib_epoch_call() epoch(9) wrapper to simplify freeing old
datapath state.
Reviewed by: zec
Differential Revision: https://reviews.freebsd.org/D29799
MFC after: 1 week
Initial fib algo implementation was build on a very simple set of
principles w.r.t updates:
1) algorithm is ether able to apply the change synchronously (DIR24-8)
or requires full rebuild (bsearch, lradix).
2) framework falls back to rebuild on every error (memory allocation,
nhg limit, other internal algo errors, etc).
This changes brings the new "intermediate" concept - batched updates.
Algotirhm can indicate that the particular update has to be handled in
batched fashion (FLM_BATCH).
The framework will write this update and other updates to the temporary
buffer instead of pushing them to the algo callback.
Depending on the update rate, the framework will batch 50..1024 ms of updates
and submit them to a different algo callback.
This functionality is handy for the slow-to-rebuild algorithms like DXR.
Differential Revision: https://reviews.freebsd.org/D29588
Reviewed by: zec
MFC after: 2 weeks
The intent is to better handle time intervals with large amount of RIB
updates (e.g. BGP peer going up or down), while still keeping low sync
delay for the rest scenarios.
The implementation is the following: updates are bucketed into the
buckets of size 50ms. If the number of updates within a current bucket
exceeds the threshold of 500 routes/sec (e.g. 10 updates per bucket
interval), the update is delayed for another 50ms. This can be repeated
until the maximum update delay (1 sec) is reached.
All 3 variables are runtime tunables:
* net.route.algo.fib_max_sync_delay_ms: 1000
* net.route.algo.bucket_change_threshold_rate: 500
* net.route.algo.bucket_time_ms: 50
Differential Review: https://reviews.freebsd.org/D29588
MFC after: 2 weeks
The current code has the limit of 127 nexthop groups due to the
wrongly-checked bitmask_copy() return value.
PR: 254303
Reported by: Aleks <a.ivanov at veesp.com>
MFC after: 1 day
`struct weightened_nhop` has spare 32bit between the fields due to
the alignment (on amd64).
Not zeroing these spare bits results in duplicating nhop groups
in the kernel due to the way how comparison works.
MFC after: 1 day
In case with batch route delete via rib_walk_del(), when
some paths from the multipath route gets deleted, old
multipath group were not freed.
PR: 254496
Reported by: Zhenlei Huang <zlei.huang@gmail.com>
MFC after: 1 day
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
VNET.
PR: 253998
Reported by: rashey at superbox.pl
MFC after: 3 days
Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.
Test Plan:
```
set_skip:set_skip_group_lo -> passed [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```
Reviewers: #network, kp, bz
Reviewed By: kp
Subscribers: imp, ae
Differential Revision: https://reviews.freebsd.org/D29116
The routing stack control depends on quite a tree of functions to
determine the proper attributes of a route such as a source address (ifa)
or transmit ifp of a route.
When actually inserting a route, the stack needs to ensure that ifa and ifp
points to the entities that are still valid.
Validity means slightly more than just pointer validity - stack need guarantee
that the provided objects are not scheduled for deletion.
Currently, callers either ignore it (most ifp parts, historically) or try to
use refcounting (ifa parts). Even in case of ifa refcounting it's not always
implemented in fully-safe manner. For example, some codepaths inside
rt_getifa_fib() are referencing ifa while not holding any locks, resulting in
possibility of referencing scheduled-for-deletion ifa.
Instead of trying to fix all of the callers by enforcing proper refcounting,
switch to a different model.
As the rib_action() already requires epoch, do not require any stability guarantees
other than the epoch-provided one.
Use newly-added conditional versions of the refcounting functions
(ifa_try_ref(), if_try_ref()) and fail if any of these fails.
Reviewed by: donner
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28837
The case of adding interface route by specifying interface
address as the gateway was missed during code refactoring.
Re-add it back by copying non-AF_LINK gateway data when RTF_GATEWAY
is not set.
Reviewed by: donner
MFC after: 3 days