Commit Graph

107 Commits

Author SHA1 Message Date
Alexander V. Chernikov
41ce0e34ea [fib algo] Update fib_gen counter under FIB_MOD_LOCK.
MFC after:	3 days
2021-04-28 20:23:03 +00:00
Alexander V. Chernikov
f9668e42b4 Add rib_walk_from() wrapper for selective rib tree traversal.
Provide wrapper for the rnh_walktree_from() rib callback.
As currently `struct rib_head` is considered internal to the
 routing subsystem, this wrapper is necessary to maintain isolation
 from the external code.

Differential Revision: https://reviews.freebsd.org/D29971
MFC after:	1 week
2021-04-28 08:09:45 +00:00
Alexander V. Chernikov
8a0d57baec [fib algo] Delay algo init at fib growth to to allow to reliably use rib KPI.
Currently, most of the rib(9) KPI does not use rnh pointers, using
 fibnum and family parameters to determine the rib pointer instead.
This works well except for the case when we initialize new rib pointers
 during fib growth.
In that case, there is no mapping between fib/family and the new rib,
 as an entirely new rib pointer array is populated.

Address this by delaying fib algo initialization till after switching
 to the new pointer array and updating the number of fibs.
Set datapath pointer to the dummy function, so the potential callers
 won't crash the kernel in the brief moment when the rib exists, but
 no fib algo is attached.

This change allows to avoid creating duplicates of existing rib functions,
 with altered signature.

Differential Revision: https://reviews.freebsd.org/D29969
MFC after:	1 week
2021-04-27 22:10:08 +00:00
Alexander V. Chernikov
439d087d0b [fib algo] always commit static routes synchronously.
Modular fib lookup framework features logic that allows
 route update batching for the algorithms that cannot easily
 apply the routing change without rebuilding. As a result,
 dataplane lookups may return old data until the the sync
 takes place. With the default sync timeout of 50ms, it is
 possible that new binary like ping(8) executed exactly after
 route(8) will still use the old fib data.

To address some aspects of the problem, framework executes
 all rtable changes without RTF_GATEWAY synchronously.

To fix the aforementioned problem, this diff extends sync
 execution for all RTF_STATIC routes (e.g. ones maintained by
 route(8).
This fixes a bunch of tests in the networking space.

Reported by:	ci, arichardson
MFC after:	2 weeks
2021-04-27 08:31:40 +00:00
Alexander V. Chernikov
bc5ef45aec Fix drace CTF for the rib_head.
33cb3cb2e3 introduced an `rib_head` structure field under the
FIB_ALGO define. This may be problematic for the CTF, as some
 of the files including `route_var.h` do not have `fib_algo`
 defined.

Make dtrace happy by making the field unconditional.

Suggested by:	markj
2021-04-27 07:47:53 +00:00
Alexander V. Chernikov
7d222ce3c1 Fix NOINET[6],!VIMAGE builds after FIB_ALGO addition to GENERIC
Reported by:	jbeich
PR:		255390
2021-04-21 05:53:42 +01:00
Alexander V. Chernikov
67372fb3e0 Fix NOINET[6] build after enabling FIB_ALGO in GENERIC.
Submitted by:	jbeich
PR:		255389
2021-04-21 02:49:18 +01:00
Alexander V. Chernikov
c23385612d [fib algo] Do not print algo attach/detach message on boot
MFC after:	1 day
2021-04-25 08:58:06 +00:00
Alexander V. Chernikov
a81e2e7890 Make gcc happy by initializing error in rib_handle_ifaddr_info(). 2021-04-25 08:44:59 +00:00
Stefan Eßer
6409e59427 Fix build with gcc
Correctly declare function without arguments as f(void) instead of f().
2021-04-25 10:15:17 +02:00
Alexander V. Chernikov
33cb3cb2e3 Fix rib generation count for fib algo.
Currently, PCB caching mechanism relies on the rib generation
 counter (rnh_gen) to invalidate cached nhops/LLE entries.

With certain fib algorithms, it is now possible that the
 datapath lookup state applies RIB changes with some delay.
In that scenario, PCB cache will invalidate on the RIB change,
 but the new lookup may result in the same nexthop being returned.
When fib algo finally gets in sync with the RIB changes, PCB cache
 will not receive any notification and will end up caching the stale data.

To fix this, introduce additional counter, rnh_gen_rib, which is used
 only when FIB_ALGO is enabled.
This counter is incremented by the control plane. Each time when fib algo
 synchronises with the RIB, it updates rnh_gen to the current rnh_gen_rib value.

Differential Revision: https://reviews.freebsd.org/D29812
Reviewed by:	donner
MFC after:	2 weeks
2021-04-20 22:02:41 +00:00
Alexander V. Chernikov
0abb6ff590 fib algo: do not reallocate datapath index for datapath ptr update.
Fib algo uses a per-family array indexed by the fibnum to store
 lookup function pointers and per-fib data.

Each algorithm rebuild currently requires re-allocating this array
 to support atomic change of two pointers.

As in reality most of the changes actually involve changing only
 data pointer, add a shortcut performing in-flight pointer update.

MFC after:	2 weeks
2021-04-18 16:12:13 +01:00
Alexander V. Chernikov
e2f79d9e51 Fib algo: extend KPI by allowing algo to set datapath pointers.
Some algorithms may require updating datapath and control plane
 algo pointers after the (batched) updates.

Export fib_set_datapath_ptr() to allow setting the new datapath
 function or data pointer from the algo.
Add fib_set_algo_ptr() to allow updating algo control plane
 pointer from the algo.
Add fib_epoch_call() epoch(9) wrapper to simplify freeing old
 datapath state.

Reviewed by:		zec
Differential Revision: https://reviews.freebsd.org/D29799
MFC after:		1 week
2021-04-18 16:12:12 +01:00
Alexander V. Chernikov
6b8ef0d428 Add batched update support for the fib algo.
Initial fib algo implementation was build on a very simple set of
 principles w.r.t updates:

1) algorithm is ether able to apply the change synchronously (DIR24-8)
 or requires full rebuild (bsearch, lradix).
2) framework falls back to rebuild on every error (memory allocation,
 nhg limit, other internal algo errors, etc).

This changes brings the new "intermediate" concept - batched updates.
Algotirhm can indicate that the particular update has to be handled in
 batched fashion (FLM_BATCH).
The framework will write this update and other updates to the temporary
 buffer instead of pushing them to the algo callback.
Depending on the update rate, the framework will batch 50..1024 ms of updates
 and submit them to a different algo callback.

This functionality is handy for the slow-to-rebuild algorithms like DXR.

Differential Revision:	https://reviews.freebsd.org/D29588
Reviewed by:	zec
MFC after:	2 weeks
2021-04-14 23:54:11 +01:00
Alexander V. Chernikov
ee2cf2b360 Implement better rebuild-delay fib algo policy.
The intent is to better handle time intervals with large amount of RIB
updates (e.g. BGP peer going up or down), while still keeping low sync
delay for the rest scenarios.

The implementation is the following: updates are bucketed into the
buckets of size 50ms. If the number of updates within a current bucket
 exceeds the threshold of 500 routes/sec (e.g. 10 updates per bucket
interval), the update is delayed for another 50ms. This can be repeated
 until the maximum update delay (1 sec) is reached.

All 3 variables are runtime tunables:

* net.route.algo.fib_max_sync_delay_ms: 1000
* net.route.algo.bucket_change_threshold_rate: 500
* net.route.algo.bucket_time_ms: 50

Differential Review:	https://reviews.freebsd.org/D29588
MFC after:		2 weeks
2021-04-09 21:33:03 +01:00
Alexander V. Chernikov
0c2a0e0380 Fix typo in the 9fa8d1582b.
Reported by:	cy
2021-03-29 23:42:48 +00:00
Alexander V. Chernikov
9fa8d1582b Put bandaid for nhgrp_dump_sysctl() malloc KASSERT().
Recent rtsock changes widened epoch and covered nhgrp_dump_sysctl(),
  resulting in `netstat -4On` triggering with KASSERT.

MFC after:	1 day
2021-03-29 23:12:11 +00:00
Alexander V. Chernikov
0f30a36ded Rename variables inside nexhtop group consider_resize() code.
No functional changes.

MFC after: 3 days
2021-03-29 23:06:13 +00:00
Alexander V. Chernikov
9095dc7da4 Fix nexhtop group index array scaling.
The current code has the limit of 127 nexthop groups due to the
 wrongly-checked bitmask_copy() return value.

PR: 254303
Reported by:	Aleks <a.ivanov at veesp.com>
MFC after: 1 day
2021-03-29 23:00:17 +00:00
Alexander V. Chernikov
6f43c72b47 Zero struct weightened_nhop fields in nhgrp_get_addition_group().
`struct weightened_nhop` has spare 32bit between the fields due to
 the alignment (on amd64).
Not zeroing these spare bits results in duplicating nhop groups
 in the kernel due to the way how comparison works.

MFC after:	1 day
2021-03-20 08:26:03 +00:00
Alexander V. Chernikov
24cd2796cf Fix !VNET build broken by 66f138563b. 2021-03-25 00:31:08 +00:00
Alexander V. Chernikov
66f138563b Plug nexthop group refcount leak.
In case with batch route delete via rib_walk_del(), when
 some paths from the multipath route gets deleted, old
 multipath group were not freed.

PR:    254496
Reported by:   Zhenlei Huang <zlei.huang@gmail.com>
MFC after:     1 day
2021-03-24 23:52:18 +00:00
Alexander V. Chernikov
c00e2f573b Fix build for non-vnet non-multipath kernels broken by
a0308e48ec.
2021-03-23 23:35:23 +00:00
Alexander V. Chernikov
a0308e48ec Fix panic when destroying interface with ECMP routes.
Reported by:	Zhenlei Huang <zlei.huang at gmail.com>
PR:		254496
MFC after:	immediately
2021-03-23 22:03:20 +00:00
Alexander V. Chernikov
2476178e6b Fix kassert panic when inserting multipath routes from multiple threads.
Reported by:	Marco Zec <zec at fer.hr>
MFC after:	immediately
2021-03-21 18:15:29 +00:00
Alexander V. Chernikov
e4ac3f7463 Fix fib algo rebuild delay calculation.
Submitted by:	Marco Zec <zec at fer.hr>
MFC after:	3 days
2021-03-15 21:09:07 +00:00
Alexander V. Chernikov
b1d63265ac Flush remaining routes from the routing table during VNET shutdown.
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
 VNET.

PR:	253998
Reported by:	rashey at superbox.pl
MFC after:	3 days

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo  ->  passed  [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

Reviewers: #network, kp, bz

Reviewed By: kp

Subscribers: imp, ae

Differential Revision: https://reviews.freebsd.org/D29116
2021-03-10 21:10:14 +00:00
Alexander V. Chernikov
5964172837 Simplify ifa/ifp refcounting in the routing stack.
The routing stack control depends on quite a tree of functions to
 determine the proper attributes of a route such as a source address (ifa)
 or transmit ifp of a route.

When actually inserting a route, the stack needs to ensure that ifa and ifp
 points to the entities that are still valid.
Validity means slightly more than just pointer validity - stack need guarantee
 that the provided objects are not scheduled for deletion.

Currently, callers either ignore it (most ifp parts, historically) or try to
 use refcounting (ifa parts). Even in case of ifa refcounting it's not always
 implemented in fully-safe manner. For example, some codepaths inside
 rt_getifa_fib() are referencing ifa while not holding any locks, resulting in
 possibility of referencing scheduled-for-deletion ifa.

Instead of trying to fix all of the callers by enforcing proper refcounting,
 switch to a different model.
As the rib_action() already requires epoch, do not require any stability guarantees
 other than the epoch-provided one.
Use newly-added conditional versions of the refcounting functions
 (ifa_try_ref(), if_try_ref()) and fail if any of these fails.

Reviewed by:	donner
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28837
2021-02-22 23:37:59 +00:00
Alexander V. Chernikov
a375ec52a7 Fix ifa refcount leak during route addition.
Reported by:	rstone
Reviewed by:	rstone
MFC after:	1 day
2021-02-13 00:06:14 +00:00
Alexander V. Chernikov
8170a7d438 Fix interface route addition with net/bird.
The case of adding interface route by specifying interface
 address as the gateway was missed during code refactoring.
Re-add it back by copying non-AF_LINK gateway data when RTF_GATEWAY
 is not set.

Reviewed by:	donner
MFC after:	3 days
2021-02-12 19:45:35 +00:00
Alexander V. Chernikov
adc4ea97bd Turn off forgotten multipath debug messages
Reported by:	mike tancsa<mike at sentex.net>
MFC after:	3 days
2021-02-08 21:42:20 +00:00
Alexander V. Chernikov
eb0b1b33d5 Enable multipath routing by default.
ROUTE_MPATH was added to the GENERIC kernel in r368648.

According to the plan in D27428, it was enabled with `net.route.multipath` sysctl set to 0.
Given enough time has passed, this change enables route multipath by default.

The goal is to ship FreeBSD 13 with multipath turned on.

Reviewed By: donner, olivier
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28423
2021-02-03 08:49:58 +00:00
Alexander V. Chernikov
78c93a1721 Use process fib for inet/inet6 fib_algo sysctls.
This allows to set/query fib algo for non-default fibs.

MFC after:	3 days
2021-01-31 10:50:08 +00:00
Alexander V. Chernikov
151ec796a2 Fix the design problem with delayed algorithm sync.
Currently, if the immutable algorithm like bsearch or radix_lockless
 receives rtable update notification, it schedules algorithm rebuild.
This rebuild is executed by the callout after ~50 milliseconds.

It is possible that a script adding an interface address and than route
 with the gateway bound to that address will fail. It can happen due
 to the fact that fib is not updated by the time the route addition
 request arrives.

Fix this by allowing synchronous algorithm rebuilds based on certain
 conditions. By default, these conditions assume:
1) less than net.route.algo.fib_sync_limit=100 routes
2) routes without gateway.

* Move algo instance build entirely under rib WLOCK.
 Rib lock is only used for control plane (except radix algo, but there
  are no rebuilds).
* Add rib_walk_ext_locked() function to allow RIB iteration with
 rib lock already held.
* Fix rare potential callout use-after-free for fds by binding fd
 callout to the relevant rib rmlock. In that case, callout_stop()
 under rib WLOCK guarantees no callout will be executed afterwards.

MFC after:	3 days
2021-01-30 23:25:57 +00:00
Alexander V. Chernikov
dd9163003c Add rib_subscribe_locked() and rib_unsubsribe_locked() to support
subscriptions during RIB modifications.
Add new subscriptions to the beginning of the lists instead of
 the end. This fixes the situation when new subscription is created
 int the callback for the existing subscription, leading to the
 subscription notification handler pick it.

MFC after: 3 days
2021-01-30 23:25:57 +00:00
Alexander V. Chernikov
ab6d9aaed7 Move business logic from rebuild_fd_callout() into rebuild_fd().
This simplifies code a bit and allows for future non-callout
 callers to request rebuild.

MFC after:	3 days
2021-01-30 23:25:57 +00:00
Alexander V. Chernikov
f8b7ebea49 Improve fib_algo debug messages.
* Move per-prefix debug lines under LOG_DEBUG2
* Create fib instance counter to distingush log messages between
 instances
* Add more messages on rebuild reason.

MFC after:	3 days
2021-01-30 23:25:56 +00:00
Alexander V. Chernikov
9d6567bc30 Fix panic on vnet creation if fib algo has been set to fixed value.
Make fixed algo property per-VNET instead of global.
2021-01-17 20:32:25 +00:00
Alexander V. Chernikov
81728a538d Split rtinit() into multiple functions.
rtinit[1]() is a function used to add or remove interface address prefix routes,
  similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
 in reality.

1) IPv6 code does not use it for the ifa routes. There is a separate layer,
  nd6_prelist_(), providing interface for maintaining interface routes. Its part,
  responsible for the actual route table interaction, mimics rtenty() code.

2) rtinit tries to combine multiple actions in the same function: constructing
  proper route attributes and handling iterations over multiple fibs, for the
  non-zero net.add_addr_allfibs use case. It notably increases the code complexity.

3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
 for p2p connections, host routes and p2p routes are handled in the same way.
 Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
 It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
 aliases.

4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
 complicating rib_action() implementation.

5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
 ifa messages in certain corner cases.

To address all these points, the following has been done:

* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
 dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
 is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
 responsible for the actual route notifications.

Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses

Differential revision:	https://reviews.freebsd.org/D28186
2021-01-16 22:42:41 +00:00
Alexander V. Chernikov
685de460bc Use static initializers for fib algo to shift initialization
to ealier stage. This allows to register modules loaded at
 boot time.

Reported by:	olivier
2021-01-11 00:16:54 +00:00
Alexander V. Chernikov
d68cf57b7f Refactor rt_addrmsg() and rt_routemsg().
Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate  function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011
2021-01-07 19:38:19 +00:00
Alexander V. Chernikov
9c0ff6a8bb Remove now-unused RT_GATEWAY* definitions.
They were used to simplify nexthop transition, hence not needed
 anymore.
2021-01-04 21:45:46 +00:00
Ryan Libby
833dbf1e22 route: quiet -Wredundant-decls
Remove declaration duplicated in
f5baf8bb12

Reviewed by:	melifaro
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27790
2020-12-27 16:32:27 -08:00
Alexander V. Chernikov
f733d9701b Fix default route handling in radix4_lockless algo.
Improve nexthop debugging.

Reported by:	Florian Smeets <flo at smeets.xyz>
2020-12-26 22:51:02 +00:00
Alexander V. Chernikov
4e19e0d92a Use light-weight versions of routing lookup functions in ng_netflow.
Use recently-added combination of `fib[46]_lookup_rt()` which
 returns rtentry & raw nexthop with `rt_get_inet[6]_plen()` which
 returns address/prefix length of prefix inside `rt`.

Add `nhop_select_func()` wrapper around inlined `nhop_select()` to
 allow callers external to the routing subsystem select the proper
 nexthop from the multipath group without including internal headers.

New calls does not require reference counting objects and reduce
 the amount of copied/processed rtentry data.

Differential Revision: https://reviews.freebsd.org/D27675
2020-12-26 11:27:38 +00:00
Alexander V. Chernikov
f5baf8bb12 Add modular fib lookup framework.
This change introduces framework that allows to dynamically
 attach or detach longest prefix match (lpm) lookup algorithms
 to speed up datapath route tables lookups.

Framework takes care of handling initial synchronisation,
 route subscription, nhop/nhop groups reference and indexing,
 dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
 picking the best matching algorithm on-the-fly based on the
 amount of routes in the routing table.

Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.

The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
  small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
  tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)

Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
  radix4: ~20mpps
  radix4_lockless: ~24.8mpps
  bsearch4: ~69mpps
  dpdk_lpm4: ~67 mpps
700k routes:
  radix4_lockless: 3.3mpps
  dpdk_lpm4: 46mpps

IPv6:
8 routes:
  radix6_lockless: ~20mpps
  dpdk_lpm6: ~70mpps
100k routes:
  radix6_lockless: 13.9mpps
  dpdk_lpm6: 57mpps

Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)

Control:
Framwork adds the following runtime sysctls:

List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless

Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.

Differential Revision: https://reviews.freebsd.org/D27401
2020-12-25 11:33:17 +00:00
Alexander V. Chernikov
df9053920f Add IPv4/IPv6 rtentry prefix accessors.
Multiple consumers like ipfw, netflow or new route lookup algorithms
 need to get the prefix data out of struct rtentry.
Instead of providing direct access to the rtentry, create IPv4/IPv6
 accessors to abstract struct rtentry internals and avoid including
 internal routing headers for external consumers.

While here, move struct route_nhop_data to the public header, so external
 customers can actually use lookup functions returning rt&nhop data.

Differential Revision:	https://reviews.freebsd.org/D27416
2020-12-03 22:23:57 +00:00
Alexander V. Chernikov
d1d941c5b9 Remove RADIX_MPATH config option.
ROUTE_MPATH is the new config option controlling new multipath routing
 implementation. Remove the last pieces of RADIX_MPATH-related code and
 the config option.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D27244
2020-11-29 19:43:33 +00:00
Alexander V. Chernikov
3b1654cb14 Introduce rib_walk_ext_internal() to allow iteration with rnh pointer.
This solves the case when rib is not yet attached/detached to/from the
 system rib array.

Differential Revision:	https://reviews.freebsd.org/D27406
2020-11-29 13:54:49 +00:00
Alexander V. Chernikov
f47fa26065 Add nhop_ref_any() to unify referencing nhop or nexthop group.
It allows code within routing subsystem to transparently reference nexthops
 and nexthop groups, similar to nhop_free_any(), abstracting ROUTE_MPATH
 details.

Differential Revision:	https://reviews.freebsd.org/D27410
2020-11-29 13:52:06 +00:00