Commit Graph

351 Commits

Author SHA1 Message Date
Alexander V. Chernikov
4b631fc832 routing: fix source address selection rules for IPv4 over IPv6.
Current logic always selects an IFA of the same family from the
 outgoing interfaces. In IPv4 over IPv6 setup there can be just
 single non-127.0.0.1 ifa, attached to the loopback interface.

Create a separate rt_getifa_family() to handle entire ifa selection
 for the IPv4 over IPv6.

Differential Revision: https://reviews.freebsd.org/D31868
MFC after:	1 week
2021-09-07 21:41:05 +00:00
Alexander V. Chernikov
d98954e229 routing: Bring back the ability to specify transmit interface via its name.
Some software references outgoing interfaces by specifying name instead of
 index.

Use rti_ifp from rt_addrinfo if provided instead of always using
 address interface when constructing nexthop.

PR: 		255678
Reported by:	martin.larsson2 at gmail.com
MFC after:	1 week
2021-08-29 20:05:14 +00:00
Rozhuk Ivan
a75819461e devctl: add ADDR_ADD and ADDR_DEL devctl event for IFNET
Add devd event on network iface address add/remove.  Can be used to
automate actions on any address change.

Reviewed by:		imp@ (and minor style tweaks)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D30840
2021-06-23 10:26:56 -06:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Alexander V. Chernikov
b1d63265ac Flush remaining routes from the routing table during VNET shutdown.
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
 VNET.

PR:	253998
Reported by:	rashey at superbox.pl
MFC after:	3 days

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo  ->  passed  [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

Reviewers: #network, kp, bz

Reviewed By: kp

Subscribers: imp, ae

Differential Revision: https://reviews.freebsd.org/D29116
2021-03-10 21:10:14 +00:00
Alexander V. Chernikov
5964172837 Simplify ifa/ifp refcounting in the routing stack.
The routing stack control depends on quite a tree of functions to
 determine the proper attributes of a route such as a source address (ifa)
 or transmit ifp of a route.

When actually inserting a route, the stack needs to ensure that ifa and ifp
 points to the entities that are still valid.
Validity means slightly more than just pointer validity - stack need guarantee
 that the provided objects are not scheduled for deletion.

Currently, callers either ignore it (most ifp parts, historically) or try to
 use refcounting (ifa parts). Even in case of ifa refcounting it's not always
 implemented in fully-safe manner. For example, some codepaths inside
 rt_getifa_fib() are referencing ifa while not holding any locks, resulting in
 possibility of referencing scheduled-for-deletion ifa.

Instead of trying to fix all of the callers by enforcing proper refcounting,
 switch to a different model.
As the rib_action() already requires epoch, do not require any stability guarantees
 other than the epoch-provided one.
Use newly-added conditional versions of the refcounting functions
 (ifa_try_ref(), if_try_ref()) and fail if any of these fails.

Reviewed by:	donner
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28837
2021-02-22 23:37:59 +00:00
Alexander V. Chernikov
cb984c62d7 Fix multipath support for rib_lookup_info().
The initial plan was to remove rib_lookup_info() before
 FreeBSD 13. As several customers are still remaining,
 fix rib_lookup_info() for the multipath use case.
2021-01-29 23:14:24 +00:00
Alexander V. Chernikov
81728a538d Split rtinit() into multiple functions.
rtinit[1]() is a function used to add or remove interface address prefix routes,
  similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
 in reality.

1) IPv6 code does not use it for the ifa routes. There is a separate layer,
  nd6_prelist_(), providing interface for maintaining interface routes. Its part,
  responsible for the actual route table interaction, mimics rtenty() code.

2) rtinit tries to combine multiple actions in the same function: constructing
  proper route attributes and handling iterations over multiple fibs, for the
  non-zero net.add_addr_allfibs use case. It notably increases the code complexity.

3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
 for p2p connections, host routes and p2p routes are handled in the same way.
 Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
 It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
 aliases.

4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
 complicating rib_action() implementation.

5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
 ifa messages in certain corner cases.

To address all these points, the following has been done:

* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
 dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
 is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
 responsible for the actual route notifications.

Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses

Differential revision:	https://reviews.freebsd.org/D28186
2021-01-16 22:42:41 +00:00
Alexander V. Chernikov
d68cf57b7f Refactor rt_addrmsg() and rt_routemsg().
Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate  function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011
2021-01-07 19:38:19 +00:00
Alexander V. Chernikov
f5baf8bb12 Add modular fib lookup framework.
This change introduces framework that allows to dynamically
 attach or detach longest prefix match (lpm) lookup algorithms
 to speed up datapath route tables lookups.

Framework takes care of handling initial synchronisation,
 route subscription, nhop/nhop groups reference and indexing,
 dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
 picking the best matching algorithm on-the-fly based on the
 amount of routes in the routing table.

Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.

The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
  small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
  tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)

Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
  radix4: ~20mpps
  radix4_lockless: ~24.8mpps
  bsearch4: ~69mpps
  dpdk_lpm4: ~67 mpps
700k routes:
  radix4_lockless: 3.3mpps
  dpdk_lpm4: 46mpps

IPv6:
8 routes:
  radix6_lockless: ~20mpps
  dpdk_lpm6: ~70mpps
100k routes:
  radix6_lockless: 13.9mpps
  dpdk_lpm6: 57mpps

Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)

Control:
Framwork adds the following runtime sysctls:

List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless

Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.

Differential Revision: https://reviews.freebsd.org/D27401
2020-12-25 11:33:17 +00:00
Alexander V. Chernikov
d1d941c5b9 Remove RADIX_MPATH config option.
ROUTE_MPATH is the new config option controlling new multipath routing
 implementation. Remove the last pieces of RADIX_MPATH-related code and
 the config option.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D27244
2020-11-29 19:43:33 +00:00
Alexander V. Chernikov
7511a63825 Refactor rib iterator functions.
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_forach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
 consistent and prepare for upcoming iterator optimizations.

Differential Revision:	https://reviews.freebsd.org/D27219
2020-11-22 20:21:10 +00:00
Alexander V. Chernikov
bad6b23606 Move all ifaddr route creation business logic to net/route/route_ifaddr.c
Differential Revision:	https://reviews.freebsd.org/D26318
2020-11-08 11:12:00 +00:00
Alexander V. Chernikov
fedeb08b6a Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
 relative weights and a dataplane-optimized structure to enable
 efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
 gets compiled during group creation and is basically an array of
 nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
 nexthop or nexthop group. They are distinguished by the presense of
 NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
 should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
 This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx  NhIdx     Weight   Slots                                 Gateway     Netif  Refcnt
1         ------- ------- ------- --------------------------------------- ---------       1
              13      10       1                             2001:db8::2     vlan2
              14      20       2                             2001:db8::3     vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by:	olivier
Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26449
2020-10-03 10:47:17 +00:00
Alexander V. Chernikov
2259a03020 Rework part of routing code to reduce difference to D26449.
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
 as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
 that rtentry can contain multiple nexthops.

Differential Revision:	https://reviews.freebsd.org/D26497
2020-09-21 20:02:26 +00:00
Alexander V. Chernikov
05aca418f4 Consistently use the same gateway when adding/deleting interface routes.
Use the same link-level gateway when adding or deleting interface routes.
This helps nexthop checking in the upcoming multipath changes.

Differential Revision:	https://reviews.freebsd.org/D26317
2020-09-07 10:13:54 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Alexander V. Chernikov
a624ca3dff Move net/route/shared.h definitions to net/route/route_var.h.
No functional changes.

net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
 of functions and structures between the routing subsystem components. At
 that time route_var.h was included by many files external to the routing
 subsystem, which largerly defeats its purpose.

As currently this is not the case anymore and amount of route_var.h includes
 is roughly the same as shared.h, retire the latter in favour of the former.
2020-08-28 22:50:20 +00:00
Alexander V. Chernikov
592d300e34 Remove RT_LOCK mutex from rte.
rtentry lock traditionally served 2 purposed: first was protecting refcounts,
 the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
 the need for the latter reduced.
To be more precise, the following rte field are mutable:

rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.

rt_nhop and rt_weight (addressed in this review) are read under rib lock and
 stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
 is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.

rt_expire and rte_flags reads will be dealt in a separate reviews soon.

Differential Revision:	https://reviews.freebsd.org/D26162
2020-08-24 20:23:34 +00:00
Alexander V. Chernikov
eb1c7adb70 Finish r364492 by renaming rt_flags to rte_flags for multipath code. 2020-08-22 20:02:40 +00:00
Alexander V. Chernikov
93bfd365d2 Rename rt_flags to rte_flags && reduce number of rt_nhop accesses.
No functional changes.

Most of the routing flags are stored in the netxtop instead of rtentry.
Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code
 checking routing flags.

In the new multipath code, rt->rt_nhop may actually point to nexthop group
 instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->...
 accesses.

Differential Revision:	https://reviews.freebsd.org/D26156
2020-08-22 19:30:56 +00:00
Alexander V. Chernikov
f5247a232a Make net.fibs growable.
Allow to dynamically grow the amount of fibs in each vnet.

This change alters current behavior. Currently, if one defines
 ROUTETABLES > 1 in the kernel config, each vnet will be created
 with the number of fibs defined in the kernel config.
 After this commit vnets will be created with fibs=1.

Dynamic net.fibs is not compatible with net.add_addr_allfibs.
 The plan is to deprecate the latter and make
 net.add_addr_allfibs=0 default behaviour.

Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26062
2020-08-21 21:34:52 +00:00
Alexander V. Chernikov
2f23f45b20 Simplify dom_<rtattach|rtdetach>.
Remove unused arguments from dom_rtattach/dom_rtdetach functions and make
  them return/accept 'struct rib_head' instead of 'void **'.
Declare inet/inet6 implementations in the relevant _var.h headers similar
  to domifattach / domifdetach.
Add rib_subscribe_internal() function to accept subscriptions to the rnh
  directly.

Differential Revision:	https://reviews.freebsd.org/D26053
2020-08-14 21:29:56 +00:00
Alexander V. Chernikov
6cbadc4234 Move rtzone handling code to net/route_ctl.c
After moving the route control plane code from net/route.c,
 all rtzone users ended up being in net/route_ctl.c.
Move uma(9) rtzone setup/teardown code to net/route_ctl.c as well
 to have everything in a single place.

While here, remove custom initializers from the zone.
It was added originally to avoid setup/teardown of costy per-cpu couters.
With these counters removed, the only remaining job was avoiding rte mutex
 setup/teardown. Mutex setup is relatively cheap. Additionally, this mutex
 will soon be removed. With that in mind, there is no sense in keeping
 custom zone callbacks.

Differential Revision:	https://reviews.freebsd.org/D26051
2020-08-13 18:35:29 +00:00
Mitchell Horne
f7d79f6c6d Correctly set error in rt_mpath_unlink
It is possible for rn_delete() to return NULL. If this happens, then set
*perror to ESRCH, as is done in the rest of the function.

Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D25871
2020-08-12 16:43:20 +00:00
Alexander V. Chernikov
e1c05fd290 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
 to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
2020-07-21 19:56:13 +00:00
Alexander V. Chernikov
725871230d Temporarly revert r363319 to unbreak the build.
Reported by:	CI
Pointy hat to: melifaro
2020-07-19 10:53:15 +00:00
Alexander V. Chernikov
8cee15d9e4 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25546
2020-07-19 09:29:27 +00:00
Alexander V. Chernikov
edc37a66e3 Add destructor for the rib subscription system to simplify users code.
Subscriptions are planned to be used by modules such as route lookup engines.
In that case that's the module task to properly unsibscribe before detach.
However, the in-kernel customer - inet6 wants to track default route changes.
To avoid having inet6 store per-fib subscriptions, handle automatic
 destruction internally.

Differential Revision:	https://reviews.freebsd.org/D25614
2020-07-12 11:18:09 +00:00
Alexander V. Chernikov
da187ddb3d * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D25067
2020-06-01 20:49:42 +00:00
Alexander V. Chernikov
e7403d0230 Revert r361704, it accidentally committed merged D25067 and D25070. 2020-06-01 20:40:40 +00:00
Alexander V. Chernikov
79674562b8 * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:	ae
Differential Revision: https://reviews.freebsd.org/D25067
2020-06-01 20:32:02 +00:00
Alexander V. Chernikov
cb86ca48bf Unlock rtentry before calling for epoch(9) destruction as the destruction
may happen immediately, leading to panic.

Reported by:	bdragon
2020-05-28 07:23:27 +00:00
Alexander V. Chernikov
4d2c2509f2 Move <add|del|change>_route() functions to route_ctl.c in preparation of
multipath control plane changed described in D24141.

Currently route.c contains core routing init/teardown functions, route table
 manipulation functions and various helper functions, resulting in >2KLOC
 file in total. This change moves most of the route table manipulation parts
 to a dedicated file, simplifying planned multipath changes and making
 route.c more manageable.

Differential Revision:	https://reviews.freebsd.org/D24870
2020-05-23 19:06:57 +00:00
Alexander V. Chernikov
a82f62ec2d Remove refcounting from rtentry.
After making rtentry reclamation backed by epoch(9) in r361409, there is
 no reason in keeping reference counting code.

Differential Revision:	https://reviews.freebsd.org/D24867
2020-05-23 12:15:47 +00:00
Alexander V. Chernikov
2bbab0af6d Use epoch(9) for rtentries to simplify control plane operations.
Currently the only reason of refcounting rtentries is the need to report
 the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
 is used by some of the customers to get the matching prefix along
 with nexthops, in more efficient way.

The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
 nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .

Differential Revision:	https://reviews.freebsd.org/D24866
2020-05-23 10:21:02 +00:00
Alexander V. Chernikov
4a6ee281d9 Remove unused rnh_close callback from rtable & cleanup depends.
rnh_close callbackes was used by the in[6]_clsroute() handlers,
 doing cleanup in the route cloning code. Route cloning was eliminated
 somewhere around r186119. Last callback user was eliminated in r186215,
 11 years ago.

Differential Revision:	https://reviews.freebsd.org/D24793
2020-05-11 06:09:18 +00:00
Alexander V. Chernikov
d223372545 Remove rtalloc1(_fib) KPI.
Last user of rtalloc1() KPI has been eliminated in rS360631.
As kernel is now fully switched to use new routing KPI defined in
rS359823, remove old lookup functions.

Differential Revision:	https://reviews.freebsd.org/D24776
2020-05-10 09:34:48 +00:00
Alexander V. Chernikov
656442a718 Embed dst sockaddr into rtentry and remove rte packet counter
Currently each rtentry has dst&gateway allocated separately from another zone,
 bloating cache accesses.

Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning,
 leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64).
Fields needed for the datapath are destination sockaddr and rt_nhop.

So far it doesn't look like there is other routable addressing protocol other
 than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes.
With that in mind, embed dst into struct rtentry, making the first 24 bytes
 of rtentry within 128 bytes. That is enough to make IPv6 address within first
 128 bytes.

It is still pretty easy to add code for supporting separately-allocated dst,
 however it doesn't make a lot of sense in having such code without a use case.

As rS359823 moved the gateway to the nexthop structure, the dst embedding change
 removes the need for any additional allocations done by rt_setgate().

Lastly, as a part of cleanup, remove counter(9) allocation code, as this field
 is not used in packet processing anymore.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24669
2020-05-08 21:06:10 +00:00
Alexander V. Chernikov
682b902daf Add rib_lookup() sockaddr lookup wrapper and make ifa_ifwithroute use it.
Create rib_lookup() wrapper around per-af dataplane lookup functions.
This will help in the cases of having control plane af-agnostic code.

Switch ifa_ifwithroute() to use this function instead of rtalloc1().

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24731
2020-05-07 08:11:36 +00:00
Alexander V. Chernikov
9e02229580 Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields.
After converting routing subsystem customers to use nexthop objects
 defined in r359823, some fields in struct rtentry became unused.

This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
 along with the code initializing and updating these fields.

Cleanup of the remaining fields will be addressed by D24669.

This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
 resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
 nexthop and tries to update rte nhop pointer. Only last part is done under
 WLOCK.
In the hypothetical scenarious where multiple rtsock clients
 repeatedly issue RTM_CHANGE requests for the same route, route may get
 updated between read and update operation. This is addressed by retrying
 the operation multiple (3) times before returning failure back to the
 caller.

Differential Revision:	https://reviews.freebsd.org/D24666
2020-05-04 14:31:45 +00:00
Alexander V. Chernikov
8c61eb2107 Convert more rtentry field accesses into nhop fields accesses.
Continue routing subsystem conversion to nhop objects defined in r359823.
Use fields from nhop structure instead of "struct rtentry" fields.
This is one of the last changes prior to removing rt_ifp, rt_ifa,
 rt_gateway and rt_mtu from struct rtentry.

Differential Revision:	https://reviews.freebsd.org/D24609
2020-04-29 21:54:09 +00:00
Alexander V. Chernikov
74787ef47b Add nhop to the ifa_rtrequest() callback.
With the upcoming multipath changes described in D24141,
 rt->rt_nhop can potentially point to a nexthop group instead of
 an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
 to pass changed nhop directly.

Differential Revision:	https://reviews.freebsd.org/D24604
2020-04-29 19:28:56 +00:00
Alexander V. Chernikov
e7d8af4f65 Move route_temporal.c and route_var.h to net/route.
Nexthop objects implementation, defined in r359823,
 introduced sys/net/route directory intended to hold all
 routing-related code. Move recently-introduced route_temporal.c and
 private route_var.h header there.

Differential Revision:	https://reviews.freebsd.org/D24597
2020-04-28 19:14:09 +00:00
Alexander V. Chernikov
1b0051bada Eliminate now-unused parts of old routing KPI.
r360292 switched most of the remaining routing customers to a new KPI,
 leaving a bunch of wrappers for old routing lookup functions unused.

Remove them from the tree as a part of routing cleanup.

Differential Revision:	https://reviews.freebsd.org/D24569
2020-04-28 07:25:34 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Alexander V. Chernikov
aaad3c4fca Convert rtentry field accesses into nhop field accesses.
One of the goals of the new routing KPI defined in r359823 is to entirely
 hide`struct rtentry` from the consumers. It will allow to improve routing
 subsystem internals and deliver more features much faster.

This commit is mostly mechanical change to eliminate direct struct rtentry
 field accesses.

The only notable difference is AF_LINK gateway encoding.

AF_LINK gw is used in routing stack for operations with interface routes
 and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
 is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
 datapath to verify address scope for link-local interfaces.

Kernel uses struct sockaddr_dl for this type of gateway. This structure
 allows for specifying rich interface data, such as mac address and interface
 name. However, this results in relatively large structure size - 52 bytes.
Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside
 in the first 8 bytes of the structure.

In the new KPI, struct nhop_object tries to be cache-efficient, hence
 embodies gateway address inside the structure. In the AF_LINK case it
 stores stortened version of the structure - struct sockaddr_dl_short,
 which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
 gateway will not be used in the kernel at all, leaving rtsock as the only
 potential concern.

The difference in rtsock reporting:

(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK>
 10.0.0.0 link#5 255.255.255.0

(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK>
 10.0.0.0 link#5 255.255.255.0

Note 40 bytes different (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.

It is worth noting that these particular messages (interface routes) are mostly
 ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway

More detailed overview on how rtsock messages are used by the
 routing daemons to reconstruct the kernel view, can be found in D22974.

Differential Revision:	https://reviews.freebsd.org/D24519
2020-04-23 08:04:20 +00:00
Alexander V. Chernikov
539642a29d Add nhop parameter to rti_filter callback.
One of the goals of the new routing KPI defined in r359823 is to
 entirely hide`struct rtentry` from the consumers. It will allow to
 improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
 struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
 to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
 routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24440
2020-04-16 17:20:18 +00:00
Alexander V. Chernikov
dd4776f0cc Reorganise nd6 notification code to avoid direct rtentry field access.
One of the goals of the new routing KPI defined in r359823 is to entirely hide
 `struct rtentry` from the consumers. Doing so will allow to improve routing
 subsystem internals and deliver features more easily. This change is one of
  the ongoing changes to eliminate direct struct rtentry field accesses.

It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification
 code to avoid accessing most of the rtentry fields.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24404
2020-04-14 22:48:33 +00:00