This change introduces framework that allows to dynamically
attach or detach longest prefix match (lpm) lookup algorithms
to speed up datapath route tables lookups.
Framework takes care of handling initial synchronisation,
route subscription, nhop/nhop groups reference and indexing,
dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
picking the best matching algorithm on-the-fly based on the
amount of routes in the routing table.
Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.
The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
for large-fib (D27412)
Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
radix4: ~20mpps
radix4_lockless: ~24.8mpps
bsearch4: ~69mpps
dpdk_lpm4: ~67 mpps
700k routes:
radix4_lockless: 3.3mpps
dpdk_lpm4: 46mpps
IPv6:
8 routes:
radix6_lockless: ~20mpps
dpdk_lpm6: ~70mpps
100k routes:
radix6_lockless: 13.9mpps
dpdk_lpm6: 57mpps
Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)
Control:
Framwork adds the following runtime sysctls:
List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless
Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.
Differential Revision: https://reviews.freebsd.org/D27401
When copying sockaddrs out to userspace, we pad them to a multiple of
the platform alignment (sizeof(long)). However, some sockaddr sizes,
such as struct sockaddr_dl, are not an integer multiple of the
alignment, so we may end up copying out uninitialized bytes.
Fix this by always bouncing through a pre-zeroed sockaddr_storage.
Reported by: KASAN
Reviewed by: melifaro
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27729
This improves cache behaviour by not writing to the same variable from
multiple cores simultaneously.
pf_state is only used in the kernel, so can be safely modified.
Reviewed by: Lutz Donnerhacke, philip
MFC after: 1 week
Sponsed by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27661
The algorithm we use to update checksums only works correctly if the
updated data is aligned on 16-bit boundaries (relative to the start of
the packet).
Import the OpenBSD fix for this issue.
PR: 240416
Obtained from: OpenBSD
MFC after: 1 week
Reviewed by: tuexen (previous version)
Differential Revision: https://reviews.freebsd.org/D27696
And switch from int to bool while at it.
Reviewed by: melifaro@
Differential Revision: https://reviews.freebsd.org/D27725
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
The packet, if processed at this point, was already parsed to be UDP
directed to a vxlan port.
Connect-X 4+ does not provide easy method to infer which parser
processed the packet, so driver cannot set the flag without a lot of
efforts which are only to satisfy the formal requirements.
Reviewed by: bryanv, np
Sponsored by: Mellanox Technologies/NVidia Networking
Differential revision: https://reviews.freebsd.org/D27449
MFC after: 1 week
rtsock code was build around the assumption that each rtentry record
in the system radix tree is a ready-to-use sockaddr. This assumptions
turned out to be not quite true:
* masks have their length tweaked, so we have rtsock_fix_netmask() hack
* IPv6 addresses have their scope embedded, so we have another explicit
deembedding hack.
Change the code to decouple rtentry internals from rtsock code using
newly-created rtentry accessors. This will allow to eventually eliminate
both of the hacks and change rtentry dst/mask format.
Differential Revision: https://reviews.freebsd.org/D27451
struct ifconf and struct ifreq use the odd style "struct<tab>foo".
struct ifdrv seems to have tried to follow this but was committed with
spaces in place of most tabs resulting in "struct<space><space>ifdrv".
MFC after: 3 days
In some error paths we would fail to detach from the iflib taskqueue
groups. Also move the detach code into its own subroutine instead of
duplicating it.
Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27342
Multiple consumers like ipfw, netflow or new route lookup algorithms
need to get the prefix data out of struct rtentry.
Instead of providing direct access to the rtentry, create IPv4/IPv6
accessors to abstract struct rtentry internals and avoid including
internal routing headers for external consumers.
While here, move struct route_nhop_data to the public header, so external
customers can actually use lookup functions returning rt&nhop data.
Differential Revision: https://reviews.freebsd.org/D27416
Revert the mitigation code for the vnet/epair cleanup race (done in r365457).
r368237 introduced a more reliable fix.
MFC after: 2 weeks
Sponsored by: Modirum MDPay
When destroying a vnet and an epair (with one end in the vnet) we often
panicked. This was the result of the destruction of the epair, which destroys
both ends simultaneously, happening while vnet_if_return() was moving the
struct ifnet to its home vnet. This can result in a freed ifnet being re-added
to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
used.
Prevent this race by ensuring that vnet_if_return() cannot run at the same time
as if_detach() or epair_clone_destroy().
PR: 238870, 234985, 244703, 250870
MFC after: 2 weeks
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27378
ROUTE_MPATH is the new config option controlling new multipath routing
implementation. Remove the last pieces of RADIX_MPATH-related code and
the config option.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D27244
It allows code within routing subsystem to transparently reference nexthops
and nexthop groups, similar to nhop_free_any(), abstracting ROUTE_MPATH
details.
Differential Revision: https://reviews.freebsd.org/D27410
No functional changes.
* Make lookup path of fib<4|6>_lookup_debugnet() separate functions
(fib<46>_lookup_rt()). These will be used in the control plane code
requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
This change simplifies the switch to alternative lookup implementations,
which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys
Differential Revision: https://reviews.freebsd.org/D27405
The resulting KPI can be used by routing table consumers to estimate the required
scale for route table export.
* Add tracking for rib routes
* Add accessors for number of nexthops/nexthop objects
* Simplify rib_unsubscribe: store rnh we're attached to instead of requiring it up
again on destruction. This helps in the cases when rnh is not linked yet/already unlinked.
Differential Revision: https://reviews.freebsd.org/D27404
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.
We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.
We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.
Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().
Reviewed by: melifaro
MFC after: 2 weeks
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27279
It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27278
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_forach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
consistent and prepare for upcoming iterator optimizations.
Differential Revision: https://reviews.freebsd.org/D27219
Don't use "new" as an identifier, and add explicit casts from void *.
As a general policy, FreeBSD doesn't make any C++ compatibility
guarantees for kernel headers like it does for userland, but it is a
small effort to do so in this case, to the benefit of a downstream
consumer (NetApp).
Reviewed by: rscheff
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27286
When links come and go, lacp goes into a "suppress distributing" mode
where it drops traffic for 3 seconds. When in this mode, lagg/lacp
historiclally drops traffic with ENETDOWN. That return value causes TCP
to close any connection where it gets that value back from the lower
parts of the stack. This means that any TCP connection with active
traffic during a 3-second windown when an LACP link comes or goes
would get closed.
TCP treats return values of ENOBUFS as transient errors, and re-schedules
transmission later. So rather than returning ENETDOWN, lets
return ENOBUFS instead. This allows TCP connections to be preserved.
I've tested this by repeatedly bouncing links on a Netlfix CDN server
under a moderate (20Gb/s) load and overved ENOBUFS reported back to
the TCP stack (as reported by a RACK TCP sysctl).
Reviewed by: jhb, jtl, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27188
lagg(4) replaces if_output method of its child interfaces and expects
that this method can be called only by child interfaces. But it is
possible that lagg_port_output() could be called by children of child
interfaces. In this case ifnet's if_lagg field is NULL. Add check that
lp is not NULL.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
Add the missing static keyword present in the declaration.
Reviewed by: melifaro
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27024
The goal of the fib support is to provide multiple independent
routing tables, isolated from each other.
net.add_addr_allfibs default tries to shift gears in the opposite
direction, unconditionally inserting all addresses to all of the fibs.
There are use cases when this is necessary, however this is not a
default expected behaviour, especially compared to other implementations.
Provide WARNING message for the setups with multiple fibs to notify
potential users of the feature.
Differential Revision: https://reviews.freebsd.org/D26076
Stop advancing counter past the current iteration number at the start
of iteration. This removes the need of subtracting one when
calculating index for copyout, and arguably fixes off-by-one reporting
of copied out elements when copyout failed.
Reviewed by: hselasky
Sponsored by: Mellanox Technologies / NVidia Networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27073
Use consistent output format for hex.
Print both media and mask where relevant.
Reviewed by: hselasky
Sponsored by: Mellanox Technologies/NVidia Networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27034
This can be used to detect if an ethernet address is specifically an
IPv6 multicast address, defined in accordance to RFC 2464.
ETHER_IS_MULTICAST is still preferred in the general case.
Reviewed by: ae
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26611
This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted into passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.
PR: 248652
Reported by: sg@efficientip.com
MFC with: r367093
The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
- The netmap_tx_irq() function gets called by iflib_timer(), which
gets scheduled with tick granularity (hz). This is not frequent
enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
result is that the transmitting netmap application is not woken
up fast enough to saturate the link with small packets.
- The iflib_timer() functions also calls isc_txd_credits_update()
to ask for more TX completion updates. However, this violates
the netmap requirement that only txsync can access the TX queue
for datapath operations. Only netmap_tx_irq() may be called out
of the txsync context.
This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.
This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.
Also, fix the timers usage to make sure that each queue is serviced
by a different CPU.
PR: 248652
Reported by: sg@efficientip.com
MFC after: 2 weeks