It allows code within routing subsystem to transparently reference nexthops
and nexthop groups, similar to nhop_free_any(), abstracting ROUTE_MPATH
details.
Differential Revision: https://reviews.freebsd.org/D27410
No functional changes.
* Make lookup path of fib<4|6>_lookup_debugnet() separate functions
(fib<46>_lookup_rt()). These will be used in the control plane code
requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
This change simplifies the switch to alternative lookup implementations,
which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys
Differential Revision: https://reviews.freebsd.org/D27405
The resulting KPI can be used by routing table consumers to estimate the required
scale for route table export.
* Add tracking for rib routes
* Add accessors for number of nexthops/nexthop objects
* Simplify rib_unsubscribe: store rnh we're attached to instead of requiring it up
again on destruction. This helps in the cases when rnh is not linked yet/already unlinked.
Differential Revision: https://reviews.freebsd.org/D27404
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.
We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.
We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.
Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().
Reviewed by: melifaro
MFC after: 2 weeks
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27279
It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D27278
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_forach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
consistent and prepare for upcoming iterator optimizations.
Differential Revision: https://reviews.freebsd.org/D27219
Don't use "new" as an identifier, and add explicit casts from void *.
As a general policy, FreeBSD doesn't make any C++ compatibility
guarantees for kernel headers like it does for userland, but it is a
small effort to do so in this case, to the benefit of a downstream
consumer (NetApp).
Reviewed by: rscheff
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27286
When links come and go, lacp goes into a "suppress distributing" mode
where it drops traffic for 3 seconds. When in this mode, lagg/lacp
historiclally drops traffic with ENETDOWN. That return value causes TCP
to close any connection where it gets that value back from the lower
parts of the stack. This means that any TCP connection with active
traffic during a 3-second windown when an LACP link comes or goes
would get closed.
TCP treats return values of ENOBUFS as transient errors, and re-schedules
transmission later. So rather than returning ENETDOWN, lets
return ENOBUFS instead. This allows TCP connections to be preserved.
I've tested this by repeatedly bouncing links on a Netlfix CDN server
under a moderate (20Gb/s) load and overved ENOBUFS reported back to
the TCP stack (as reported by a RACK TCP sysctl).
Reviewed by: jhb, jtl, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27188
lagg(4) replaces if_output method of its child interfaces and expects
that this method can be called only by child interfaces. But it is
possible that lagg_port_output() could be called by children of child
interfaces. In this case ifnet's if_lagg field is NULL. Add check that
lp is not NULL.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
Add the missing static keyword present in the declaration.
Reviewed by: melifaro
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27024
The goal of the fib support is to provide multiple independent
routing tables, isolated from each other.
net.add_addr_allfibs default tries to shift gears in the opposite
direction, unconditionally inserting all addresses to all of the fibs.
There are use cases when this is necessary, however this is not a
default expected behaviour, especially compared to other implementations.
Provide WARNING message for the setups with multiple fibs to notify
potential users of the feature.
Differential Revision: https://reviews.freebsd.org/D26076
Stop advancing counter past the current iteration number at the start
of iteration. This removes the need of subtracting one when
calculating index for copyout, and arguably fixes off-by-one reporting
of copied out elements when copyout failed.
Reviewed by: hselasky
Sponsored by: Mellanox Technologies / NVidia Networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27073
Use consistent output format for hex.
Print both media and mask where relevant.
Reviewed by: hselasky
Sponsored by: Mellanox Technologies/NVidia Networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27034
This can be used to detect if an ethernet address is specifically an
IPv6 multicast address, defined in accordance to RFC 2464.
ETHER_IS_MULTICAST is still preferred in the general case.
Reviewed by: ae
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26611
This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted into passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.
PR: 248652
Reported by: sg@efficientip.com
MFC with: r367093
The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
- The netmap_tx_irq() function gets called by iflib_timer(), which
gets scheduled with tick granularity (hz). This is not frequent
enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
result is that the transmitting netmap application is not woken
up fast enough to saturate the link with small packets.
- The iflib_timer() functions also calls isc_txd_credits_update()
to ask for more TX completion updates. However, this violates
the netmap requirement that only txsync can access the TX queue
for datapath operations. Only netmap_tx_irq() may be called out
of the txsync context.
This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.
This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.
Also, fix the timers usage to make sure that each queue is serviced
by a different CPU.
PR: 248652
Reported by: sg@efficientip.com
MFC after: 2 weeks
the failover protocol is supported due to limitations in the IPoIB
architecture. Refer to the lagg(4) manual page for how to configure
and use this new feature. A new network interface type,
IFT_INFINIBANDLAG, has been added, similar to the existing
IFT_IEEE8023ADLAG .
ifconfig(8) has been updated to accept a new laggtype argument when
creating lagg(4) network interfaces. This new argument is used to
distinguish between ethernet and infiniband type of lagg(4) network
interface. The laggtype argument is optional and defaults to
ethernet. The lagg(4) command line syntax is backwards compatible.
Differential Revision: https://reviews.freebsd.org/D26254
Reviewed by: melifaro@
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
802.1ad interfaces are created with ifconfig using the "vlanproto" parameter.
Eg., the following creates a 802.1Q VLAN (id #42) over a 802.1ad S-VLAN
(id #5) over a physical Ethernet interface (em0).
ifconfig vlan5 create vlandev em0 vlan 5 vlanproto 802.1ad up
ifconfig vlan42 create vlandev vlan5 vlan 42 inet 10.5.42.1/24
VLAN_MTU, VLAN_HWCSUM and VLAN_TSO capabilities should be properly
supported. VLAN_HWTAGGING is only partially supported, as there is
currently no IFCAP_VLAN_* denoting the possibility to set the VLAN
EtherType to anything else than 0x8100 (802.1ad uses 0x88A8).
Submitted by: Olivier Piras
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D26436
connections over multiple paths.
Multipath routing relies on mbuf flowid data for both transit
and outbound traffic. Current code fills mbuf flowid from inp_flowid
for connection-oriented sockets. However, inp_flowid is currently
not calculated for outbound connections.
This change creates simple hashing functions and starts calculating hashes
for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
system.
Reviewed by: glebius (previous version),ae
Differential Revision: https://reviews.freebsd.org/D26523
This flag is going to be used by IKE daemon to signal if
Extended Sequence Number feature is going to be used.
Value for this flag was taken from OpenBSD source code
6b4cbaf181
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: ae
Differential revision: https://reviews.freebsd.org/D22366
Obtained from: Semihalf
Sponsored by: Stormshield
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.
Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting
Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26409
We're not allowed to hold NET_EPOCH while sleeping, so when we call ioctl()
handlers for member interfaces we cannot be in NET_EPOCH. We still need some
protection of our CK_LISTs, so hold BRIDGE_LOCK instead.
That requires changing BRIDGE_LOCK into a sleepable lock, and separating the
BRIDGE_RT_LOCK, to protect bridge_rtnode lists. That lock is taken in the data
path (while in NET_EPOCH), so it cannot be a sleepable lock.
While here document the locking strategy.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D26418
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).
In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.
Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689
Nexthop lookup was not consireding rt_flags when doing
structure comparison, which lead to an original nexthop
selection when changing flags. Fix the case by adding
rt_flags field into comparison and rearranging nhop_priv
fields to allow for efficient matching.
Fix `route change X/Y flags` case - recent changes
disallowed specifying RTF_GATEWAY flag without actual gateway.
It turns out, route(8) fills in RTF_GATEWAY by default, unless
-interface flag is specified. Fix regression by clearing
RTF_GATEWAY flag instead of failing.
Fix route flag reporting in RTM_CHANGE messages by explicitly
updating rtm_flags after operation competion.
Add IPv4/IPv6 tests for flag-only route changes.
This change is based on the nexthop objects landed in D24232.
The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.
Simular to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.
With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presense of
NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.
User-visible changes:
The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.
All routes now comes with weight, default weight is 1, maximum is 2^24-1.
Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.
Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1
route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20
netstat -6On
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
1 ------- ------- ------- --------------------------------------- --------- 1
13 10 1 2001:db8::2 vlan2
14 20 2 2001:db8::3 vlan2
Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default
Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449
For interfaces that do not support SIOCGIFMEDIA (for which there are
quite a few) the only fallback is to query the interface for
if_data->ifi_link_state. While it's possible to get at if_data for an
interface via getifaddrs(3) or sysctl, both are heavy weight mechanisms.
SIOCGIFDATA is a simple ioctl to retrieve this fast with very little
resource use in comparison. This implementation mirrors that of other
similar ioctls in FreeBSD.
Submitted by: Roy Marples <roy@marples.name>
Reviewed by: markj
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D26538
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
that rtentry can contain multiple nexthops.
Differential Revision: https://reviews.freebsd.org/D26497
* Zero gw_sdl if switching to interface route - the assumption
that underlying storage is zeroed is incorrect with route changes.
* Apply proper flag mask to rte.
Reported by: vangyzen
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.
A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.
On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.
An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.
On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.
Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.
Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.
Reviewed by: kib@
Relnotes: Yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25873
These are similar to the existing VLAN capabilities.
Reviewed by: kib@
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25873
Currently kernel requests deletion for the certain routes with specified gateway,
but this gateway is not actually checked. With multipath routes, internal gateway
checking becomes mandatory. Add the logic performing this check.
Generalise RTF_PINNED routes to the generic route priorities, simplifying the logic.
Add lookup_prefix() function to perform exact match search based on data in @info.
Differential Revision: https://reviews.freebsd.org/D26356
There's a race where dying vnets move their interfaces back to their original
vnet, and if_epair cleanup (where deleting one interface also deletes the other
end of the epair). This is commonly triggered by the pf tests, but also by
cleanup of vnet jails.
As we've not yet been able to fix the root cause of the issue work around the
panic by not dereferencing a NULL softc in epair_qflush() and by not
re-attaching DYING interfaces.
This isn't a full fix, but makes a very common panic far less likely.
PR: 244703, 238870
Reviewed by: lutz_donnerhacke.de
MFC after: 4 days
Differential Revision: https://reviews.freebsd.org/D26324
Use the same link-level gateway when adding or deleting interface routes.
This helps nexthop checking in the upcoming multipath changes.
Differential Revision: https://reviews.freebsd.org/D26317
After nexthop introduction, loopback routes for the interface addresses
were created without embedding actual interface index in the gateway.
The latter is needed to pass the IPv6 scope during transmission via loopback..
Fix the regression by actually using passed gateway data with interface index.
Differential Revision: https://reviews.freebsd.org/D26306
The pidx argument of isc_rxd_flush() indicates which is the last valid
receive descriptor to be used by the NIC. However, current code has
multiple issues:
- Intel drivers write pidx to their RDT register, which means that
NICs will only use the descriptors up to pidx-1 (modulo ring size N),
and won't actually use the one pointed by pidx. This does not break
reception, but it is anyway confusing and suboptimal (the NIC will
actually see only N-2 descriptors as available, rather than N-1).
Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic).
- The semantic used by Intel (RDT is one descriptor past the last
valid one) is used by most (if not all) NICs, and it is also used
on the TX side (also in iflib). Since iflib is not currently
using this semantic for RX, it must decrement fl->ifl_pidx
(modulo N) before calling isc_rxd_flush(), and then the
per-driver callback implementation must increment the index
again (to match the real semantic). This is confusing and suboptimal.
- The iflib refill function is also called at initialization.
However, in case the ring size is smaller than 128 (e.g. if_mgb),
the refill function will actually prepare all the receive
descriptors (N), without leaving one unused, as most of NICs assume
(e.g. to avoid RDT to overrun RDH). I can speculate that the code
looks like this right now because this issue showed up during
testing (e.g. with if_mgb), and it was easy to workaround by
decrementing pidx before isc_rxd_flush().
The goal of this change is to simplify the code (removing a bunch
of instructions from the RX fast path), and to make the semantic of
isc_rxd_flush() consistent across drivers. To achieve this, we:
- change the semantics of the pidx argument to the usual one (that
is the index one past the last valid one), so that both iflib and
drivers avoid the decrement/increment dance.
- fix the initialization code to prepare at most N-1 descriptors.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26191
No functional changes.
Initially this function was created to perform runtime flag conversions
for the previous incarnation of fib lookup functions. As these functions
got deprecated, move the function to the file with the only remaining
caller. Lastly, rename it to convert_rt_to_nh_flags() to follow the
naming notation.
No functional changes.
net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
of functions and structures between the routing subsystem components. At
that time route_var.h was included by many files external to the routing
subsystem, which largerly defeats its purpose.
As currently this is not the case anymore and amount of route_var.h includes
is roughly the same as shared.h, retire the latter in favour of the former.
As nexthops are immutable, some operations such as route attribute changes
require nexthop fetching, forking, modification and route switching.
These operations are not atomic, so they may need to be retried multiple
times in presence of multiple speakers changing the same route.
This change introduces "synchronisation" primitive: route_update_conditional(),
simplifying logic for route changes and upcoming multipath operations.
Differential Revision: https://reviews.freebsd.org/D26216
At initialization time, the netmap RX refill function used to
prepare the NIC RX ring with N-1 buffers rather than N (with
N equal to the number of descriptors in the NIC RX ring).
This is not how netmap is supposed to work, as it would keep
kring->nr_hwcur not in sync with the NIC "next index to refill"
(i.e., fl->ifl_pidx). Instead we prepare N buffers, although we
still publish (with isc_rxd_flush()) only the first N-1 buffers,
to avoid the NIC producer pointer to overrun the NIC consumer
pointer (for NICs where this is a real issue, e.g. Intel ones).
MFC after: 2 weeks
rtentry lock traditionally served 2 purposed: first was protecting refcounts,
the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
the need for the latter reduced.
To be more precise, the following rte field are mutable:
rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.
rt_nhop and rt_weight (addressed in this review) are read under rib lock and
stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.
rt_expire and rte_flags reads will be dealt in a separate reviews soon.
Differential Revision: https://reviews.freebsd.org/D26162
The semantic of the pidx argument of isc_rxd_flush() is the
last valid index of in the free list, rather than the next
index to be published. However, netmap was still using the
old convention. While there, also refactor the netmap_fl_refill()
to simplify a little bit and add an assertion.
MFC after: 2 weeks
No functional changes.
Most of the routing flags are stored in the netxtop instead of rtentry.
Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code
checking routing flags.
In the new multipath code, rt->rt_nhop may actually point to nexthop group
instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->...
accesses.
Differential Revision: https://reviews.freebsd.org/D26156
Allow to dynamically grow the amount of fibs in each vnet.
This change alters current behavior. Currently, if one defines
ROUTETABLES > 1 in the kernel config, each vnet will be created
with the number of fibs defined in the kernel config.
After this commit vnets will be created with fibs=1.
Dynamic net.fibs is not compatible with net.add_addr_allfibs.
The plan is to deprecate the latter and make
net.add_addr_allfibs=0 default behaviour.
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26062
boundry change the last two IF_Mbps(2500) and additionally one
IF_Mbps(5000) to ULL as well.
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")
Remove unused arguments from dom_rtattach/dom_rtdetach functions and make
them return/accept 'struct rib_head' instead of 'void **'.
Declare inet/inet6 implementations in the relevant _var.h headers similar
to domifattach / domifdetach.
Add rib_subscribe_internal() function to accept subscriptions to the rnh
directly.
Differential Revision: https://reviews.freebsd.org/D26053
The lagg_clone_destroy() handles detach and waiting for ifconfig callers
to drain already.
This narrows the race for 2 panics that the tests triggered. Both were a
consequence of adding a port to the lagg device after it had already detached
from all of its ports. The link state task would run after lagg_clone_destroy()
free'd the lagg softc.
kernel:trap_fatal+0xa4
kernel:trap_pfault+0x61
kernel:trap+0x316
kernel:witness_checkorder+0x6d
kernel:_sx_xlock+0x72
if_lagg.ko:lagg_port_state+0x3b
kernel:if_down+0x144
kernel:if_detach+0x659
if_tap.ko:tap_destroy+0x46
kernel:if_clone_destroyif+0x1b7
kernel:if_clone_destroy+0x8d
kernel:ifioctl+0x29c
kernel:kern_ioctl+0x2bd
kernel:sys_ioctl+0x16d
kernel:amd64_syscall+0x337
kernel:trap_fatal+0xa4
kernel:trap_pfault+0x61
kernel:trap+0x316
kernel:witness_checkorder+0x6d
kernel:_sx_xlock+0x72
if_lagg.ko:lagg_port_state+0x3b
kernel:do_link_state_change+0x9b
kernel:taskqueue_run_locked+0x10b
kernel:taskqueue_run+0x49
kernel:ithread_loop+0x19c
kernel:fork_exit+0x83
PR: 244168
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D25284
After moving the route control plane code from net/route.c,
all rtzone users ended up being in net/route_ctl.c.
Move uma(9) rtzone setup/teardown code to net/route_ctl.c as well
to have everything in a single place.
While here, remove custom initializers from the zone.
It was added originally to avoid setup/teardown of costy per-cpu couters.
With these counters removed, the only remaining job was avoiding rte mutex
setup/teardown. Mutex setup is relatively cheap. Additionally, this mutex
will soon be removed. With that in mind, there is no sense in keeping
custom zone callbacks.
Differential Revision: https://reviews.freebsd.org/D26051
It is possible for rn_delete() to return NULL. If this happens, then set
*perror to ESRCH, as is done in the rest of the function.
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D25871
For drivers with IFLIB_HAS_RXCQ set, there is a separate completion
queue. In this case, the netmap rxsync routine needs to update
rxq->ifr_cq_cidx in the same way it is updated by iflib_rxeof().
This improves the situation for vmx(4) and bnxt(4) drivers, which
use iflib and have the IFLIB_HAS_RXCQ bit set.
PR: 248494
MFC after: 3 weeks
First, fix the initialization of the fl->ifl_rxd_idxs array,
which was affected by an off-by-one bug.
Once there, refactor the function to use better names for
local variables, optimize the variable assignments, and
merge the bus_dmamap_sync() inner loop with the outer one.
PR: 248494
MFC after: 3 weeks
Netmap only uses free list 0 to keep it consistent with its
one-to-one mapping between each netmap ring and a device RX
(or TX) queue.
However, the current iflib_netmap_rxsync() routine was
mistakenly updating the ifl_cidx field of both free lists.
PR: 248494
MFC after: 2 weeks
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplyify the kernel memory allocator interfaces a bit.
Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
In case the network device has a RX or TX control queue, the correct
number of TX/RX descriptors is contained in the second entry of the
isc_ntxd (or isc_nrxd) array, rather than in the first entry.
This case is correctly handled by iflib_device_register() and
iflib_pseudo_register(), but not by iflib_netmap_attach().
If the first entry is larger than the second, this can result in a
panic. This change fixes the bug by introducing two helper functions
that also lead to some code simplification.
PR: 247647
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D25541
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
While it doesn't trigger INVARIANTS or WITNESS on head it does in stable/12.
There's also no reason for it, as we can easily report the out of memory error
to the caller (i.e. userspace). All of these can already fail.
PR: 248046
MFC after: 3 days