ifsd_cidx is never used, and the line removed from rxd_frag_to_sd() is
just dead code.
Reviewed by: erj, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23951
The vmx driver is an example of an iflib driver that might report
packets using non-contiguous descriptors (with unused descriptors
either between received packets or between the fragments of a received
packet), so this assertion needs to be removed.
For such drivers, the freelist producer and consumer indexes don't
relate directly to driver ring slots (the driver deals directly with
freelist buffer indexes supplied by iflib during refill, and reports
them with each fragment during packet reception), but do continue to
be used by iflib for accounting, such as determining the number of
ring slots that are refillable.
PR: 243126, 243392, 240628
Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
Reviewed by: gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23946
The dmamap for zero-length fragments should not be unloaded, as doing
so breaks the the cluster-reuse logic in _iflib_fl_refill().
All zero-length fragments are now handled by the assemble_segments()
path so that the cluster-reuse logic there does not have to be
replicated in the small-single-fragment-packet path of
iflib_rxd_pkt_get().
Packets consisting entirely of zero-length fragments (which result in
a NULL mbuf pointer) are now properly tolerated. This allows drivers
(such as the vmx driver) to pass such packets to iflib when a
descriptor error occurs during packet reception, the advantage being
that the refill of descriptors associated with the error packet are
handled via the existing iflib machinery without having to duplicate
parts of that machinery in the driver to handle that error case.
Reviewed by: avg, erj, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23945
This fixes a bug in iflib freelist management that breaks the required
correspondence between freelist indexes and driver ring slots.
PR: 243126, 243392, 240628
Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
Reviewed by: avg, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23943
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:
- refectors lacp_select_tx_port_by_hash() and
lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
always called by lacp_select_tx_port()
- pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()
- adds a numa_domain field to if_snd_tag_alloc_params
- plumbs the numa domain into places where we allocate send tags
In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.
Reviewed by: rrs, hselasky, jhb
Sponsored by: Netflix
Implement counting of table entries linked on a per-table base
with an optional (if set > 0) limit of the maximum number of table
entries.
For that the public lltable_link_entry() and lltable_unlink_entry()
functions as well as the internal function pointers change from void
to having an int return type.
Given no consumer currently sets the new llt_maxentries this can be
committed on its own. The moment we make use of the table limits,
the callers of the link function must check the return value as
it can change and entries might not be added.
Adjustments for IPv6 (and possibly IPv4) will follow.
Sponsored by: Netflix (originally)
Reviewed by: melifaro
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22713
When generating an cloned interface instance in edsc_clone_create(),
generate a MAC address from the FF OUI with ether_gen_addr(). This allows us
to have unique local-link addresses. Previously, the MAC address was zero.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D23842
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
like the mlx-c5 and c6 that require a "setup" routine before
the tcp_ratelimit code can declare and use a rate. I add the
setup routine to if_var as well as fix tcp_ratelimit to call it.
I also revisit the rates so that in the case of a mlx card
of type c5/6 we will use about 100 rates concentrated in the range
where the most gain can be had (1-200Mbps). Note that I have
tested these on a c5 and they work and perform well. In fact
in an unloaded system they pace right to the correct rate (great
job mlx!). There will be a further commit here from Hans that
will add the respective changes to the mlx driver to support this
work (which I was testing with).
Sponsored by: Netflix Inc.
Differential Revision: ttps://reviews.freebsd.org/D23647
The locking defines for if_bridge used to live in if_bridgevar.h, but
they're only ever used by the bridge implementation itself (in
if_bridge.c). Moving them into the .c file.
Reported by: philip, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23808
switch over to opt-in instead of opt-out for epoch.
Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks
itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch
when processing its packets.
Now this will create recursive entrance in epoch in >90% network
drivers, but will guarantee safeness of the transition.
Mark several tested drivers as IFF_KNOWSEPOCH.
Reviewed by: hselasky, jeff, bz, gallatin
Differential Revision: https://reviews.freebsd.org/D23674
Revert parts of r353274 replacing vnet_state with a shutdown flag.
Not having the state flag for the current SI_SUB_* makes it harder to debug
kernel or module panics related to VNET bringup or teardown.
Not having the state also does not allow us to check for other dependency
levels between components, e.g. for moving interfaces.
Expand the VNET structure with the new boolean flag indicating that we are
doing a shutdown of a given vnet and update the vnet magic cookie for the
change.
Update libkvm to compile with a bool in the kernel struct.
Bump __FreeBSD_version for (external) module builds to more easily detect
the change.
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23097
When VIMAGE is enabled make sure the "m_pkthdr.rcvif" pointer is set
for all mbufs being input by the IGMP/MLD6 code. Else there will be a
NULL-pointer dereference in the netisr code when trying to set the
VNET based on the incoming mbuf. Add an assert to catch this when
queueing mbufs on a netisr to make debugging of similar cases easier.
Found by: Vladislav V. Prodan
PR: 244002
Reviewed by: bz@
MFC after: 1 week
Sponsored by: Mellanox Technologies
When the receive ring cannot be filled with mbufs, due to lack of memory,
no more interrupts may be generated to fill the receive ring later on.
Make sure to have a watchdog, to try refilling the receive ring from time
to time, hopefully when more mbufs are available.
Differential Revision: https://reviews.freebsd.org/D23315
MFC after: 1 week
Reviewed by: gallatin@
Sponsored by: Mellanox Technologies
The reason for this change is to make it clear the scope of the in-kernel usage
of IFM_TYPE_DESCRIPTIONS and IFM_SUBTYPE_ETHERNET_DESCRIPTIONS macros. Also it
is somewhat better C.
Reviewed by: hselasky
Sponsored by: Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D23620
Recent network epoch changes have left some drivers unexpectedly broken
and there is not yet a consensus on the correct fix. This is patch is
a minor performance impact until we can agree on the correct path
forward.
Reviewed by: core, network, imp, glebius, hselasky
Differential Revision: https://reviews.freebsd.org/D23515
During vnet cleanup vnet_if_uninit() checks that no more interfaces remain in
the vnet. Any interface borrowed from another vnet is returned by
vnet_if_return(). Other interfaces (i.e. cloned interfaces) should have been
destroyed by their cloner at this point.
The if_vlan VNET_SYSUNINIT had priority SI_ORDER_FIRST, which means it had
equal priority as vnet_if_uninit(). In other words: it was possible for it to
be called *after* vnet_if_uninit(), which would lead to assertion failures.
Set the priority to SI_ORDER_ANY, like other cloners to ensure that vlan
interfaces are destroyed before we enter vnet_if_uninit().
The sys/net/if_vlan test provoked this.
Software interrupt handlers are allowed to sleep. In swi_net() there
is a read lock behind NETISR_RLOCK() which in turn ends up calling
msleep() which means the whole of swi_net() cannot be protected by an
EPOCH(9) section. By default the NETISR_LOCKING feature is disabled.
This issue was introduced by r357004. This is a preparation step for
replacing the functionality provided by r357004.
Found by: kib@
Sponsored by: Mellanox Technologies
if_epair used the 'params' argument to pass a pointer to the b interface
through if_clone_create().
This pointer can be controlled by userspace, which means it could be abused to
trigger a panic. While this requires PRIV_NET_IFCREATE
privileges those are assigned to vnet jails, which means that vnet jails
could panic the system.
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
MFC after: 3 days
supposedly may call into ether_input() without network epoch.
They all need to be reviewed before 13.0-RELEASE. Some may need
be fixed. The flag is not planned to be used in the kernel for
a long time.
In upcoming changes ether_input() is going to be changed not
to enter the network epoch. It is going to be responsibility
of network interrupt. In case of iflib - its taskqueue.
Those interfaces may implicitly change their MTU on addition of parent
interface in addition to normal SIOCSIFMTU ioctl path, where the route
MTUs are updated normally.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Redirect (and temporal) route expiration was broken a while ago.
This change brings route expiration back, with unified IPv4/IPv6 handling code.
It introduces net.inet.icmp.redirtimeout sysctl, allowing to set
an expiration time for redirected routes. It defaults to 10 minutes,
analogues with net.inet6.icmp6.redirtimeout.
Implementation uses separate file, route_temporal.c, as route.c is already
bloated with tons of different functions.
Internally, expiration is implemented as an per-rnh callout scheduled when
route with non-zero rt_expire time is added or rt_expire is changed.
It does not add any overhead when no temporal routes are present.
Callout traverses entire routing tree under wlock, scheduling expired routes
for deletion and calculating the next time it needs to be run. The rationale
for such implemention is the following: typically workloads requiring large
amount of routes have redirects turned off already, while the systems with
small amount of routes will not inhibit large overhead during tree traversal.
This changes also fixes netstat -rn display of route expiration time, which
has been broken since the conversion from kread() to sysctl.
Reviewed by: bz
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D23075
Correction after r333476:
- write this as LOG_DEBUG again instead of LOG_INFO;
- get back function name into the message;
- error may be ESRCH if an address is removed in process (by carp f.e.),
not only ENOENT;
- expression complexity grows, so try making it more readable.
MFC after: 1 week