When different processes open separate subsets of the
available rings of a same netmap interface, a device
reset may be performed while one of the processes
is actively using some rings (e.g., caused by another
process executing a nmport_open()).
With this patch, such situation will cause the
active process to get a POLLERR, so that it can
have a chance to detect the situation.
We also guarantee that no process is running a txsync
or rxsync (ioctl or poll) while an iflib device reset
is in progress.
PR: 252453
MFC after: 1 week
Doing a 'dd' over iscsi will reliably cause stalls. Tx
cleaning _should_ reliably happen as data is sent.
However, currently if the transmit queue fills it will
wait until the iflib timer (hz/2) runs.
This change causes the the tx taskq thread to be run
if there are completed descriptors.
While here:
- make timer interrupt delay a sysctl
- simplify txd_db_check handling
- comment on INTR types
Background on the change:
Initially doorbell updates were minimized by only writing to the register
on every fourth packet. If txq_drain would return without writing to the
doorbell it scheduled a callout on the next tick to do the doorbell write
to ensure that the write otherwise happened "soon". At that time a sysctl
was added for users to avoid the potential added latency by simply writing
to the doorbell register on every packet. This worked perfectly well for
e1000 and ixgbe ... and appeared to work well on ixl. However, as it
turned out there was a race to this approach that would lockup the ixl MAC.
It was possible for a lower producer index to be written after a higher one.
On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial
response was to add a lock around doorbell writes - fixing the problem but
adding an unacceptable amount of lock contention.
The next iteration was to use transmit interrupts to drive delayed doorbell
writes. If there were no packets in the queue all doorbell writes would be
immediate as the queue started to fill up we could delay doorbell writes
further and further. At the start of drain if we've cleaned any packets we
know we've moved the state machine along and we write the doorbell (an
obvious missing optimization was to skip that doorbell write if db_pending
is zero). This change required that tx interrupts be scheduled periodically
as opposed to just when the hardware txq was full. However, that just leads
to our next problem.
Initially dedicated msix vectors were used for both tx and rx. However, it
was often possible to use up all available vectors before we set up all the
queues we wanted. By having rx and tx share a vector for a given queue we
could halve the number of vectors used by a given configuration. The problem
here is that with this change only e1000 passed the necessary value to have
the fast interrupt drive tx when appropriate.
Reported by: mav@
Tested by: mav@
Reviewed by: gallatin@
MFC after: 1 month
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D27683
In some error paths we would fail to detach from the iflib taskqueue
groups. Also move the detach code into its own subroutine instead of
duplicating it.
Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27342
Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted into passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.
PR: 248652
Reported by: sg@efficientip.com
MFC with: r367093
The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
- The netmap_tx_irq() function gets called by iflib_timer(), which
gets scheduled with tick granularity (hz). This is not frequent
enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
result is that the transmitting netmap application is not woken
up fast enough to saturate the link with small packets.
- The iflib_timer() functions also calls isc_txd_credits_update()
to ask for more TX completion updates. However, this violates
the netmap requirement that only txsync can access the TX queue
for datapath operations. Only netmap_tx_irq() may be called out
of the txsync context.
This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.
This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.
Also, fix the timers usage to make sure that each queue is serviced
by a different CPU.
PR: 248652
Reported by: sg@efficientip.com
MFC after: 2 weeks
The pidx argument of isc_rxd_flush() indicates which is the last valid
receive descriptor to be used by the NIC. However, current code has
multiple issues:
- Intel drivers write pidx to their RDT register, which means that
NICs will only use the descriptors up to pidx-1 (modulo ring size N),
and won't actually use the one pointed by pidx. This does not break
reception, but it is anyway confusing and suboptimal (the NIC will
actually see only N-2 descriptors as available, rather than N-1).
Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic).
- The semantic used by Intel (RDT is one descriptor past the last
valid one) is used by most (if not all) NICs, and it is also used
on the TX side (also in iflib). Since iflib is not currently
using this semantic for RX, it must decrement fl->ifl_pidx
(modulo N) before calling isc_rxd_flush(), and then the
per-driver callback implementation must increment the index
again (to match the real semantic). This is confusing and suboptimal.
- The iflib refill function is also called at initialization.
However, in case the ring size is smaller than 128 (e.g. if_mgb),
the refill function will actually prepare all the receive
descriptors (N), without leaving one unused, as most of NICs assume
(e.g. to avoid RDT to overrun RDH). I can speculate that the code
looks like this right now because this issue showed up during
testing (e.g. with if_mgb), and it was easy to workaround by
decrementing pidx before isc_rxd_flush().
The goal of this change is to simplify the code (removing a bunch
of instructions from the RX fast path), and to make the semantic of
isc_rxd_flush() consistent across drivers. To achieve this, we:
- change the semantics of the pidx argument to the usual one (that
is the index one past the last valid one), so that both iflib and
drivers avoid the decrement/increment dance.
- fix the initialization code to prepare at most N-1 descriptors.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26191
At initialization time, the netmap RX refill function used to
prepare the NIC RX ring with N-1 buffers rather than N (with
N equal to the number of descriptors in the NIC RX ring).
This is not how netmap is supposed to work, as it would keep
kring->nr_hwcur not in sync with the NIC "next index to refill"
(i.e., fl->ifl_pidx). Instead we prepare N buffers, although we
still publish (with isc_rxd_flush()) only the first N-1 buffers,
to avoid the NIC producer pointer to overrun the NIC consumer
pointer (for NICs where this is a real issue, e.g. Intel ones).
MFC after: 2 weeks
The semantic of the pidx argument of isc_rxd_flush() is the
last valid index of in the free list, rather than the next
index to be published. However, netmap was still using the
old convention. While there, also refactor the netmap_fl_refill()
to simplify a little bit and add an assertion.
MFC after: 2 weeks
For drivers with IFLIB_HAS_RXCQ set, there is a separate completion
queue. In this case, the netmap rxsync routine needs to update
rxq->ifr_cq_cidx in the same way it is updated by iflib_rxeof().
This improves the situation for vmx(4) and bnxt(4) drivers, which
use iflib and have the IFLIB_HAS_RXCQ bit set.
PR: 248494
MFC after: 3 weeks
First, fix the initialization of the fl->ifl_rxd_idxs array,
which was affected by an off-by-one bug.
Once there, refactor the function to use better names for
local variables, optimize the variable assignments, and
merge the bus_dmamap_sync() inner loop with the outer one.
PR: 248494
MFC after: 3 weeks
Netmap only uses free list 0 to keep it consistent with its
one-to-one mapping between each netmap ring and a device RX
(or TX) queue.
However, the current iflib_netmap_rxsync() routine was
mistakenly updating the ifl_cidx field of both free lists.
PR: 248494
MFC after: 2 weeks
In case the network device has a RX or TX control queue, the correct
number of TX/RX descriptors is contained in the second entry of the
isc_ntxd (or isc_nrxd) array, rather than in the first entry.
This case is correctly handled by iflib_device_register() and
iflib_pseudo_register(), but not by iflib_netmap_attach().
If the first entry is larger than the second, this can result in a
panic. This change fixes the bug by introducing two helper functions
that also lead to some code simplification.
PR: 247647
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D25541
- Get rid of the ifl_vm_addrs array. It is not used by any existing
consumer, so we are just dirtying a couple of cache lines for no
reason.
- Use uma_zalloc(fl->ifl_zone) instead of m_cljget(). Otherwise
m_cljget() is doing unnecessary work to look up the correct zone, when
iflib already knows what that zone is.
- ifl_gen is only used when INVARIANTS is on, so make that more clear.
- Fix some style nits and inconsistencies.
Reviewed by: gallatin
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25490
When refilling an rx freelist, make sure we only update the hardware
producer index if at least one cluster was allocated. Otherwise the
NIC is programmed to write a previously used cluster, typically
resulting in a use-after-free when packet data is written by the
hardware.
Also make sure that we don't update the fragment index cursor if the
last allocation attempt didn't succeed. For at least Intel drivers,
iflib assumes that the consumer index and fragment index cursor stay in
lockstep, but this assumption was violated in the face of cluster
allocation failures.
Reported and tested by: pho
Reviewed by: gallatin, hselasky
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25489
In the current iflib_netmap_rxsync, there is nothing that prevents
kring->nr_hwtail to overrun kring->nr_hwcur during the descriptor
import phase. This may cause errors in netmap applications, such as:
em1 RX0: fail 'head < kring->nr_hwcur || head > kring->nr_hwtail'
h 795 c 795 t 282 rh 795 rc 795 rt 282 hc 282 ht 282
Reviewed by: gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D25252
- a cloneattach failure will not currently be handled correctly,
jump to the right target
- pseudo devices are all treat as if they're ethernet devices -
this often doesn't make sense
MFC after: 1 week
Sponsored by: Netgate, Inc.
Differential Revision: https://reviews.freebsd.org/D25083
In the receive interrupt routine, always call netmap_rx_irq().
The latter function will return != NM_IRQ_PASS if netmap is not
active on that specific receive queue, so that the driver can go
on with iflib_rxeof(). Note that netmap supports partial opening,
where only a subset of the RX or TX rings can be open in netmap mode.
Checking the IFCAP_NETMAP flag is not enough to make sure that the
queue is indeed in netmap mode.
Moreover, in case netmap_rx_irq() returns NM_IRQ_RESCHED, it means
that netmap expects the driver to call netmap_rx_irq() again as soon
as possible. Currently, this may happen when the device is attached
to a VALE switch.
Reviewed by: gallatin
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25167
With a length of 16, the name ("<if name>:TX(<qid>):callout") typically
gets truncated.
PR: 245712
Reported by: ghuckriede@blackberry.com
MFC after: 1 week
This patch is intended to solve a specific problem that iavf(4)
encounters, but what it does can be extended to solve other issues.
To summarize the iavf(4) issue, if the PF driver configures VLAN
anti-spoof, then the VF driver needs to make sure no untagged traffic is
sent if a VLAN is configured, and vice-versa. This can be an issue when
a VLAN is being registered or unregistered, e.g. when a packet may be on
the ring with a VLAN in it, but the VLANs are being unregistered. This
can cause that tagged packet to go out and cause an MDD event.
To fix this, include a new interface-dependent function that drivers can
implement named IFDI_NEEDS_RESTART(). Right now, this function is called
in iflib_vlan_unregister/register() to determine whether the interface
needs to be stopped and started when a VLAN is registered or
unregistered. The default return value of IFDI_NEEDS_RESTART() is true,
so this fixes the MDD problem that iavf(4) encounters, since the
interface rings are flushed during a stop/init.
A future change to iavf(4) will implement that function just in case the
default value changes, and to make it explicit that this interface reset
is required when a VLAN is added or removed.
Reviewed by: gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22086
taskqgroup initialization was broken into two steps:
1. allocate the taskqgroup structure, at SI_SUB_TASKQ;
2. initialize taskqueues, start taskqueue threads, enqueue "binder"
tasks to bind threads to specific CPUs, at SI_SUB_SMP.
Step 2 tries to handle the case where tasks have already been attached
to a queue, by migrating them to their intended queue. In particular,
tasks can't be enqueued before step 2 has completed. This breaks NFS
mountroot on systems using an iflib-based driver when EARLY_AP_STARTUP
is not defined, since mountroot happens before SI_SUB_SMP in this case.
Simplify initialization: do all initialization except for CPU binding at
SI_SUB_TASKQ. This means that until CPU binding is completed, group
tasks may be executed on a CPU other than that to which they were bound,
but this should not be a problem for existing users of the taskqgroup
KPIs.
Reported by: sbruno
Tested by: bdragon, sbruno
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24188
ThunderX cluster systems are panicking on boot with a failed assertion
MPASS(gtask != NULL && gtask->gt_taskqueue != NULL). Split the
assertion so that it's clear which part is failing.
ifsd_cidx is never used, and the line removed from rxd_frag_to_sd() is
just dead code.
Reviewed by: erj, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23951
The vmx driver is an example of an iflib driver that might report
packets using non-contiguous descriptors (with unused descriptors
either between received packets or between the fragments of a received
packet), so this assertion needs to be removed.
For such drivers, the freelist producer and consumer indexes don't
relate directly to driver ring slots (the driver deals directly with
freelist buffer indexes supplied by iflib during refill, and reports
them with each fragment during packet reception), but do continue to
be used by iflib for accounting, such as determining the number of
ring slots that are refillable.
PR: 243126, 243392, 240628
Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
Reviewed by: gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23946
The dmamap for zero-length fragments should not be unloaded, as doing
so breaks the the cluster-reuse logic in _iflib_fl_refill().
All zero-length fragments are now handled by the assemble_segments()
path so that the cluster-reuse logic there does not have to be
replicated in the small-single-fragment-packet path of
iflib_rxd_pkt_get().
Packets consisting entirely of zero-length fragments (which result in
a NULL mbuf pointer) are now properly tolerated. This allows drivers
(such as the vmx driver) to pass such packets to iflib when a
descriptor error occurs during packet reception, the advantage being
that the refill of descriptors associated with the error packet are
handled via the existing iflib machinery without having to duplicate
parts of that machinery in the driver to handle that error case.
Reviewed by: avg, erj, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23945
This fixes a bug in iflib freelist management that breaks the required
correspondence between freelist indexes and driver ring slots.
PR: 243126, 243392, 240628
Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
Reviewed by: avg, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23943
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
switch over to opt-in instead of opt-out for epoch.
Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks
itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch
when processing its packets.
Now this will create recursive entrance in epoch in >90% network
drivers, but will guarantee safeness of the transition.
Mark several tested drivers as IFF_KNOWSEPOCH.
Reviewed by: hselasky, jeff, bz, gallatin
Differential Revision: https://reviews.freebsd.org/D23674
When the receive ring cannot be filled with mbufs, due to lack of memory,
no more interrupts may be generated to fill the receive ring later on.
Make sure to have a watchdog, to try refilling the receive ring from time
to time, hopefully when more mbufs are available.
Differential Revision: https://reviews.freebsd.org/D23315
MFC after: 1 week
Reviewed by: gallatin@
Sponsored by: Mellanox Technologies
In upcoming changes ether_input() is going to be changed not
to enter the network epoch. It is going to be responsibility
of network interrupt. In case of iflib - its taskqueue.
While changing link state in iflib_link_state_change(), queues are
marked as IFLIB_QUEUE_IDLE to disable watchdog. Currently, iflib_timer()
watchdog does not check for previous queue status before marking it as
IFLIB_QUEUE_HUNG.
This patch adds check of queue status before marking it as hung.
Signed-off-by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
PR: 239240
Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reported by: ultima@
Reviewed by: gallatin@, erj@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21712
DMA memory allocations using the bus_dma.h interface are not properly
released in all cases for both Tx and Rx. This causes ~448 bytes of
M_DEVBUF allocations to be leaked.
First, the DMA maps for Rx are not properly destroyed. A slight attempt
is made in iflib_fl_bufs_free to destroy the maps if we're detaching.
However, this function may not be reliably called during detach. Indeed,
there is a comment "asking" if this should be moved out.
Fix this by moving the bus_dmamap_destroy call into iflib_rx_sds_free,
where we already sync and unload the DMA.
Second, the DMA tag associated with the ifr_ifdi descriptor DMA is not
released properly anywhere. Add a call to iflib_dma_free in
iflib_rx_structures_free.
Third, use of NULL as a canary value on the map pointer returned by
bus_dmamap_create is not valid. On some platforms, notably x86, this
value may be NULL. In this case, we fail to properly release the related
resources.
Remove the NULL checks on map values in both iflib_fl_bufs_free and
iflib_txsd_destroy.
With all of these fixes applied, the leaks to M_DEVBUF are squelched,
and iflib drivers now seem to properly cleanup when detaching.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22203
From Jake:
The iflib stack failed to release all of the memory allocated under
M_IFLIB during device detach.
Specifically, the ifmp_ring, the ift_ifdi Tx DMA info, and the ifr_ifdi Rx
DMA info were not being released.
Release this memory so that iflib won't leak memory when a device
detaches.
Since we're freeing the ift_ifdi pointer during iflib_txq_destroy we
need to call this only after iflib_dma_free in iflib_tx_structures_free.
Additionally, also ensure that we destroy the callout mutex associated
with each Tx queue when we free it.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22157
From Jake:
Calling ether_ifdetach after iflib_stop leads to a potential race where
a stale ifp pointer can remain in the route entry list for IPv6 traffic.
This will potentially cause a page fault or other system instability if
the ifp pointer is accessed.
Move both iflib_netmap_detach and ether_ifdetach to be called prior to
iflib_stop. This avoids the race above, and helps ensure that other ifp
references are removed before stopping the interface.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@, jhb@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D22071
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).
It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).
The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.
Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.
The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.
No other functional change intended.
Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421
As of r347221 the iflib legacy interrupt mode setup assumes that drivers
perform both receive and transmit processing from the interrupt handler.
This assumption is invalid in the vmxnet3 driver, so introduce the
IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the
interrupt handler.
PR: 239118
Reported and tested by: Juraj Lutter <otis@sk.freebsd.org>
Obtained from: marius
Reviewed by: gallatin
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21831