Dmamap is created only on IFC attach. If we remove it on
buffer release, we won't be able to do ifconfig down&up. Only destroy
when in detach.
Reported by: wma
Reviewed by: wma
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14060
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
As everywhere else, we want to pass rman_get_start(irq->ii_res). This
caused set affinity errors when not using MSI-X vectors (legacy and MSI
interrupts).
Reported by: sbruno
Sponsored by: Limelight Networks
gtask->gt_taskqueue is NULL when EARLY_AP_STARTUP is not enabled.
Remove assertion to allow this config to work.
Reported by: oleg
Sponsored by: Limelight Networks
It seems that tcp_lro_rx() doesn't verify TCP checksums, so
if there are bad checksums in the packets caused by invalid data, the
invalid data will pass through without errors.
This was noticed with the igb driver and a specific internet host:
fetch http://www.mpfr.org/mpfr-current/mpfr-3.1.6.tar.xz -o test.bin && sha256 test.bin
Would result in a different value sometimes.
This ends up making LRO require RXCSUM to be enabled, and RXCSUM to
support TCP and UDP checksums.
PR: 224346
Reported by: gjb
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13561
This will attempt to use a different thread/core on the same L2
cache when possible, or use the same cpu as the rx thread when not.
If SMP isn't enabled, don't go looking for cores to use. This is mostly
useful when using shared TX/RX queues.
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12446
Email address has changed, uses consistent name (Matthew, not Matt)
Reported by: Matthew Macy <mmacy@mattmacy.io>
Differential Revision: https://reviews.freebsd.org/D13537
Previously, the counter was only incremented when m_append() failed. Since
the function can also fail on m_dup() now, increment the counter there as
well.
Sponsored by: Limelight Networks
Fix memory leak where mbuf chain wasn't free()d if iflib_ether_pad()
has a failure in m_dup().
Reported by: "Ryan Stone" <rysto32@gmail.com>
Sponsored by: Limelight Networks
If ethernet padding is enabled, and a read-only mbuf is passed,
it would modify the mbuf using m_append(). Instead, call m_dup() and
append to the new packet.
Reported by: Pyun YongHyeon
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13414
Some bnxt devices do not correctly send frames smaller than
52 bytes (without CRC), so add a quirk that will pad frames to an
arbitrary size before passing off to the encap routine.
Reported by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13269
The LRO possible test was calling CURVNET_SET once for IPv4 or IPv6 for
each packet in a chain. Only call it once per chain instead.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: cem, ae
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13368
SIOCGIFXMEDIA is required for extended ethernet media types,
but iflib did not support it.
Reported by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13312
If a device didn't support MSI-X, ctx->ifc_cpus would not be initialized,
but the IRQ allocation routines still uses the value. Move the
initialization to common code.
Sponsored by: Limelight Networks
bit_nclear() takes the bit numbers for the start and end bits, not the start
and a count. This was resulting in memory corruption past the end of the
bitstr_t.
Sponsored by: Limelight Networks
The intent appears to be having one RX/TX queue set per core,
but since scctx->isc_n[tr]xqsets is set to max before calling
iflib_msix_init(), both end up being set to total number of cores.
Use ctx->ifc_sysctl_n[rt]xqs as the selected value and
scctx->isc_n[rt]xqsets as the max. This should result in what appears
to be the intended behaviour
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13096
Preserve packet order between tcp_lro_rx() and if_input() to avoid
creating extra corner cases. If no packets can be LROed, combine them
into one chain for submission via if_input(). If any packet can
potentially be LROed however, retain old behaviour and call if_input()
for each packet.
This should keep the 12% improvement for small packet forwarding intact,
but mostly avoids impacting the LRO case.
Reviewed by: cem, sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12876
ifl_pidx and ifl_credits are going out of sync in _iflib_fl_refill() as they
use different update log. Use the same update logic for both, and add a
final call to isc_rxd_refill() to handle early exits from the loop.
PR: 221990
Reported by: pho
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12798
iru_init() was declared and used outside the DEV_NETMAP
conditional blocks, but was implemented inside one. Move the
implementation out of the DEV_NETMAP block to allow building with
netmap disabled.
Reported by: Andrew Turner <andrew@fubar.geek.nz>
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12842
Fix error when refilling netmap buffers that resulted in the first
buffer of the successive passes through ifl_bus_addrs[] leaving the
first value unset (tmp_pidx started at 1, not zero after the first time
through the loop).
Leave the one unused buffer required by some NICs visible in the netmap
ring rather than hidden. There will always be a buffer in use by the
kernel now when an iflib driver is used via netmap.
Always get the netmap slot index via netmap_idx_n2k() to account for
nkr_hwofs in a consistent way.
Split shared functionality into new functions.
iru_init(): shared by _iflib_fl_refill() and netmap_fl_refill()
netmap_fl_refill(): shared by iflib_netmap_rxsync() and
iflib_netmap_rxq_init()
PR: 222744
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12769
It was reported on the community call that with
hw.pci.enable_msix=0, iflib would enable MSI-X on the device and attempt
to use it, which caused issues. Test the sysctl explicitly and do not
enable MSI-X if it's disabled globally.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12805
1. prefetch 128 bytes of mbufs.
2. Re-order filling the pkt_info so cache stalls happen at the end
3. Define empty prefetch2cachelines() macro when the function isn't present.
Provides small performance improvments on some hardware
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12447
Improved logging added in r323879 exposed an error during
attach. We need the irq, not the rid to work correctly. em uses
shared irqs, so it will use the same irq for TX as RX. bnxt does
not use shared irqs, or TX irqs at all, so there's no need to set
the TX irq affinity.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12496
Move TX out of the enqueue() path. As a result, we need
to have ifmp_ring_check_drainage() pick up from the abdicate state.
We also need to either enqueue the TX task, or check drainage
after calling ifmp_ring_enqueue() to ensure it's sent.
This change results in a 30% small packet forwarding improvement.
Reviewed by: olivier, sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12439
This allows tuning the rx budget for special load profiles
as well as more easily testing to determine sane defaults.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12445
Build a list of mbufs to pass to if_input() after LRO. Results in
12% small packet forwarding rate improvement.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12444
If the packet is smaller than MTU, disable the TSO flags.
Move TCP header parsing inside the IS_TSO?() test.
Add a new IFLIB_NEED_ZERO_CSUM flag to indicate the checksums need to be zeroed before TX.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12442
RXQ setup for netmap was broken because netmap_rxq_init was getting called
before IFDI_INIT - thus we ended up with ring tail pointer being reset to zero.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12140
This was really too big of a commit even if everything worked, but there
are multiple new issues introduced in the one huge commit, so it's not
worth keeping this until it's fixed.
I'll work on splitting this up into logical chunks and introduce them one
at a time over the next week or two.
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
by Matt Macy as well as other changes which he has accepted via pull
request to his github repo at https://github.com/mattmacy/networking/
This should bring -CURRENT and the github repo into close enough sync to
allow small feature branches rather than a large chain of interdependant
patches being developed out of tree. The reset of the synchronization
should be able to be completed on github by splitting the remaining
changes that are not yet ready into short feature branches for later
review as smaller commits.
Here is a summary of changes included in this patch:
1) More checks when INVARIANTS are enabled for eariler problem
detection
2) Group Task Queue cleanups
- Fix use of duplicate shortdesc for gtaskqueue malloc type.
Some interfaces such as memguard(9) use the short description to
identify malloc types, so duplicates should be avoided.
3) Allow gtaskqueues to use ithreads in addition to taskqueues
- In some cases, this can improve performance
4) Better logging when taskqgroup_attach*() fails to set interrupt
affinity.
5) Do not start gtaskqueues until they're needed
6) Have mp_ring enqueue function enter the ABDICATED rather than BUSY
state. This moves the TX to the gtaskq and allows processing to
continue faster as well as make TX batching more likely.
7) Add an ift_txd_errata function to struct if_txrx. This allows
drivers to inspect/modify mbufs before transmission.
8) Add a new IFLIB_NEED_ZERO_CSUM for drivers to indicate they need
checksums zeroed for checksum offload to work. This avoids modifying
packet data in the TX path when possible.
9) Use ithreads for iflib I/O instead of taskqueues
10) Clean up ioctl and support async ioctl functions
11) Prefetch two cachlines from each mbuf instead of one up to 128B. We
often need to parse packet header info beyond 64B.
12) Fix potential memory corruption due to fence post error in
bit_nclear() usage.
13) Improved hang detection and handling
14) If the packet is smaller than MTU, disable the TSO flags.
This avoids extra packet parsing when not needed.
15) Move TCP header parsing inside the IS_TSO?() test.
This avoids extra packet parsing when not needed.
16) Pass chains of mbufs that are not consumed by lro to if_input()
rather call if_input() for each mbuf.
17) Re-arrange packet header loads to get as much work as possible done
before a cache stall.
18) Lock the context when calling IFDI_ATTACH_PRE()/IFDI_ATTACH_POST()/
IFDI_DETACH();
19) Attempt to distribute RX/TX tasks across cores more sensibly,
especially when RX and TX share an interrupt. RX will attempt to
take the first threads on a core, and TX will attempt to take
successive threads.
20) Allow iflib_softirq_alloc_generic() to request affinity to the same
cpus an interrupt has affinity with. This allows TX queues to
ensure they are serviced by the socket the device is on.
21) Add new iflib sysctls to net.iflib:
- timer_int - interval at which to run per-queue timers in ticks
- force_busdma
22) Add new per-device iflib sysctls to dev.X.Y.iflib
- rx_budget allows tuning the batch size on the RX path
- watchdog_events Count of watchdog events seen since load
23) Fix error where netmap_rxq_init() could get called before
IFDI_INIT()
24) e1000: Fixed version of r323008: post-cold sleep instead of DELAY
when waiting for firmware
- After interrupts are enabled, convert all waits to sleeps
- Eliminates e1000 software/firmware synchronization busy waits after
startup
25) e1000: Remove special case for budget=1 in em_txrx.c
- Premature optimization which may actually be incorrect with
multi-segment packets
26) e1000: Split out TX interrupt rather than share an interrupt for
RX and TX.
- Allows better performance by keeping RX and TX paths separate
27) e1000: Separate igb from em code where suitable
Much easier to understand separate functions and "if (is_igb)" than
previous tests like "if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))"
#blamebruno
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12235
This creates conflicts with FreeBSD variations that may use it. The
usage of the flag M_TOOBIG is limited to iflib queue, thus using
one of M_PROTO flags is fine. There is no need to grab global flag.
Silence from: kmacy, sbruno (2 weeks)
Post-cold sleep instead of DELAY when waiting for firmware.
Convert softc mutex to an SX lock. Change all waits to sleeps
once interrupts are enabled (and it is safe to sleep).
Submitted by: Matt Macy <matt@mattmacy.io>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12101
Clang 5.0.0 got better warnings about printf format strings using %zd,
and this leads to the following -Werror warning on e.g. arm:
sys/net/iflib.c:1517:8: error: format specifies type 'ssize_t' (aka 'int') but the argument has type 'bus_size_t' (aka 'unsigned long') [-Werror,-Wformat]
sctx->isc_tx_maxsize, nsegments, sctx->isc_tx_maxsegsize);
^~~~~~~~~~~~~~~~~~~~
sys/net/iflib.c:1517:41: error: format specifies type 'ssize_t' (aka 'int') but the argument has type 'bus_size_t' (aka 'unsigned long') [-Werror,-Wformat]
sctx->isc_tx_maxsize, nsegments, sctx->isc_tx_maxsegsize);
^~~~~~~~~~~~~~~~~~~~~~~
Fix this by casting bus_size_t arguments to uintmax_t, and using %ju
instead.
Reviewed by: emaste
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D11679