- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
This was enumerated with exhaustive search for sys/eventhandler.h includes,
cross-referenced against EVENTHANDLER_* usage with the comm(1) utility. Manual
checking was performed to avoid redundant includes in some drivers where a
common os_bsd.h (for example) included sys/eventhandler.h indirectly, but it is
possible some of these are redundant with driver-specific headers in ways I
didn't notice.
(These CUs did not show up as missing eventhandler.h in tinderbox.)
X-MFC-With: r347984
Function 'mlx5e_alloc_rx_wqe' can never be inlined because it uses alloca
(override using the always_inline attribute)
MFC after: 3 days
Sponsored by: Mellanox Technologies
Allow more macro arguments and split the variable type and name into
separate arguments. This allows simple and powerful copy and extraction
of values from IFC based structures into SYSCTLs with the use of a single
macro.
MFC after: 3 days
Sponsored by: Mellanox Technologies
mlx5en(4).
IFM_10G_LR and IFM_40G_ER4 media should be added only if the device
has the needed capability bit set for it.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
If generation is cleared due to hardware clock failure, check for it before
the divisor is used. Actually clear generation when failure occurs.
While there, stop doing the calculations inside the generation loop. Since
all members of mlx5e_clbr_point are used for calculations, get the
local copy of the structure and use it after generation stabilized.
Submitted by: kib@
MFC after: 3 days
Sponsored by: Mellanox Technologies
This fixes counting issues when the firmware resets the counter during
allocation of the counter set where the counter belongs.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The physical- and virtual- port counters might not reflect the amount
of data received after address filtering. Use the software counters
instead for rx_packets and rx_bytes to know exactly how much data
was received.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The current mapping of driver counters to netstat counters is wrong.
For example, a single jabber packet, will cause the Ierrs counter to
count three times.
The work for mapping the hardware and software counters to their right
place in netstat counters were already done in Linux, take that as is
to the FreeBSD driver.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
Else the SQs won't be properly released when closing rate-limited connections
leading to wrong state transitions on the SQ.
MFC after: 3 days
Sponsored by: Mellanox Technologies
When using CQE zipping, one can choose between RX hash and Checksum.
This will indicate the parameter on which a zipping session should be
stopped.
While porting the Linux code, Checksum was chosen. However, the value
of Checksum is not being used anywhere.
For the FreeBSD driver, we prefer to use the RX hash format which will
guarantee the RX hash value for all the mini CQEs.
While at it, make sure to initialize the Checksum value in the
decompressed CQE.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
After doing performance measurements, it seems like CQE zipping doesn't
have any significant benefit.
Moreover, we know that this feature is disabled by default on other
operating systems (Linux for example).
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
Split the function into the mlx5e_update_stats_locked() core and make
mlx5e_update_stats_work() call the _locked helper, similar to many other
places in the kernel. This improves the code structure, making the
locking clean.
Submitted by: kib@
MFC after: 3 days
Sponsored by: Mellanox Technologies
Instead of waiting for all jobs to be cancelled, simply close the completion
queue to prevent more completion events and let mlx5e_destroy_rq() cleanup
the remaining mbufs.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The number of priorities is always 8, while the number of traffic classes
supported can vary. While at it convert the sysctl node into an array.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Instead of reading Ethernet RFC 2819 pXtoYoctets counters from
hardware which counts RX octets, count tx_stat_pXtoYoctets from
Ethernet extended counters which counts TX octets.
TX jumbo counters should be accumulated only after the PPCNT
counters were fetched from hardware with their latest value.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
Add support for DIM based on Linux,
with some minor adaptions specific to FreeBSD.
Linux commit
f97c3dc3c0e8d23a5c4357d182afeef4c67f5c33
MFC after: 3 days
Sponsored by: Mellanox Technologies
Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.
Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.
Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028
This commit adds new if_alloc_domain() and if_alloc_dev() methods to
allocate ifnets. When called with a domain on a NUMA machine,
ifalloc_domain() will record the NUMA domain in the ifnet, and it will
allocate the ifnet struct from memory which is local to that NUMA
node. Similarly, if_alloc_dev() is a wrapper for if_alloc_domain
which uses a driver supplied device_t to call ifalloc_domain() with
the appropriate domain.
Note that the new if_numa_domain field fits in an alignment pad in
struct ifnet, and so does not alter the size of the structure.
Reviewed by: glebius, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19930
This allows efficient filtering at packet ingress on mlx5en.
Note that the packets are filtered (and potentially dropped) *before*
the driver has committed to (re)allocating an mbuf for the
packet. Dropped packets are treated essentially the same as an
error. Nothing is allocated, and the existing buffer is recycled. This
allows us to drop malicious packets at close to line rate with very
little CPU use.
Reviewed by: hselasky, slavash, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19063
The backpressure indication is implemented using an unlimited rate type of
mbuf send tag. When the upper layers typically the socket layer has obtained such
a tag, it can then query the destination driver queue for the current
amount of space available in the send queue.
A single mbuf send tag may be referenced multiple times and a refcount has been added
to the mlx5e_priv structure to track its usage. Because the send tag resides
in the mlx5e_channel structure, there is no need to wait for refcounts to reach
zero until the mlx4en(4) driver is detached. The channels structure is persistant
during the lifetime of the mlx5en(4) driver it belongs to and can so be accessed
without any need of synchronization.
The mlx5e_snd_tag structure was extended to contain a type field, because there are now
two different tag types which end up in the driver which need to be distinguished.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
In order to enable HW LRO, both the "hw_lro" sysctl in the mlx5en(4) config
space must be set, and the ifconfig(8) LRO capability must be set. Any other
settings will disable HW LRO.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add counter for all transmitted and received bytes. Currently only all
transmitted and received packets were counted. Fix description of RX LRO
counters while at it.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
By allocating the worst case size channel structure array
at attach time we can eliminate various NULL checks in the
fast path. And also reduce the chance for use-after-free
issues in the transmit fast path.
This change is also a requirement for implementing
backpressure support.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Writing to the debug stats variable must be locked,
else serialization will be lost which might cause
various kernel panics due to creating and destroying
sysctls out of order.
Make sure the sysctl context is initialized after freeing
the sysctl nodes, else they can be freed twice.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Inspect the ethernet compliance code to figure out actual cable type by reading
the PDDR module info register.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This can happen when connections are short lived and leads to
a firmware error printout in dmesg, syndrome 0x51cfb0, because
the SQ is in the wrong state.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
1) Don't exceed the drivers own hardcoded TX inline limit.
The blueflame register size can be much greater than the hardcoded limit
for inlining. Make sure we don't exceed the drivers own limit, because this
also means that the maximum number of TX fragments becomes invalid and
then memory size assumptions in the TX path no longer hold up.
2) Make sure the mlx5_query_min_inline() function returns an error code.
3) Header inlining is required when using TSO.
4) Catch failure to compute inline header size for TSO.
5) Add support for UDP when computing inline header size.
6) Fix for inlining issues with regards to DSCP.
Make sure we inline 4 bytes beyond the ethernet and/or
VLAN header to workaround a hardware bug extracting
the DSCP field from the IPv4/v6 header.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The hardware queues are deep enough currently and using the DRBR and associated
callbacks only leads to more task switching in the TX path. The is also a race
setting the queue_state which can lead to hung TX rings.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add support for setting the bandwidth limit as a ratio rather than in bits per
second. The ratio must be an integer number between 1 and 100 inclusivly.
Implement the needed firmware commands and SYSCTLs through mlx5en(4).
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Driver description should be set by core and not by the Ethernet driver.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This counter will represent transmitted packets which has more than
1518 octets.
The NIC has multiple hardware counters for counting transmitted
packets larger than 1518 octets. Each counter counts the packets
in specific range.
We accumulate those counters to have a single counter.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies