The ratelimit tags may be shared, especially for unlimited TLS
traffic, and then the refcount is allowed to be greater than one
when freeing the send tag.
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
APIs that have deferred callbacks should have some kind of cleanup
function that callers can use to fence the callbacks. Otherwise things
like module unloading can lead to dangling function pointers, or worse.
The IB MR code is the only place that calls this function and had a
really poor attempt at creating this fence. Provide a good version in
the core code as future patches will add more places that need this
fence.
Linux commit:
e355477ed9e4f401e3931043df97325d38552d54
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Report EQE data upon CQ completion to let upper layers use this data.
Linux commit:
4e0e2ea1886afe8c001971ff767f6670312a9b04
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Enhance mlx5_core_create_cq() to get the command out buffer from the
callers to let them use the output.
Linux commit:
38164b771947be9baf06e78ffdfb650f8f3e908e
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
To prevent a hardware memory leak when a DEVX DCT object is destroyed
without calling drain DCT before, (e.g. under cleanup flow), need to
manage its creation and destruction via mlx5 core.
Linux commit:
c5ae1954c47d3fd8815bd5a592aba18702c93f33
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Make sure order of cleanup is exactly the opposite of initialization.
Linux commit:
f4044dac63e952ac1137b6df02b233d37696e2f5
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
This both:
- makes ifconfig media line similar to that of other drivers.
- fixes ENXIO in case when paradoxical current media word is not registered.
Now e.g.
ifconfig mce0 -mediaopt txpause,rxpause
works by disabling pauses if enabled.
Sponsored by: Mellanox Technologies/NVidia Networking
MFC after: 1 week
Support for TLS rate limit tags is now in the tree, so this macro is
always defined.
Reviewed by: hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27020
This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
Each TLS send tag in mlx5 contains a nested rate limit send tag.
Previously, the driver was calling internal functions to manage the
nested tag. Calling free methods directly instead of m_snd_tag_rele()
leaked send tag references and references on the ifp. Changes to use
the ifp methods for the nested tag for other methods are more cosmetic
but do simplify the code.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26996
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
Cleanup all host resources, SYSCTLs, MSIX vectors and memory used
by the host and only leave the device allocated memory behind, if any,
because it may still be in use, when the PCI remove function is called.
Else future probe calls may fail due to SYSCTLs already existing.
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).
In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.
Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689
Previously the send tag was setup in the background, and all packets for
the given send tag were dropped until ready. Change this to be blocking
behaviour so that once the setsocketopt() for enabling TLS completes,
the socket is ready to send packets. Do this by simply flushing the
work request which does the needed firmware programming during send
tag allocation.
MFC after: 1 week
Sponsored by: Mellanox Technologies // Nvidia
PDDR (Port Diagnostics Database Register) is used to read the physical
layer debug database, which contains helpful troubleshooting information
regarding the state of the link.
PDDR register can only be queried when PCAM register reports it as
supported in its register mask. A new helper macro was added to
the MLX5_CAP_* infrastructure in order to access this mask.
Sponsored by: Mellanox Technologies - Nvidia
MFC after: 1 week
Currently the linking order of the infiniband, IB, modules decide in which
order the clients are attached and detached. For example one IB client may
use resources from another IB client. This can lead to a potential deadlock
at shutdown. For example if the ipoib is unregistered after the ib_multicast
client is detached, then if ipoib is using multicast addresses a deadlock may
happen, because ib_multicast will wait for all its resources to be freed before
returning from the remove method.
Fix this by using module_xxx_order() instead of module_xxx().
Differential Revision: https://reviews.freebsd.org/D23973
MFC after: 1 week
Sponsored by: Mellanox Technologies
Use fence instead of barrier, which is optimized to take advantage of
the x86 TSO memory model.
Reviewed by: hselasky
Sponsored by: Mellanox Technologies
MFC after: 1 week
Changes in the mbuf layout regarding HW TLS, resulted in wrong detection
of starting mbuf. Use a boolean variable to handle this and pass m_adj()
the top mbuf, so that the packet header is adjusted correctly.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Remove TSO from the toggle mask when automatically disabled by TXCKSUM* in
various NIC drivers.
Reviewed by: hselasky, np, gallatin, jpaetzel
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D25120
Allow the TCP header to reside in the mbuf following the IP header.
Else such packets will get dropped.
Backtrace:
mlx5e_sq_xmit()
mlx5e_xmit()
ether_output_frame()
ether_output()
ip_output_send()
ip_output()
rip_output()
sosend_generic()
sosend()
kern_sendit()
sendit()
sys_sendto()
amd64_syscall()
fast_syscall_common()
MFC after: 1 week
Sponsored by: Mellanox Technologies
Typically the TCP/IP headers fit within the first mbuf and should not
trigger any of the error cases. Use unlikely() for these cases.
No functional change.
MFC after: 1 week
Sponsored by: Mellanox Technologies
When parsing the TCP/IP header in the fast path, make it clear by using
the const keyword, no fields are to be modified inside the transmitted
packet.
No functional change.
MFC after: 1 week
Sponsored by: Mellanox Technologies
For TLS v1.3 the 12 bytes of the initial vector, IV, should just be copied
as-is from the kernel to the gcm_iv field, which hold the first 4 bytes,
and the remaining 8 bytes go to the subsequent implicit_iv field.
There is no need to consider the byte order on the 12 bytes of IV like
initially done.
Sponsored by: Mellanox Technologies
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213
When using SACK it can happen there are multiple option headers.
Don't drop these packets, but instead limit the amount of inlining
to the maximum supported.
MFC after: 1 week
Sponsored by: Mellanox Technologies
This includes 14 bytes of ethernet header and 2 bytes of VLAN header.
This allows for making assumptions about the inline size limit
in the fast transmit path later on.
Use a signed integer variable to catch underflow.
MFC after: 1 week
Sponsored by: Mellanox Technologies