The new UAR API already offsets the UAR map pointer the mlx5en(4) is using.
While at it remove some no longer needed variables for keeping track
of the current BF offset.
This fixes a regression issue after the new UAR allocation APIs
were introduced.
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
This change include several changes as listed below all related to UAR.
UAR is a special PCI memory area where the so-called doorbell register and
blue flame register live. Blue flame is a feature for sending small packets
more efficiently via a PCI memory page, instead of using PCI DMA.
- All structures and functions named xxx_uuars were renamed into xxx_bfreg.
- Remove partially implemented Blueflame support from mlx5en(4) and mlx5ib.
- Implement blue flame register allocator.
- Use blue flame register allocator in mlx5ib.
- A common UAR page is now allocated by the core to support doorbell register
writes for all of mlx5en and mlx5ib, instead of allocating one UAR per
sendqueue.
- Add support for DEVX query UAR.
- Add support for 4K UAR for libmlx5.
Linux commits:
7c043e908a74ae0a935037cdd984d0cb89b2b970
2f5ff26478adaff5ed9b7ad4079d6a710b5f27e7
0b80c14f009758cefeed0edff4f9141957964211
30aa60b3bd12bd79b5324b7b595bd3446ab24b52
5fe9dec0d045437e48f112b8fa705197bd7bc3c0
0118717583cda6f4f36092853ad0345e8150b286
a6d51b68611e98f05042ada662aed5dbe3279c1e
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
The ratelimit tags may be shared, especially for unlimited TLS
traffic, and then the refcount is allowed to be greater than one
when freeing the send tag.
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Report EQE data upon CQ completion to let upper layers use this data.
Linux commit:
4e0e2ea1886afe8c001971ff767f6670312a9b04
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Enhance mlx5_core_create_cq() to get the command out buffer from the
callers to let them use the output.
Linux commit:
38164b771947be9baf06e78ffdfb650f8f3e908e
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
This both:
- makes ifconfig media line similar to that of other drivers.
- fixes ENXIO in case when paradoxical current media word is not registered.
Now e.g.
ifconfig mce0 -mediaopt txpause,rxpause
works by disabling pauses if enabled.
Sponsored by: Mellanox Technologies/NVidia Networking
MFC after: 1 week
Support for TLS rate limit tags is now in the tree, so this macro is
always defined.
Reviewed by: hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27020
This gives a more uniform API for send tag life cycle management.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000
Each TLS send tag in mlx5 contains a nested rate limit send tag.
Previously, the driver was calling internal functions to manage the
nested tag. Calling free methods directly instead of m_snd_tag_rele()
leaked send tag references and references on the ifp. Changes to use
the ifp methods for the nested tag for other methods are more cosmetic
but do simplify the code.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26996
- Add a new send tag type for a send tag that supports both rate
limiting (packet pacing) and TLS offload (mostly similar to D22669
but adds a separate structure when allocating the new tag type).
- When allocating a send tag for TLS offload, check to see if the
connection already has a pacing rate. If so, allocate a tag that
supports both rate limiting and TLS offload rather than a plain TLS
offload tag.
- When setting an initial rate on an existing ifnet KTLS connection,
set the rate in the TCP control block inp and then reset the TLS
send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
send tag. This allocates the TLS send tag asynchronously from a
task queue, so the TLS rate limit tag alloc is always sleepable.
- When modifying a rate on a connection using KTLS, look for a TLS
send tag. If the send tag is only a plain TLS send tag, assume we
failed to allocate a TLS ratelimit tag (either during the
TCP_TXTLS_ENABLE socket option, or during the send tag reset
triggered by ktls_output_eagain) and ignore the new rate. If the
send tag is a ratelimit TLS send tag, change the rate on the TLS tag
and leave the inp tag alone.
- Lock the inp lock when setting sb_tls_info for a socket send buffer
so that the routines in tcp_ratelimit can safely dereference the
pointer without needing to grab the socket buffer lock.
- Add an IFCAP_TXTLS_RTLMT capability flag and associated
administrative controls in ifconfig(8). TLS rate limit tags are
only allocated if this capability is enabled. Note that TLS offload
(whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
Reviewed by: gallatin, hselasky
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26691
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).
In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.
Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689
Previously the send tag was setup in the background, and all packets for
the given send tag were dropped until ready. Change this to be blocking
behaviour so that once the setsocketopt() for enabling TLS completes,
the socket is ready to send packets. Do this by simply flushing the
work request which does the needed firmware programming during send
tag allocation.
MFC after: 1 week
Sponsored by: Mellanox Technologies // Nvidia
Currently the linking order of the infiniband, IB, modules decide in which
order the clients are attached and detached. For example one IB client may
use resources from another IB client. This can lead to a potential deadlock
at shutdown. For example if the ipoib is unregistered after the ib_multicast
client is detached, then if ipoib is using multicast addresses a deadlock may
happen, because ib_multicast will wait for all its resources to be freed before
returning from the remove method.
Fix this by using module_xxx_order() instead of module_xxx().
Differential Revision: https://reviews.freebsd.org/D23973
MFC after: 1 week
Sponsored by: Mellanox Technologies
Changes in the mbuf layout regarding HW TLS, resulted in wrong detection
of starting mbuf. Use a boolean variable to handle this and pass m_adj()
the top mbuf, so that the packet header is adjusted correctly.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Remove TSO from the toggle mask when automatically disabled by TXCKSUM* in
various NIC drivers.
Reviewed by: hselasky, np, gallatin, jpaetzel
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D25120
Allow the TCP header to reside in the mbuf following the IP header.
Else such packets will get dropped.
Backtrace:
mlx5e_sq_xmit()
mlx5e_xmit()
ether_output_frame()
ether_output()
ip_output_send()
ip_output()
rip_output()
sosend_generic()
sosend()
kern_sendit()
sendit()
sys_sendto()
amd64_syscall()
fast_syscall_common()
MFC after: 1 week
Sponsored by: Mellanox Technologies
Typically the TCP/IP headers fit within the first mbuf and should not
trigger any of the error cases. Use unlikely() for these cases.
No functional change.
MFC after: 1 week
Sponsored by: Mellanox Technologies
When parsing the TCP/IP header in the fast path, make it clear by using
the const keyword, no fields are to be modified inside the transmitted
packet.
No functional change.
MFC after: 1 week
Sponsored by: Mellanox Technologies
For TLS v1.3 the 12 bytes of the initial vector, IV, should just be copied
as-is from the kernel to the gcm_iv field, which hold the first 4 bytes,
and the remaining 8 bytes go to the subsequent implicit_iv field.
There is no need to consider the byte order on the 12 bytes of IV like
initially done.
Sponsored by: Mellanox Technologies
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213
When using SACK it can happen there are multiple option headers.
Don't drop these packets, but instead limit the amount of inlining
to the maximum supported.
MFC after: 1 week
Sponsored by: Mellanox Technologies
This includes 14 bytes of ethernet header and 2 bytes of VLAN header.
This allows for making assumptions about the inline size limit
in the fast transmit path later on.
Use a signed integer variable to catch underflow.
MFC after: 1 week
Sponsored by: Mellanox Technologies
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
switch over to opt-in instead of opt-out for epoch.
Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks
itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch
when processing its packets.
Now this will create recursive entrance in epoch in >90% network
drivers, but will guarantee safeness of the transition.
Mark several tested drivers as IFF_KNOWSEPOCH.
Reviewed by: hselasky, jeff, bz, gallatin
Differential Revision: https://reviews.freebsd.org/D23674
The upstream OpenSSL changes only set the cipher for GCM since the
authentication is redundant, and changes to OCF will soon remove the
GCM authentication algorithm constants entirely for the same reason.
In addition, ktls_create_session() already validates these fields and
wouldn't pass down an invalid auth_algorithm value to any drivers or
ktls backends.
Reviewed by: hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D23671
Make completion event path mostly lockless using EPOCH(9).
Implement a mechanism using EPOCH(9) which allows us to make
the callback path for completion events mostly lockless.
Simplify draining callback events using epoch_wait().
While at it make sure all receive completion callbacks are
covered by the network EPOCH(9), because this is required
when calling if_input() and ether_input() after r357012.
Sponsored by: Mellanox Technologies
ConnectX-6 DX.
Currently TLS v1.2 and v1.3 with AES 128/256 crypto over TCP/IP (v4
and v6) is supported.
A per PCI device UMA zone is used to manage the memory of the send
tags. To optimize performance some crypto contexts may be cached by
the UMA zone, until the UMA zone finishes the memory of the given send
tag.
An asynchronous task is used manage setup of the send tags towards the
firmware. Most importantly setting the AES 128/256 bit pre-shared keys
for the crypto context.
Updating the state of the AES crypto engine and encrypting data, is
all done in the fast path. Each send tag tracks the TCP sequence
number in order to detect non-contiguous blocks of data, which may
require a dump of prior unencrypted data, to restore the crypto state
prior to wire transmission.
Statistics counters have been added to count the amount of TLS data
transmitted in total, and the amount of TLS data which has been dumped
prior to transmission. When non-contiguous TCP sequence numbers are
detected, the software needs to dump the beginning of the current TLS
record up until the point of retransmission. All TLS counters utilize
the counter(9) API.
In order to enable hardware TLS offload the following sysctls must be set:
kern.ipc.mb_use_ext_pgs=1
kern.ipc.tls.ifnet.permitted=1
kern.ipc.tls.enable=1
Sponsored by: Mellanox Technologies