These were seemingly copied over from icl_soft.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D30268
I botched a few of the changes when rebasing the changes in
4b6ed0758d across the changes in
43bbae1948.
- Move the counter allocations into alloc_ofld_rxq().
- Free the counters when freeing an ofld rxq.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D30267
The CTL frontend might have provided a buffer that is smaller than the
FirstBurstLength and thus smaller than the amount of unsolicited data
included in the request PDU. Treat these transfers as an empty
transfer.
Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29940
A single union ctl_io can be reused across multiple transfers (in
particular by the ramdisk backend). On a reuse, the reservation
pointer would retain its value from the previous transfer tripping an
assertion.
Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29939
- Switch to allocating the cxgbei version of icl_pdu explicitly
as a separate refcounted object allocated via malloc/free
instead of storing it in the bhs mbuf prior to the bhs.
- Support the icl_conn_pdu_queue_cb() method to set a callback
on a PDU to be invoked when the PDU is freed.
- For ICL_NOCOPY buffers, use an external mbuf to manage the
storage for the buffer via m_extaddref(). Each external mbuf
holds a reference on the associated PDU, so the callback is
invoked once all of the external mbufs have been freed.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29910
- Only allocate 16K jumbo mbufs if the region of data to be
appended is sufficiently large, and use a loop.
- Use m_getm2() to allocate a chain for data less than 16K, or
if m_getjcl() fails.
- Use ENOMEM as the return value instead of '1' if the hook fails due
to a memory allocation error.
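The allocation strategy in the list above can be modeled in userland as follows. This is only a sketch: struct append_plan, plan_append() and the jumbo_avail parameter are hypothetical, and the 16K threshold for "sufficiently large" is an assumption. In the kernel the two paths would be m_getjcl() and m_getm2().

```c
#include <stddef.h>

#define JUMBO_16K	(16 * 1024)

/* Hypothetical summary of how a region of 'len' bytes would be covered. */
struct append_plan {
	size_t njumbo;		/* number of 16K jumbo clusters used */
	size_t chained;		/* bytes left for an ordinary mbuf chain */
};

/*
 * Use 16K clusters in a loop while at least 16K remains and the jumbo
 * allocator (modeled by 'jumbo_avail') still succeeds.  Whatever is left,
 * either a tail shorter than 16K or everything after a failed jumbo
 * allocation, goes to the chain allocator.
 */
static struct append_plan
plan_append(size_t len, size_t jumbo_avail)
{
	struct append_plan p = { 0, 0 };

	while (len >= JUMBO_16K && jumbo_avail > 0) {
		p.njumbo++;
		jumbo_avail--;
		len -= JUMBO_16K;
	}
	p.chained = len;
	return (p);
}
```

A real hook would return ENOMEM if the chain allocation also fails.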
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29909
A CAM target layer I/O CCB can use a S/G list of virtual address ranges
to describe its data buffer. This change adds zero-copy receive support
for such requests.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29908
As a result, CPL_FW4_ACK now returns credits for these work requests.
To support this, page pod work requests are now constructed in special
mbufs similar to "raw" mbufs used for NIC TLS in plain TX queues.
These special mbufs are stored in the ulp_pduq and dispatched in order
with PDU work requests.
Sponsored by: Chelsio Communications
Discussed with: np
Differential Revision: https://reviews.freebsd.org/D29904
The _ext event notification includes the address being added/removed and
that gives the driver an easy way to ignore non-IPv6 addresses. Remove
'tom' from the handler's name while here; it was moved out of t4_tom a
long time ago.
MFC after: 1 week
Sponsored by: Chelsio Communications
There is no need to panic in if_transmit if the checksums requested are
inconsistent with the frame being transmitted. This typically indicates
that the kernel and driver were built with different INET/INET6 options,
or there is some other kernel bug. The driver should just throw away
the requests that it doesn't understand and move on.
MFC after: 1 week
Sponsored by: Chelsio Communications
Add suspend/resume callbacks to the driver and a live reset built around
them. This commit covers the basic NIC and future commits will expand
this functionality to other stateful parts of the chip. Suspend and
resume operate on the chip (the t?nex nexus device) and affect all its
ports. It is not possible to suspend/resume or reset individual ports.
All these operations can be performed on a running NIC. A reset will
look like a link bounce to the networking stack.
Here are some ways to exercise this functionality:
/* Manual suspend and resume. */
# devctl suspend t6nex0
# devctl resume t6nex0
/* Manual reset. */
# devctl reset t6nex0
/* Manual reset with driver sysctl. */
# sysctl dev.t6nex.0.reset=1
/* Automatic adapter reset on any fatal error. */
# hw.cxgbe.reset_on_fatal_err=1
Suspend disables the adapter (DMA, interrupts, and the port PHYs) and
marks the hardware as unavailable to the driver. All ifnets associated
with the adapter are still visible to the kernel but operations that
require hardware interaction will fail with ENXIO. All ifnets report
link-down while the adapter is suspended.
Resume will reattach to the card, reconfigure it as before, and recreate
the queues servicing the existing ifnets. The ifnets are able to send
and receive traffic as soon as the link comes back up.
Reset is roughly the same as a suspend and a resume with at least one of
these events in between: D0->D3Hot->D0, FLR, PCIe link retrain.
MFC after: 1 month
Relnotes: yes
Sponsored by: Chelsio Communications
The driver uses both software resources (locks, callouts, memory for
descriptors and for bookkeeping, sysctls, etc.) and hardware resources
(VIs, DMA queues, TCAM entries, etc.) to operate the NIC. This commit
splits the single *_ALLOCATED flag used to track all these resources
into separate *_SW_ALLOCATED and *_HW_ALLOCATED flags.
This is the simplified pseudocode that now applies to most queues (foo
can be ctrlq/txq/rxq/ofld_txq/ofld_rxq):
/* Idempotent */
alloc_foo
{
	if (!SW_ALLOCATED)
		init_iq/init_eq/init_fl		no-fail sw init
		alloc_iq_fl/alloc_eq/alloc_wrq	may-fail sw alloc
		add_foo_sysctls, etc.		no-fail post-alloc items
	if (!HW_ALLOCATED)
		alloc_iq_fl_hwq/alloc_eq_hwq	hw resource allocation
}
/* Idempotent */
free_foo
{
	if (HW_ALLOCATED)
		free_iq_fl_hwq/free_eq_hwq	release hw resources
	if (SW_ALLOCATED)
		free_iq_fl/free_eq/free_wrq	release sw resources
}
The routines that take the driver to FULL_INIT_DONE and VI_INIT_DONE and
back are now all idempotent. The quiesce routines pay attention to the
HW_ALLOCATED flag and will not wait on the hardware for pidx/cidx
updates and other completions if this flag is not set.
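A minimal userland sketch of the flag split, for illustration only. struct foo_q, the Q_* flags and the hw_available argument are hypothetical stand-ins and not the driver's definitions. The point is that a repeat call performs only the steps that are still missing.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical flags modeled on the SW_ALLOCATED / HW_ALLOCATED split. */
#define Q_SW_ALLOCATED	0x1
#define Q_HW_ALLOCATED	0x2

struct foo_q {
	unsigned int flags;
	void *desc;		/* software resource: descriptor memory */
	int cntxt_id;		/* hardware resource: context id, -1 if none */
};

/* Idempotent: only allocates what is not already allocated. */
static int
alloc_foo(struct foo_q *q, bool hw_available)
{
	if (!(q->flags & Q_SW_ALLOCATED)) {
		q->desc = calloc(1, 4096);
		if (q->desc == NULL)
			return (-1);
		q->cntxt_id = -1;
		q->flags |= Q_SW_ALLOCATED;
	}
	if (!(q->flags & Q_HW_ALLOCATED) && hw_available) {
		q->cntxt_id = 42;	/* stand-in for a firmware allocation */
		q->flags |= Q_HW_ALLOCATED;
	}
	return (0);
}

/* Idempotent: releases the hardware side only, as a suspend might. */
static void
free_foo_hw(struct foo_q *q)
{
	if (q->flags & Q_HW_ALLOCATED) {
		q->cntxt_id = -1;
		q->flags &= ~Q_HW_ALLOCATED;
	}
}

/* Idempotent: releases hardware first, then software resources. */
static void
free_foo(struct foo_q *q)
{
	free_foo_hw(q);
	if (q->flags & Q_SW_ALLOCATED) {
		free(q->desc);
		q->desc = NULL;
		q->flags &= ~Q_SW_ALLOCATED;
	}
}
```

In this model a suspend corresponds to free_foo_hw() and a resume to calling alloc_foo() again, which leaves the software state untouched.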
MFC after: 1 month
Sponsored by: Chelsio Communications
There are two kinds of routines in the driver that read statistics from
the hardware: the cxgbe_* variants read the per-port MPS/MAC registers
and the vi_* variants read the per-VI registers. They can be called
from the 1Hz callout or if_get_counter. All stats collection now takes
place under the callout lock and there is a new flag to indicate that
these routines should not access any hardware register.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
A doomed VI does not have a valid ifnet.
Reported by: Jithesh Arakkan @ Chelsio
Reviewed by: np
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29662
The mbuf allocated could be a chain and must be freed with m_freem.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29579
There is no change in the source of the stats (t4_get_port_stats or
t4_get_vi_stats) but the per-port callout is gone.
Sponsored by: Chelsio Communications
Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D29527
This fixes a panic due to stale so->so_proto if t4_tom is unloaded and
one or more connections that were previously offloaded are still around
in TIME_WAIT state.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29503
This avoids some atomics by using counter_u64 for TX and relying on
existing single-threading (single ithread per rxq) for RX.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29383
This type mirrors struct sge_ofld_rxq and holds state for TCP offload
transmit queues. Currently it only holds a work queue but will
include additional state in future changes.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29382
Remove unused #includes of LinuxKPI headers noticed while trying to
solve LinuxKPI struct net_device and related functions.
Neither netdevice.h nor inetdevice.h nor notifier.h seem to be needed.
This takes cxgbe(4) out of the picture of D29366.
Sponsored-by: The FreeBSD Foundation
MFC-after: 2 weeks
Reviewed-by: np
X-D-R: D29366 (extracted as further cleanup)
Differential Revision: https://reviews.freebsd.org/D29432
The hw.cxgbe.kern_tls tunable was used for this in the past and if it
was set then all T6 adapters would be configured for NIC TLS operation
and could not be reconfigured for TOE without a reload. With this
change ifconfig can be used to manipulate toe and txtls caps like any
other caps. hw.cxgbe.kern_tls continues to work as usual but its
effects are not permanent any more.
* Enable nic_ktls_ofld in the default configuration file and use the
firmware instead of direct register manipulation to apply/rollback
NIC TLS configuration. This allows the driver to switch the hardware
between TOE and NIC TLS mode in a safe manner. Note that the
configuration is adapter-wide and not per-port.
* Remove the kern_tls config file as it works with 100G T6 cards only
and leads to firmware crashes with 25G cards. The configurations
included with the driver (with the exception of the FPGA configs) are
supposed to work with all adapters.
Reported by: Veeresh U.K. at Chelsio
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D29291
This avoids mixing the use of two different enums which modern C
compilers warn about.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29301
While here, make sure only the PF driver attempts to program the global
RSS key (with options RSS). The VF driver doesn't have access to those
device registers.
MFC after: 1 week
Sponsored by: Chelsio Communications
A repeat call will recreate the memory windows in the hardware and move
them to their last-known positions without repeating any of the software
initialization.
MFC after: 1 week
Sponsored by: Chelsio Communications
Completions for crypto requests on port 1 can sometimes return a stale
cookie value due to a firmware bug. Disable requests on port 1 by
default on affected firmware.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26581
These fixes are only relevant for requests on the second port. In
some cases, the crypto completion data, completion message, and
receive descriptor could be written in the wrong order.
- Add a separate rx_channel_id that is a copy of the port's rx_c_chan
and use it when an RX channel ID is required in crypto requests
instead of using the tx_channel_id.
- Set the correct rx_channel_id in the CPL_RX_PHYS_ADDR used to write
the crypto result.
- Set the FID to the first rx queue ID on the adapter rather than the
queue ID of the first rx queue for the port.
- While here, use tx_chan to set the tx_channel_id though this is
identical to the previous value.
Reviewed by: np
Reported by: Chelsio QA
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29175
Firmware access from t4_attach takes place without any synchronization.
The driver should not panic (debug kernels) if something goes wrong in
early communication with the firmware. It should still load so that
it's possible to poke around with cxgbetool.
MFC after: 1 week
Sponsored by: Chelsio Communications
T5 and above have extra bits for the optional filter fields. This is a
correctness issue and not just a waste because a filter mode valid on a
T4 (36b) may not be valid on a T5+ (40b).
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Allow the filter mask (aka the hashfilter mode when hashfilters are
in use) to be set any time it is safe to do so. The requested mask
must be a subset of the filter mode already in effect. The driver will not change
the mode or ingress config just to support a new mask.
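The subset requirement amounts to a single bit test. A sketch, assuming mode and mask are plain bitmasks with one bit per optional field (the function name is hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if every bit requested in 'mask' is already enabled in 'mode'. */
static bool
mask_is_subset_of_mode(uint32_t mode, uint32_t mask)
{
	return ((mask & ~mode) == 0);
}
```

This checks only the subset relation. Whether it is currently safe to change the mask is a separate question that the sketch does not model.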
MFC after: 2 weeks
Sponsored by: Chelsio Communications
1. Query the firmware for filter mode, mask, and related ingress config
instead of trying to figure them out from hardware registers. Read
configuration from the registers only when the firmware does not
support this query.
2. Use the firmware to set the filter mode. This is the correct way to
do it and is more flexible as well. The filter mode (and associated
ingress config) can now be changed any time it is safe to do so.
The user can specify a subset of a valid mode and the driver will
enable enough bits to make sure that the mode is maxed out -- that
is, it is not possible to set another bit without exceeding the
total width for optional filter fields. This is a hardware
requirement that was not enforced by the driver previously.
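The "maxed out" requirement can be illustrated as follows. The field widths in this sketch are made up for the example and are not the hardware's table. Only the 36-bit and 40-bit totals come from the T4 vs. T5+ distinction.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative widths in bits, one per mode bit.  Not the real table. */
static const unsigned int field_width[] = { 17, 16, 8, 8, 3, 1 };
#define NFIELDS	(sizeof(field_width) / sizeof(field_width[0]))

/* Total width of all optional fields enabled in 'mode'. */
static unsigned int
mode_width(uint32_t mode)
{
	unsigned int total = 0;

	for (size_t i = 0; i < NFIELDS; i++)
		if (mode & (1u << i))
			total += field_width[i];
	return (total);
}

/*
 * A mode is maxed out if it fits in 'max_bits' and no field that is
 * currently disabled could be enabled without exceeding 'max_bits'.
 */
static bool
mode_is_maxed_out(uint32_t mode, unsigned int max_bits)
{
	unsigned int w = mode_width(mode);

	if (w > max_bits)
		return (false);
	for (size_t i = 0; i < NFIELDS; i++)
		if (!(mode & (1u << i)) && w + field_width[i] <= max_bits)
			return (false);
	return (true);
}
```

With these widths, mode 0x13 is maxed out at 36 bits but not at 40 bits, which is why the total width has to match the chip.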
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Read the PF-only hardware settings directly in get_params__post_init.
Split the rest into two routines used by both the PF and VF drivers: one
that reads the SGE rx buffer configuration and another that verifies
miscellaneous hardware configuration.
MFC after: 1 week
Sponsored by: Chelsio Communications
These errors do not clear so to NULL, so the existing check was
treating these failures as success. The rest of do_pass_establish()
then tried to use the listen socket as if it was a connection socket
newly created by syncache_expand().
In addition, for negative return values, do not send a RST to the
peer.
Reported by: Sony Arpita Das @ Chelsio
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D28243
When refill_fl() fails to allocate a large (9/16KB) mbuf cluster, it
falls back to safe (4KB) ones. But it still saved the original
fl->zidx into sd->zidx instead of fl->safe_zidx. This caused problems
with later use of that cluster, including memory and/or data
corruption.
While there, make refill_fl() use the safe zone for all subsequent
clusters in the call, since it is unlikely that further large
allocations will succeed.
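A simplified model of the fixed behavior. The structures and the demo allocator are hypothetical. Only the zidx/safe_zidx naming follows the description above. The descriptor records the zone the cluster actually came from, and once the large zone fails the rest of the call stays on the safe zone.

```c
#include <stdbool.h>

struct model_fl {
	int zidx;		/* preferred (large) zone */
	int safe_zidx;		/* fallback (4KB) zone */
};

struct model_sdesc {
	int zidx;		/* zone the cluster was allocated from */
};

/* Hypothetical allocator hook: true if zone 'zidx' can supply a cluster. */
typedef bool (*zone_alloc_fn)(int zidx);

/* Fills up to 'n' descriptors and returns how many were filled. */
static int
model_refill(const struct model_fl *fl, struct model_sdesc *sd, int n,
    zone_alloc_fn try_alloc)
{
	int zidx = fl->zidx;
	int filled = 0;

	for (int i = 0; i < n; i++) {
		if (!try_alloc(zidx)) {
			if (zidx == fl->safe_zidx)
				break;
			/* Stay on the safe zone for the rest of the call. */
			zidx = fl->safe_zidx;
			if (!try_alloc(zidx))
				break;
		}
		sd[i].zidx = zidx;	/* the zone used, not fl->zidx */
		filled++;
	}
	return (filled);
}

/* Demo allocator: zone 2 (large) succeeds once, zone 0 always succeeds. */
static int demo_large_left = 1;

static bool
demo_alloc(int zidx)
{
	if (zidx == 2) {
		if (demo_large_left == 0)
			return (false);
		demo_large_left--;
	}
	return (true);
}
```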
MFC after: 3 days
Sponsored by: iXsystems, Inc.
Reviewed by: np, jhb
Differential Revision: https://reviews.freebsd.org/D28716
- The behavior implemented in r362905 resulted in delayed transmission
of packets in some cases, causing performance issues. Use a different
heuristic to predict tx requests.
- Add a tunable/sysctl (hw.cxgbe.tx_coalesce) to disable tx coalescing
entirely. It can be changed at any time. There is no change in
default behavior.
Originally IFCAP_NOMAP meant that the mbuf has an external storage pointer
that points to an unmapped address. Then, this was extended to an array of
such pointers. Then, such mbufs were augmented with a header/trailer.
Basically, extended mbufs keep being extended, and the set of features is
subject to change. The new name should be generic enough to avoid further
renaming.
The handshake timer can race with another thread sending a FIN or RST
to close a TOE TLS socket. Just bail from the timer without
rescheduling if the connection is closed when the timer fires.
Reported by: Sony Arpita Das @ Chelsio QA
Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D27583
The issue was found while building cxgbe with gcc 10 (in illumos);
the array subscript check warns about an out-of-bounds
access.
See also: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
By default, if a TOE TLS socket stops receiving data for more than 5
seconds, revert the connection back to plain TOE mode. This provides
a fallback if the userland SSL library does not support KTLS. In
addition, for client TLS 1.3 sockets using connect(), the TOE socket
blocks before the handshake has completed since the socket option is
only invoked for the final handshake.
The timeout defaults to 5 seconds, but can be changed at boot via the
hw.cxgbe.toe.tls_rx_timeout tunable or for an individual interface via
the dev.<nexus>.toe.tls_rx_timeout sysctl.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27470
This includes mbufs waiting for data from sendfile() I/O requests, or
mbufs awaiting encryption for KTLS.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27469
If TOE TLS is requested for an unsupported cipher suite or TLS
version, disable TLS processing and fall back to plain TOE. In
addition, if an error occurs when saving the decryption keys in the
card's memory, disable TLS processing and fall back to plain TOE.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27468
If a TOE TLS socket ends up using an unsupported TLS version or
ciphersuite, it must be downgraded to a "plain" TOE socket with TLS
encryption/decryption performed on the host. The previous
implementation of this fallback was incomplete and resulted in hung
connections.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27467
It is common for freelists to be starving when a netmap application
stops. Mailbox commands to free queues can hang in such a situation.
Avoid that by not freeing the queues when netmap is switched off.
Instead, use an alternate method to stop the queues without releasing
the context ids. If netmap is enabled again later then the same queue
is reinitialized for use. Move alloc_nm_rxq and txq to t4_netmap.c
while here.
MFC after: 1 week
Sponsored by: Chelsio Communications
r367917 fixed the backpressure on the netmap rxq being stopped but that
doesn't help if some other netmap rxq is starved (because it is stopping
too although the driver doesn't know this yet) and blocks the pipeline.
An alternate fix that works in all cases will be checked in instead.
Sponsored by: Chelsio Communications
The netmap application using the driver is responsible for replenishing
the receive freelists and they may be totally depleted when the
application exits. Packets in flight, if any, might block the pipeline
in case there aren't enough buffers left in the freelist. Avoid this by
filling up the freelists with a driver allocated buffer.
MFC after: 1 week
Sponsored by: Chelsio Communications
TCP SYNs in inner traffic will hit hardware listeners when VXLAN/NVGRE
rx parsing is enabled in the chip. t4_tom should pass on these SYNs to
the kernel and let it deal with them as if they arrived on the non-TOE
path.
Reported by: Sony at Chelsio
MFC after: 1 week
Sponsored by: Chelsio Communications
Otherwise, a socket can have a non-NULL tp->tod while TF_TOE is clear.
In particular, if a newly accepted socket falls back to non-TOE due to
an active open failure, the non-TOE socket will still have tp->tod set
even though TF_TOE is clear.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27028
The MAC address can be set with the optional mac-addr property in the VF
section of the iovctl.conf(5) used to instantiate the VFs.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Query the firmware for the MAC address set by the PF for the VF and use
it instead of the firmware generated MAC if it's available.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
This fixes a potential crash in firmware 1.25.0.0 on the passive open
side during TOE operation.
Obtained from: Chelsio Communications
MFC after: 1 week
Sponsored by: Chelsio Communications
This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.
Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.
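A userland model of the release-if-last pattern using C11 atomics. This is a sketch of the idea and not the kernel's refcount(9) implementation. The function names and the "switched" flag (standing in for the switch to vmspace0) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Drop the reference only if it is the last one (1 -> 0). */
static bool
model_release_if_last(atomic_uint *count)
{
	unsigned int expected = 1;

	return (atomic_compare_exchange_strong(count, &expected, 0));
}

/* Unconditionally drop one reference; true if it was the last one. */
static bool
model_release(atomic_uint *count)
{
	return (atomic_fetch_sub(count, 1) == 1);
}

/*
 * Exit path: a sole owner never switches away.  A sharer switches first
 * and can then release its reference unconditionally.  Returns true if
 * a switch was needed; *was_last reports whether the caller dropped the
 * final reference.
 */
static bool
model_exit(atomic_uint *refs, bool *was_last)
{
	if (model_release_if_last(refs)) {
		*was_last = true;
		return (false);
	}
	/* Stand-in for switching to a different address space. */
	*was_last = model_release(refs);
	return (true);
}
```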
Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057
- Get the number of classes from chip_params.
- Get the number of ethofld tids from the firmware.
- Do not let tcp_ratelimit allocate all traffic classes.
Sponsored by: Chelsio Communications
In certain edge cases, the NIC might have only received a partial TLS
record which it needs to return to the driver. For example, if the
local socket was closed while data was still in flight, a partial TLS
record might be pending when the connection is closed. Receiving a
RST in the middle of a TLS record is another example. When this
happens, the firmware returns the partial TLS record as plain TCP
data via CPL_RX_DATA. Handle these requests by returning an error to
OpenSSL (via so_error for KTLS or via an error TLS record header for
the older Chelsio OpenSSL interface).
Reported by: Sony Arpita Das @ Chelsio
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26800
The firmware can allocate ingress and egress context ids anywhere from
its configured range. Size the iq/eq maps to match the entire range
instead of assuming that the firmware always allocates the first
available context id.
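A sketch of a lookup map sized for the whole configured range and indexed by offset from its start. struct ctx_map and its helpers are hypothetical names used for illustration and are not the driver's data structures.

```c
#include <stddef.h>
#include <stdlib.h>

struct ctx_map {
	unsigned int start;	/* first context id in the configured range */
	unsigned int size;	/* number of ids in the range */
	void **slot;		/* slot[id - start] is the queue for 'id' */
};

static int
ctx_map_init(struct ctx_map *m, unsigned int start, unsigned int size)
{
	m->slot = calloc(size, sizeof(*m->slot));
	if (m->slot == NULL)
		return (-1);
	m->start = start;
	m->size = size;
	return (0);
}

/* Accepts any id in the range, not just the lowest free one. */
static int
ctx_map_set(struct ctx_map *m, unsigned int id, void *q)
{
	if (id < m->start || id - m->start >= m->size)
		return (-1);
	m->slot[id - m->start] = q;
	return (0);
}

static void *
ctx_map_get(const struct ctx_map *m, unsigned int id)
{
	if (id < m->start || id - m->start >= m->size)
		return (NULL);
	return (m->slot[id - m->start]);
}
```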
Reported by: Baptiste Wicht @ Verisign
MFC after: 1 week
Sponsored by: Chelsio Communications
Flow control was disabled during initial TOE TLS development to
work around a hang (and to match the Linux TOE TLS support for T6).
The rest of the TOE TLS code maintained credits as if flow control were
enabled, a behavior inherited from before the workaround was added,
with the exception that the receive window was allowed to go negative.
This negative receive window handling (rcv_over) was because I hadn't
realized the full implications of disabling flow control.
To clean this up, re-enable flow control on TOE TLS sockets. The
existing TPF_FORCE_CREDITS workaround is sufficient for the original
hang. Now that flow control is enabled, remove the rcv_over
workaround and instead assert that the receive window never goes
negative matching plain TCP TOE sockets.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26799
r365732 was the first attempt to get an accurate count but it was
writing to some read-only registers to clear them and that obviously
didn't work. Instead, note the counter's value when it is supposed to
be cleared and subtract it from future readings.
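The baseline technique in a few lines. This is a sketch with hypothetical names in which the hardware reading is passed in by the caller. Unsigned subtraction keeps the result correct even if the hardware counter wraps after the baseline was taken.

```c
#include <stdint.h>

struct soft_clear_counter {
	uint64_t baseline;	/* hardware value noted at the last "clear" */
};

/* "Clear" without writing to the read-only register. */
static void
soft_counter_clear(struct soft_clear_counter *c, uint64_t hw_now)
{
	c->baseline = hw_now;
}

/* Value to report: events since the last clear. */
static uint64_t
soft_counter_read(const struct soft_clear_counter *c, uint64_t hw_now)
{
	return (hw_now - c->baseline);
}
```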
dev.<port>.stats.rx_fcs_error should not be serviced from the MPS
register for T6.
The stats.* sysctls should all use T5_PORT_REG for T5 and above. This
must have been missed in the initial T5 support years ago. Fix it while
here.
MFC after: 3 days
Sponsored by: Chelsio Communications
These kinds of drops come for free in the sense that they do not use the
filter TCAM or any other resource that wouldn't normally be used during
rx. Frames dropped by the hardware get counted in the MAC's rx stats
but are not delivered to the driver.
hw.cxgbe.attack_filter
Set to 1 to enable the "attack filter". Default is 0. The attack
filter will drop an incoming frame if any of these conditions is true:
src ip/ip6 == dst ip/ip6; tcp and src/dst ip is not unicast; src/dst ip
is loopback (127.x.y.z); src ip6 is not unicast; src/dst ip6 is loopback
(::1/128) or unspecified (::/128); tcp and src/dst ip6 is mcast
(ff00::/8).
hw.cxgbe.drop_ip_fragments
Set to 1 to drop all incoming IP fragments. Default is 0. Note that
this drops valid frames.
hw.cxgbe.drop_pkts_with_l2_errors
Set to 1 to drop incoming frames with Layer 2 length or checksum errors.
Default is 1.
hw.cxgbe.drop_pkts_with_l3_errors
Set to 1 to drop incoming frames with IP version, length, or checksum
errors. Default is 0.
hw.cxgbe.drop_pkts_with_l4_errors
Set to 1 to drop incoming frames with Layer 4 length, checksum, or other
errors. Default is 0.
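For reference, the IPv4 subset of the attack filter conditions listed under hw.cxgbe.attack_filter can be written as a predicate. This only models the rules in software, since the actual drops happen in hardware. Addresses are in host byte order, and treating "not unicast" as multicast (224.0.0.0/4) or limited broadcast is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

static bool
ip4_is_loopback(uint32_t a)
{
	return ((a >> 24) == 127);		/* 127.x.y.z */
}

/* Assumption: non-unicast means multicast or 255.255.255.255. */
static bool
ip4_is_unicast(uint32_t a)
{
	return ((a >> 28) != 0xe && a != 0xffffffffu);
}

/* True if the modeled attack filter would drop this IPv4 frame. */
static bool
attack_filter_drops_ip4(uint32_t src, uint32_t dst, bool is_tcp)
{
	if (src == dst)
		return (true);
	if (is_tcp && (!ip4_is_unicast(src) || !ip4_is_unicast(dst)))
		return (true);
	if (ip4_is_loopback(src) || ip4_is_loopback(dst))
		return (true);
	return (false);
}
```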
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).
In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.
Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689
These tunables can only be set to a valid cluster size (2K, 4K, 9K, or
16K) as documented in the man page. Anything else could lead to a
panic on interface up.
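The check boils down to accepting exactly the four cluster sizes. A sketch with a hypothetical function name; the byte values correspond to the 2K, 4K, 9K and 16K sizes named above. What the driver does with a rejected value is not modeled here.

```c
#include <stdbool.h>

/* True only for the four supported cluster sizes, in bytes. */
static bool
is_valid_cluster_size(int size)
{
	switch (size) {
	case 2 * 1024:
	case 4 * 1024:
	case 9 * 1024:
	case 16 * 1024:
		return (true);
	default:
		return (false);
	}
}
```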
Reported by: mav@
MFC after: 1 week
Sponsored by: Chelsio Communications
ccr(4) uses software to handle GCM and CCM requests not supported by
the crypto engine (e.g. with only AAD and no payload). This change
adds a fallback for a few more requests such as those with more SGL
entries than can fit in a work request (this can happen for GCM when
decrypting a TLS record split across 15 or more packets).
Reported by: Chelsio QA
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26582
Bind the netmap tx queues to a special '0xff' scheduling class which
makes the firmware skip some processing related to rate limiting on the
outgoing traffic. Future firmwares will do this automatically.
MFC after: 1 week
Sponsored by: Chelsio Communications
- Only active netmap receive queues should be in the RSS lookup table.
- The RSS table should be restored for NIC operation when the last
active netmap queue is switched off, not the first one.
- Support repeated netmap ON/OFF on a subset of the queues. This works
whether the queues being enabled and disabled are the only ones
active or not. Some kring indexes have to be reset in the driver for
the second case.
MFC after: 1 week
Sponsored by: Chelsio Communications
This allows the PF interfaces to communicate with the VF interfaces over
the internal switch in the ASIC. Fix the GL limits for VM work requests
while here.
MFC after: 3 days
Sponsored by: Chelsio Communications
Hardware assistance includes checksumming (tx and rx), TSO, and RSS on
the inner traffic in a VXLAN tunnel.
Relnotes: Yes
Sponsored by: Chelsio Communications
crypto(9) functions can now be used on buffers composed of an array of
vm_page_t structures, such as those stored in an unmapped struct bio. It
requires the running kernel to support the direct memory map, so not all
architectures can use it.
Reviewed by: markj, kib, jhb, mjg, mat, bcr (manpages)
MFC after: 1 week
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D25671
Rx is more efficient within the chip when the receive buffer size
matches the TLS PDU size.
MFC after: 3 days
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26127
to coalesce tx work requests.
Note that Coverity will still treat this as an out-of-bounds access. We
do want to compare 16B starting from ethmacdst but cmp_l2hdr was
going beyond that by 2B.
cmp_l2hdr was introduced in r362905.
Reported by: Coverity (CID 1430284)
Sponsored by: Chelsio Communications
- Ask the firmware for the number of frames that can be stuffed in one
work request.
- Modify mp_ring to increase the likelihood of tx coalescing when there
are just one or two threads that are doing most of the tx. Add teeth
to the abdication mechanism by pushing the consumer lock into mp_ring.
This reduces the likelihood that a consumer will get stuck with all
the work even though it is above its budget.
- Add support for coalesced tx WR to the VF driver. This, with the
changes above, results in a 7x improvement in the tx pps of the VF
driver for some common cases. The firmware vets the L2 headers
submitted by the VF driver and it's a big win if the checks are
performed for a batch of packets and not each one individually.
Reviewed by: jhb@
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25454
- Move temporary sglists into the session structure and protect them
with a per-session lock instead of a per-adapter lock.
- Retire an unused session field, and move a debugging field under
INVARIANTS to avoid using the session lock for completion handling
when INVARIANTS isn't enabled.
- Use counter_u64 for per-adapter statistics.
Note that this helps for cases where multiple sessions are used
(e.g. multiple IPsec SAs or multiple KTLS connections). It does not
help for workloads that use a single session (e.g. a single GELI
volume).
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25457
In addition to reducing lines of code, this also ensures that the full
allocation is always zeroed avoiding possible bugs with incorrect
lengths passed to explicit_bzero().
Suggested by: cem
Reviewed by: cem, delphij
Approved by: csprng (cem)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25435
There were quite a few places where port_info was being accessed only to
get to the adapter.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25432
fibX_lookup_nh_ext().
fibX_lookup_nh_ represents the pre-epoch generation of the fib KPI,
providing fewer guarantees about pointer validity and requiring
on-stack data copying.
Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D24975
Remove TSO from the toggle mask when automatically disabled by TXCKSUM* in
various NIC drivers.
Reviewed by: hselasky, np, gallatin, jpaetzel
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D25120
Some crypto consumers such as GELI and KTLS for file-backed sendfile
need to store their output in a separate buffer from the input.
Currently these consumers copy the contents of the input buffer into
the output buffer and queue an in-place crypto operation on the output
buffer. Using a separate output buffer avoids this copy.
- Create a new 'struct crypto_buffer' describing a crypto buffer
containing a type and type-specific fields. crp_ilen is gone,
instead buffers that use a flat kernel buffer have a cb_buf_len
field for their length. The length of other buffer types is
inferred from the backing store (e.g. uio_resid for a uio).
Requests now have two such structures: crp_buf for the input buffer,
and crp_obuf for the output buffer.
- Consumers now use helper functions (crypto_use_*,
e.g. crypto_use_mbuf()) to configure the input buffer. If an output
buffer is not configured, the request still modifies the input
buffer in-place. A consumer uses a second set of helper functions
(crypto_use_output_*) to configure an output buffer.
- Consumers must request support for separate output buffers when
creating a crypto session via the CSP_F_SEPARATE_OUTPUT flag and are
only permitted to queue a request with a separate output buffer on
sessions with this flag set. Existing drivers already reject
sessions with unknown flags, so this permits drivers to be modified
to support this extension without requiring all drivers to change.
- Several data-related functions now have matching versions that
operate on an explicit buffer (e.g. crypto_apply_buf,
crypto_contiguous_subsegment_buf, bus_dma_load_crp_buf).
- Most of the existing data-related functions operate on the input
buffer. However crypto_copyback always writes to the output buffer
if a request uses a separate output buffer.
- For the regions in input/output buffers, the following conventions
are followed:
- AAD and IV are always present in input only and their
fields are offsets into the input buffer.
- payload is always present in both buffers. If a request uses a
separate output buffer, it must set a new crp_payload_start_output
field to the offset of the payload in the output buffer.
- digest is in the input buffer for verify operations, and in the
output buffer for compute operations. crp_digest_start is relative
to the appropriate buffer.
- Add a crypto buffer cursor abstraction. This is a more general form
of some bits in the cryptosoft driver that tried to always use uio's.
However, compared to the original code, this avoids rewalking the uio
iovec array for requests with multiple vectors. It also avoids
allocating an iovec array for mbufs and populating it by instead
walking the mbuf chain directly.
- Update the cryptosoft(4) driver to support separate output buffers
making use of the cursor abstraction.
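
The cursor idea above can be illustrated with a small userland sketch.
This is not the kernel's cursor API: struct demo_cursor and the
demo_cursor_* functions are hypothetical names invented here, and the
sketch only covers an iovec array. It shows the property the commit
describes: the cursor remembers the current segment and offset, so
consecutive reads never rewalk the array from the start.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Hypothetical cursor over an iovec array; not the kernel's definition. */
struct demo_cursor {
	const struct iovec *iov;	/* current segment */
	int iovcnt;			/* segments left, including current */
	size_t off;			/* offset into the current segment */
};

static void
demo_cursor_init(struct demo_cursor *cc, const struct iovec *iov, int iovcnt)
{
	cc->iov = iov;
	cc->iovcnt = iovcnt;
	cc->off = 0;
}

/*
 * Copy up to len bytes starting at the cursor into dst and advance the
 * cursor past them.  Returns the number of bytes actually copied.
 */
static size_t
demo_cursor_copydata(struct demo_cursor *cc, void *dst, size_t len)
{
	char *out = dst;
	size_t done = 0;

	while (done < len && cc->iovcnt > 0) {
		size_t avail = cc->iov->iov_len - cc->off;
		size_t todo = len - done < avail ? len - done : avail;

		memcpy(out + done,
		    (const char *)cc->iov->iov_base + cc->off, todo);
		done += todo;
		cc->off += todo;
		if (cc->off == cc->iov->iov_len) {
			/* Segment exhausted: step to the next one. */
			cc->iov++;
			cc->iovcnt--;
			cc->off = 0;
		}
	}
	return (done);
}
```

A caller initializes the cursor once per request and then issues
successive demo_cursor_copydata() calls; each call resumes where the
previous one stopped.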
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24545
- Consistently use 'void *' for key schedules / key contexts instead
of a mix of 'caddr_t', 'uint8_t *', and 'void *'.
- Add a ctxsize member to enc_xform similar to what auth transforms use
and require callers to malloc/zfree the context. The setkey callback
now supplies the caller-allocated context pointer and the zerokey
callback is removed. Callers now always use zfree() to ensure
key contexts are zeroed.
- Consistently use C99 initializers for all statically-initialized
instances of 'struct enc_xform'.
- Change the encrypt and decrypt functions to accept separate in and
out buffer pointers. Almost all of the backend crypto functions
already supported separate input and output buffers and this makes
it simpler to support separate buffers in OCF.
- Remove xform_userland.h shim to permit transforms to be compiled in
userland. Transforms no longer call malloc/free directly.
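
As a rough illustration of the calling convention described above, the
following userland sketch uses a toy XOR "cipher". Everything named
demo_* is hypothetical and is not the layout of 'struct enc_xform'; the
sketch only mirrors the points in the list: a ctxsize member, a
caller-allocated 'void *' context passed to setkey, separate in and out
pointers, a C99 designated initializer, and zeroing the context before
freeing it (demo_zfree() is a userland stand-in, not the kernel's
zfree()).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical transform descriptor, much smaller than the real one. */
struct demo_xform {
	const char *name;
	size_t ctxsize;		/* bytes of key context the caller allocates */
	size_t blocksize;
	int (*setkey)(void *ctx, const uint8_t *key, size_t len);
	void (*encrypt)(void *ctx, const uint8_t *in, uint8_t *out);
	void (*decrypt)(void *ctx, const uint8_t *in, uint8_t *out);
};

struct demo_xor_ctx {
	uint8_t pad[8];
};

static int
demo_xor_setkey(void *ctx, const uint8_t *key, size_t len)
{
	struct demo_xor_ctx *c = ctx;

	if (len != sizeof(c->pad))
		return (-1);
	for (size_t i = 0; i < len; i++)
		c->pad[i] = key[i];
	return (0);
}

/* XOR is its own inverse, so one routine serves both directions. */
static void
demo_xor_crypt(void *ctx, const uint8_t *in, uint8_t *out)
{
	const struct demo_xor_ctx *c = ctx;

	for (size_t i = 0; i < sizeof(c->pad); i++)
		out[i] = in[i] ^ c->pad[i];
}

static const struct demo_xform demo_xor = {
	.name = "toy-xor",
	.ctxsize = sizeof(struct demo_xor_ctx),
	.blocksize = 8,
	.setkey = demo_xor_setkey,
	.encrypt = demo_xor_crypt,
	.decrypt = demo_xor_crypt,
};

/* Userland stand-in for zeroing and freeing a key context. */
static void
demo_zfree(void *p, size_t len)
{
	volatile uint8_t *v = p;

	while (len-- > 0)
		*v++ = 0;
	free(p);
}

/* The caller owns the context: allocate, set key, use, zero, free. */
static int
demo_encrypt_block(const struct demo_xform *x, const uint8_t *key,
    size_t klen, const uint8_t *in, uint8_t *out)
{
	void *ctx = calloc(1, x->ctxsize);
	int error;

	if (ctx == NULL)
		return (-1);
	error = x->setkey(ctx, key, klen);
	if (error == 0)
		x->encrypt(ctx, in, out);
	demo_zfree(ctx, x->ctxsize);
	return (error);
}
```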
Reviewed by: cem (earlier version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24855
o Shrink sglist(9) functions to work with multipage mbufs down from
four functions to two.
o Don't use 'struct mbuf_ext_pgs *' as argument, use struct mbuf.
o Rename to something matching _epg.
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598
The following series of patches addresses three things:
Now that the array of pages is embedded into the mbuf, we no longer
need a separate structure to pass around, so struct mbuf_ext_pgs is an
artifact of the first implementation. And struct mbuf_ext_pgs_data
is a crutch to accommodate the main idea of r359919 with minimal churn.
Also, M_EXT of type EXT_PGS is just a synonym of M_NOMAP.
The namespace for the new feature is somewhat inconsistent and
sometimes has lengthy prefixes. In these patches we will
gradually bring the namespace to the "m_epg" prefix for all mbuf
fields and most functions.
Step 1 of 4:
o Anonymize mbuf_ext_pgs_data, embed in m_ext
o Embed mbuf_ext_pgs
o Start documenting all this entanglement
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598
This largely reuses the TLS TOE support added in r330884. However,
this uses the KTLS framework in upstream OpenSSL rather than requiring
Chelsio-specific patches to OpenSSL. As with the existing TLS TOE
support, use of RX offload requires setting the tls_rx_ports sysctl.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24453
- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
authentication algorithms and keys as well as the initial sequence
number.
- When reading from a socket using KTLS receive, applications must use
recvmsg(). Each successful call to recvmsg() will return a single
TLS record. A new TCP control message, TLS_GET_RECORD, will contain
the TLS record header of the decrypted record. The regular message
buffer passed to recvmsg() will receive the decrypted payload. This
is similar to the interface used by Linux's KTLS RX except that
Linux does not return the full TLS header in the control message.
- Add plumbing to the TOE KTLS interface to request either transmit
or receive KTLS sessions.
- When a socket is using receive KTLS, redirect reads from
soreceive_stream() into soreceive_generic().
- Note that this interface is currently only defined for TLS 1.1 and
1.2, though I believe we will be able to reuse the same interface
and structures for 1.3.
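
On the application side, picking the record header out of the control
data returned by recvmsg() might look roughly like the sketch below.
Assumptions, none of them taken from this commit: struct demo_tls_record
is a hypothetical stand-in for the structure carried in the control
message, the fallback value for TLS_GET_RECORD is a placeholder so the
sketch builds on systems without the KTLS headers, and the cmsg level is
assumed to be IPPROTO_TCP. Real code should use the definitions from the
system headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>		/* IPPROTO_TCP */

/* Placeholder only; prefer the value from the system's KTLS header. */
#ifndef TLS_GET_RECORD
#define TLS_GET_RECORD	2
#endif

/* Hypothetical stand-in for the header passed in the control message. */
struct demo_tls_record {
	uint8_t type;
	uint8_t vmajor;
	uint8_t vminor;
	uint16_t length;
};

/*
 * Walk the control messages filled in by recvmsg() and copy out the
 * TLS record header if one is present.  Returns 0 on success and -1
 * if no matching control message was found.
 */
static int
demo_find_tls_record(struct msghdr *msg, struct demo_tls_record *out)
{
	struct cmsghdr *cmsg;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
	    cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == IPPROTO_TCP &&
		    cmsg->cmsg_type == TLS_GET_RECORD &&
		    cmsg->cmsg_len >= CMSG_LEN(sizeof(*out))) {
			memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
			return (0);
		}
	}
	return (-1);
}
```

The decrypted payload itself would arrive in the msg_iov buffers of the
same recvmsg() call; only the header travels in the control data.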
as the dma_device during RDMA registration.
cxgbe's struct device cannot be used as-is because it's a native FreeBSD
driver and ibcore is LinuxKPI based.
MFC after: 1 week
MFC after: r360196
The sole in-tree user of this flag has been retired, so remove this
complexity from all drivers. While here, add a helper routine drivers
can use to read the current request's IV into a local buffer. Use
this routine to replace duplicated code in nearly all drivers.
Reviewed by: cem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24450
KTLS uses the flowid to distribute software encryption tasks among its
pool of worker threads. Without this change, all software KTLS
requests for TOE sockets ended up on the first worker thread.
Note that the flowid for TOE sockets created via connect() is not a
hash of the 4-tuple, but is instead the id of the TOE pcb (tid). TOE
sockets created from TOE listen sockets do use the 4-tuple RSS hash as
the flowid since the firmware provides the hash in the message
containing the original SYN.
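
To see why a missing flowid funnels all work onto one thread, consider
this minimal model. The modulo mapping and the name demo_pick_worker()
are assumptions made for illustration, as is treating an unset flowid as
0; the commit does not spell out how KTLS maps a flowid to a worker.

```c
#include <stdint.h>

/*
 * Hypothetical mapping from a flow ID to one of nworkers worker
 * threads.  nworkers must be non-zero.
 */
static unsigned
demo_pick_worker(uint32_t flowid, unsigned nworkers)
{
	return (flowid % nworkers);
}
```

Under this model every socket whose flowid is 0 lands on worker 0, while
distinct tids or RSS hashes spread across the pool.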
Reviewed by: np (earlier version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24348
This fixes a panic when unloading and reloading t4_tom.ko since the
old pointer is still stored when t4_tom_load tries to set it.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24358
This fixes a panic that would occur when the timer tried to close a
stale socket.
Submitted by: Krishnamraju Eraparaju @ Chelsio
MFC after: 1 week
Sponsored by: Chelsio Communications
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.
This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.
In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.
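
The "5 pages" figure follows from simple arithmetic, sketched below
under the assumption of 4096-byte pages (DEMO_PAGE_SIZE and
demo_pages_spanned() are names made up for this example). A 16K region
that starts on a page boundary covers 4 pages; any non-zero starting
offset within the first page pushes it to 5.

```c
#include <stddef.h>

#define	DEMO_PAGE_SIZE	4096u	/* assumed page size */

/*
 * Number of pages touched by len bytes that begin first_off bytes
 * into the first page.
 */
static size_t
demo_pages_spanned(size_t first_off, size_t len)
{
	if (len == 0)
		return (0);
	return (((first_off % DEMO_PAGE_SIZE) + len + DEMO_PAGE_SIZE - 1) /
	    DEMO_PAGE_SIZE);
}
```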
This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone.
This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.
In a follow-on commit, I plan to remove some hacks to avoid
accessing ext_pgs fields of mbufs, since they will now be in
cache.
Many thanks to glebius for helping to make this better in
the Netflix tree.
Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213
A T6 adapter contains two crypto engines on separate channels. This
commit distributes sessions between the two engines. Previously, only
the first engine was used.
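
One plausible way to spread sessions over two engines is to place each
new session on the engine that currently has fewer active sessions. The
sketch below is an assumption about policy, not a description of what
the driver does, and struct demo_engine and demo_pick_engine() are
hypothetical names.

```c
/* Hypothetical per-engine bookkeeping. */
struct demo_engine {
	unsigned active_sessions;
};

/*
 * Choose the less loaded of two engines for a new session, preferring
 * engine 0 on a tie, and account for the new session.  Returns the
 * index of the chosen engine.
 */
static unsigned
demo_pick_engine(struct demo_engine eng[2])
{
	unsigned i;

	i = eng[1].active_sessions < eng[0].active_sessions ? 1 : 0;
	eng[i].active_sessions++;
	return (i);
}
```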
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24347