- Fix PR 227760 by getting the TOE to respond to the SYN after the call
to toe_syncache_add, not during it. The kernel syncache code calls
syncache_respond just before syncache_insert. If the ACK to the
syncache_respond is processed in another thread it may run before the
syncache_insert and won't find the entry. Note that this affects only
t4_tom because it's the only driver trying to insert and expand
syncache entries from different threads.
- Do not leak resources if an embryonic connection terminates at
SYN_RCVD because of L2 lookup failures.
- Retire lctx->synq and associated code because there is never a need to
walk the list of embryonic connections associated with a listener.
The per-tid state is still called a synq entry in the driver even
though the synq itself is now gone.
PR: 227760
MFC after: 2 weeks
Sponsored by: Chelsio Communications
- Store the clip table in 'struct adapter' instead of in the TOM softc.
- Init the clip table during attach and teardown during detach.
- While here, add a dev.<nexus>.<unit>.misc.clip sysctl to dump the
CLIP table.
This does mean that we update the clip table even if TOE is not enabled,
but non-TOE things need the CLIP table anyway.
Reviewed by: np, Krishnamraju Eraparaju @ Chelsio
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D18010
vmem's are not just used for TLS memory in TOM and the #include actually
predates the TLS code so should not have been removed when the TLS vmem
moved in r340466.
Pointy hat to: jhb
Sponsored by: Chelsio Communications
The key context is always placed immediately after the work request
header. The total work request length has to be rounded up by 16
however.
MFC after: 1 month
Sponsored by: Chelsio Communications
For TOE TLS, we just want to advance the send pointer to skip over the
record just sent to the TOE. The recently added sbsndptr_adv() is
sufficient for that and is cheaper.
MFC after: 1 month
Sponsored by: Chelsio Communications
r254889 added tcp_state_change() as a centralized place to log state
changes in TCP connections for DTrace. r294869 and r296881 took
advantage of this central location to manage per-state counters.
However, TOE sockets were still performing some (but not all) state
change updates via direct assignments to t_state. This resulted in
state counters underflowing when TOE was in use. Fix by using
tcp_state_change() when changing a TOE connection's state.
Reviewed by: np, markj
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D17915
- Switch to using 32b port/link capabilities in the driver. The 32b
format is used internally by firmwares > 1.16.45.0 and the driver will
now interact with the firmware in its native format, whether it's 16b
or 32b. Note that the 16b format doesn't have room for 50G, 200G, or
400G speeds.
- Add a bit in the pause_settings knobs to allow negotiated PAUSE
settings to override manual settings.
- Ensure that manual link settings persist across an administrative
down/up as well as transceiver unplug/replug.
- Remove unused is_*G_port() functions.
Approved by: re@ (gjb@)
MFC after: 1 month
Sponsored by: Chelsio Communications
own region in the TCAM starting with T6, unlike previous chips where
they were in the same region as normal filters.
These filters "hit" before anything else in the LE's lookup. The exact
order is:
a) High priority filters
b) TOE's active region (TCAM and/or hash)
c) Servers (TOE hw listeners)
d) Normal filters
MFC after: 1 week
Sponsored by: Chelsio Communications
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.
Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
If a socket is closed or shutdown and a partial record (or what
appears to be a partial record) is waiting in the socket buffer,
discard the partial record and close the connection rather than
waiting forever for the rest of the record.
Reported by: Harsh Jain @ Chelsio
Sponsored by: Chelsio Communications
- Driver support for hardware NAT.
- Driver support for swapmac action.
- Validate a request to create a hashfilter against the filter mask.
- Add a hashfilter config file for T5.
Sponsored by: Chelsio Communications
These filters reside in the card's memory instead of its TCAM and can be
configured via a new "hashfilter" subcommand in cxgbetool. Hash and
normal TCAM filters can be used together. The hardware does an
exact-match of packet fields for hash filters, unlike the masked match
performed for TCAM filters. Any T5/T6 card with memory can support at
least half a million hash filters. The sample config file with the
driver configures 512K of these, it is possible to double this to 1
million+ in some cases.
The chip does an exact-match of fields of incoming datagrams with hash
filters and performs the action configured for the filter if it matches.
The fields to match are specified in a "filter mask" in the firmware
config file. The filter mask always includes the 5-tuple (sip, dip,
sport, dport, ipproto). It can, optionally, also include any subset of
the filter mode (see filterMode and filterMask in the firmware config
file).
For example:
filterMode = fragmentation, mpshittype, protocol, vlan, port, fcoe
filterMask = protocol, port, vlan
Exact values of the 5-tuple, the physical port, and VLAN tag would have
to be provided while setting up a hash filter with the chip
configuration above.
Hash filters support all actions supported by TCAM filters. A packet
that hits a hash filter can be dropped, let through (with optional
steering to a specific queue or RSS region), switched out of another
port (with optional L2 rewrite of DMAC, SMAC, VLAN tag), or get NAT'ed.
(Support for some of these will show up in the driver in a follow-up
commit very shortly).
Sponsored by: Chelsio Communications
Reserve 3b in the 14b atid to identify the owner and use it to dispatch
the CPL. This allows all CPLs that use an atid to be used as shared
CPLs, although ACT_OPEN_RPL is the only one being converted in this
revision.
Sponsored by: Chelsio Communications
intended recipient of a CPL when it can't be determined solely from the
opcode. Retire the per-queue handlers for such CPLs in favor of the new
scheme.
Sponsored by: Chelsio Communications
Previously, get_keyid() was returning the address of the receive key
instead of the transmit key when renegotiating the transmit key. This
could either hang the card (if a connection was only offloading TLS TX
and thus had a receive key address of -1) or cause the connection to
fail by overwriting the wrong key (if both RX and TX TLS were
offloaded).
Submitted by: Harsh Jain @ Chelsio
Sponsored by: Chelsio Communications
Retrieve the tag from the correct ifnet and use the provided tag
(instead of hardcoded 0xffff, implying no tag) in the routines that
process offload policy.
Submitted by: Krishnamraju Eraparaju @ Chelsio
Sponsored by: Chelsio Communications
COP allows fine-grained control on whether to offload a TCP connection
using t4_tom, and what settings to apply to a connection selected for
offload. t4_tom must still be loaded and IFCAP_TOE must still be
enabled for full TCP offload to take place on an interface. The
difference is that IFCAP_TOE used to be the only knob and would enable
TOE for all new connections on the inteface, but now the driver will
also consult the COP, if any, before offloading to the hardware TOE.
A policy is a plain text file with any number of rules, one per line.
Each rule has a "match" part consisting of a socket-type (L = listen,
A = active open, P = passive open, D = don't care) and a pcap-filter(7)
expression, and a "settings" part that specifies whether to offload the
connection or not and the parameters to use if so. The general format
of a rule is: [socket-type] expr => settings
Example. See cxgbetool(8) for more information.
[L] ip && port http => offload
[L] port 443 => !offload
[L] port ssh => offload
[P] src net 192.168/16 && dst port ssh => offload !nagle !timestamp cong newreno
[P] dst port ssh => offload !nagle ecn cong tahoe
[P] dst port http => offload
[A] dst port 443 => offload tls
[A] dst net 192.168/16 => offload !timestamp cong highspeed
The driver processes the rules for each new listen, active open, or
passive open and stops at the first match. There is an implicit rule at
the end of every policy that prohibits offload when no rule in the
policy matches:
[D] all => !offload
This is a reworked and expanded version of a patch submitted by
Krishnamraju Eraparaju @ Chelsio.
Sponsored by: Chelsio Communications
The TCB is read using a memory window right now. A better alternate to
get self-consistent, uncached information would be to use a GET_TCB
request but waiting for a reply from hw while holding non-sleepable
locks is quite inconvenient.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14817
Requests to modify the state of TLS connections need to be sent on the
same queue as TLS record transmit requests to ensure ordering.
However, in order to use the offload transmit queue in t4_set_tcb_field(),
the function needs to be updated to do proper flow control / credit
management when queueing a request to an offload queue. This required
passing a pointer to the toepcb itself to this function, so while here
remove the 'tid' and 'iqid' parameters and obtain those values from the
toepcb in t4_set_tcb_field() itself.
Submitted by: Harsh Jain @ Chelsio (original version)
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14871
Compare sbavail() with the cached sb_off of already-sent data instead of
always comparing with zero. This will correctly close the connection and
send the FIN if the socket buffer contains some previously-sent data but
no unsent data.
Reported by: Harsh Jain @ Chelsio
Sponsored by: Chelsio Communications
- Remove the one use of is_tls_offload() and the function. AIO special
handling only needs to be disabled when a TOE socket is actively doing
TLS offload on transmit. The TOE socket's mode (which affects receive
operation) doesn't matter, so remove the check for the socket's mode and
only check if a TOE socket has TLS transmit keys configured to determine
if an AIO write request should fall back to the normal socket handling
instead of the TOE fast path.
- Move can_tls_offload() into t4_tls.c. It is not used in critical paths,
so inlining isn't that important. Change return type to bool while here.
Sponsored by: Chelsio Communications
The TOE engine in Chelsio T6 adapters supports offloading of TLS
encryption and TCP segmentation for offloaded connections. Sockets
using TLS are required to use a set of custom socket options to upload
RX and TX keys to the NIC and to enable RX processing. Currently
these socket options are implemented as TCP options in the vendor
specific range. A patched OpenSSL library will be made available in a
port / package for use with the TLS TOE support.
TOE sockets can either offload both transmit and reception of TLS
records or just transmit. TLS offload (both RX and TX) is enabled by
setting the dev.t6nex.<x>.tls sysctl to 1 and requires TOE to be
enabled on the relevant interface. Transmit offload can be used on
any "normal" or TLS TOE socket by using the custom socket option to
program a transmit key. This permits most TOE sockets to
transparently offload TLS when applications use a patched SSL library
(e.g. using LD_LIBRARY_PATH to request use of a patched OpenSSL
library). Receive offload can only be used with TOE sockets using the
TLS mode. The dev.t6nex.0.toe.tls_rx_ports sysctl can be set to a
list of TCP port numbers. Any connection with either a local or
remote port number in that list will be created as a TLS socket rather
than a plain TOE socket. Note that although this sysctl accepts an
arbitrary list of port numbers, the sysctl(8) tool is only able to set
sysctl nodes to a single value. A TLS socket will hang without
receiving data if used by an application that is not using a patched
SSL library. Thus, the tls_rx_ports node should be used with care.
For a server mostly concerned with offloading TLS transmit, this node
is not needed as plain TOE sockets will fall back to software crypto
when using an unpatched SSL library.
New per-interface statistics nodes are added giving counts of TLS
packets and payload bytes (payload bytes do not include TLS headers or
authentication tags/MACs) offloaded via the TOE engine, e.g.:
dev.cc.0.stats.rx_tls_octets: 149
dev.cc.0.stats.rx_tls_records: 13
dev.cc.0.stats.tx_tls_octets: 26501823
dev.cc.0.stats.tx_tls_records: 1620
TLS transmit work requests are constructed by a new variant of
t4_push_frames() called t4_push_tls_records() in tom/t4_tls.c.
TLS transmit work requests require a buffer containing IVs. If the
IVs are too large to fit into the work request, a separate buffer is
allocated when constructing a work request. This buffer is associated
with the transmit descriptor and freed when the descriptor is ACKed by
the adapter.
Received TLS frames use two new CPL messages. The first message is a
CPL_TLS_DATA containing the decryped payload of a single TLS record.
The handler places the mbuf containing the received payload on an
mbufq in the TOE pcb. The second message is a CPL_RX_TLS_CMP message
which includes a copy of the TLS header and indicates if there were
any errors. The handler for this message places the TLS header into
the socket buffer followed by the saved mbuf with the payload data.
Both of these handlers are contained in tom/t4_tls.c.
A few routines were exposed from t4_cpl_io.c for use by t4_tls.c
including send_rx_credits(), a new send_rx_modulate(), and
t4_close_conn().
TLS keys for both transmit and receive are stored in onboard memory
in the NIC in the "TLS keys" memory region.
In some cases a TLS socket can hang with pending data available in the
NIC that is not delivered to the host. As a workaround, TLS sockets
are more aggressive about sending CPL_RX_DATA_ACK messages anytime that
any data is read from a TLS socket. In addition, a fallback timer will
periodically send CPL_RX_DATA_ACK messages to the NIC for connections
that are still in the handshake phase. Once the connection has
finished the handshake and programmed RX keys via the socket option,
the timer is stopped.
A new function select_ulp_mode() is used to determine what sub-mode a
given TOE socket should use (plain TOE, DDP, or TLS). The existing
set_tcpddp_ulp_mode() function has been renamed to set_ulp_mode() and
handles initialization of TLS-specific state when necessary in
addition to DDP-specific state.
Since TLS sockets do not receive individual TCP segments but always
receive full TLS records, they can receive more data than is available
in the current window (e.g. if a 16k TLS record is received but the
socket buffer is itself 16k). To cope with this, just drop the window
to 0 when this happens, but track the overage and "eat" the overage as
it is read from the socket buffer not opening the window (or adding
rx_credits) for the overage bytes.
Reviewed by: np (earlier version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14529
- Change t4_ddp_mod_load() to return void instead of always returning
success. This avoids having to pretend to have proper support for
unloading when only part of t4_tom_mod_load() has run.
- If t4_register_uld() fails, don't invoke t4_tom_mod_unload() directly.
The module handling code in the kernel invokes MOD_UNLOAD on a module
whose MOD_LOAD fails with an error already.
Reviewed by: np (part of a larger patch)
MFC after: 1 month
Sponsored by: Chelsio Communications
This consolidates all of the DDP state in one place. Also, the code has
now been fixed to ensure that DDP state is only accessed for DDP
connections. This should not be a functional change but makes it cleaner
and easier to add state for other TOE socket modes in the future.
MFC after: 1 month
Sponsored by: Chelsio Communications
This used to work by accident with ld.bfd even though always_keepalive
was marked as static. LLD honors static more correctly, so export this
variable properly (including moving it into the tcp_* namespace).
Reviewed by: bz, emaste
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14129
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
r324539 gathered some vnet decls into netinet/tcp_var.h, so that they
are now redundant in dev/cxgbe/tom/{t4_cpl_io.c,t4_ddp.c}. This triggers
gcc -Wredundant-decls.
Reviewed by: np
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12674
All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list. Not all functions need the second
argument, some don't even need the first one. The second argument
lives in next cache line, so not dereferencing it is a performance
gain. This was discovered in sendfile(2), which will be covered by
next commits.
The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.
Reviewed by: gallatin, kbowling
Differential Revision: https://reviews.freebsd.org/D12615
To optimize the case of ping-ponging between two buffers, the DDP code
caches the last two buffers used keeping the pages wired and page pods
stored in the NIC's RAM. If a new aio_read() request uses one of the
same buffers, then the work of holding pages, etc. can be avoided.
However, the starting virtual address of an aio buffer was not saved,
only the page count, length, and initial page offset. Thus, an
aio_read() request could match a different buffer in the address
space. (Earlier during development vm_fault_hold_quick_pages() was
always called and the vm_page_t values were compared, but that was
eventually removed without being adequately replaced.) Fix by storing
the starting virtual address and comparing that (along with other
fields) to determine if a buffer can be reused.
MFC after: 3 days
Sponsored by: Chelsio Communications
used by the TOE hardware for fully offloaded connections. The knob
affects new connections only.
MFC after: 2 weeks
Sponsored by: Chelsio Communications