all T4 and T5 based cards and is useful for analyzing TSO, LRO, TOE, and
for general purpose monitoring without tapping any cxgbe or cxl ifnet
directly.
Tracers on the T4/T5 chips provide access to Ethernet frames exactly as
they were received from or transmitted on the wire. On transmit, a
tracer will capture a frame after TSO segmentation, hw VLAN tag
insertion, hw L3 & L4 checksum insertion, etc. It will also capture
frames generated by the TCP offload engine (TOE traffic is normally
invisible to the kernel). On receive, a tracer will capture a frame
before hw VLAN extraction, runt filtering, other badness filtering,
before the steering/drop/L2-rewrite filters or the TOE have had a go at
it, and of course before sw LRO in the driver.
There are 4 tracers on a chip. A tracer can trace only in one direction
(tx or rx). For now cxgbetool will set up tracers to capture the first
128B of every transmitted or received frame on a given port. This is a
small subset of what the hardware can do. A pseudo ifnet with the same
name as the nexus driver (t4nex0 or t5nex0) will be created for tracing.
The data delivered to this ifnet is an additional copy made inside the
chip. Normal delivery to cxgbe<n> or cxl<n> will be made as usual.
/* watch cxl0, which is the first port hanging off t5nex0. */
# cxgbetool t5nex0 tracer 0 tx0 (watch what cxl0 is transmitting)
# cxgbetool t5nex0 tracer 1 rx0 (watch what cxl0 is receiving)
# cxgbetool t5nex0 tracer list
# tcpdump -i t5nex0 <== all that cxl0 sees and puts on the wire
If you were doing TSO, a tcpdump on cxl0 may have shown you ~64K
"frames" with no L3/L4 checksum but this will show you the frames that
were actually transmitted.
/* all done */
# cxgbetool t5nex0 tracer 0 disable
# cxgbetool t5nex0 tracer 1 disable
# cxgbetool t5nex0 tracer list
# ifconfig t5nex0 destroy
when the interface is up.
- Add a tunable to control the TOE's rx coalesce feature (enabled by
default as it always has been). Consider the interface MTU or the
coalesce size when deciding which cluster zone to use to fill the
offload rx queue's free list. The tunable is:
dev.{t4nex,t5nex}.<N>.toe.rx_coalesce
MFC after: 1 day
includes support for the NIC and TOE features of the 40G, 10G, and
1G/100M cards based on the T5.
The ASIC is mostly backward compatible with the Terminator 4 so cxgbe(4)
has been updated instead of writing a brand new driver. T5 cards will
show up as cxl (short for cxlgb) ports attached to the t5nex bus driver.
Sponsored by: Chelsio
CSUM_TCP too. They are all set explicitly by the kernel usually.
While here, fix an unrelated bug where hardware L4 checksum calculation
was accidentally disabled for some IPv6 packets.
Reported by: alfred@
MFC after: 3 days
- Setup multiple DDP page sizes. When the driver attempts DDP it will
try to combine physically contiguous pages into regions of these sizes.
- Set the indicate size such that the payload carried in the indicate can
be copied in the header mbuf (and the 16K rx buffer can be recycled).
- Set DDP threshold to the max payload that the chip will coalesce and
deliver to the driver (this is ~16K by default, which is also why the
offload rx queue is backed by 16K buffers). If the chip is able to
coalesce up to the max it's allowed to, it's a good sign that the peer
is transmitting in bulk without any TCP PSH.
MFC after: 2 weeks
interface's MTU. Initialize such freelists with correct values.
This wasn't a problem for common MTUs (1500 and 9000) as the buffers (2048
and 9216 in size) happened to have enough spare room. I ran into it when
playing around with unusual MTUs.
MFC after: 2 weeks
values).
- cong_drop specifies what to do on congestion: nothing, backpressure,
or drop.
- fl_pktshift specifies the padding before Ethernet payload.
- fl_pad specifies the boundary upto which to pad Ethernet payload.
- spg_len controls the length of the status page.
MFC after: 2 weeks
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.
- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.
Build-tested with make universe.
30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE
Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe
Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe
Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
- Device configuration via plain text config file. Also able to operate
when not attached to the chip as the master driver.
- Generic "work request" queue that serves as the base for both ctrl and
ofld tx queues.
- Generic interrupt handler routine that can process any event on any
kind of ingress queue (via a dispatch table).
- A couple of new driver ioctls. cxgbetool can now install a firmware
to the card ("loadfw" command) and can read the card's memory
("memdump" and "tcb" commands).
- Lots of assorted information within dev.t4nex.X.misc.* This is
primarily for debugging and won't show up in sysctl -a.
- Code to manage the L2 tables on the chip.
- Updates to cxgbe(4) man page to go with the tunables that have changed.
- Updates to the shared code in common/
- Updates to the driver-firmware interface (now at fw 1.4.16.0)
MFC after: 1 month
queues. Try to have a set of these per port when possible, fall back
to sharing a common pool between all ports otherwise.
- One control queue per port (used to be one per hardware channel).
- t4_eth_rx now handles Ethernet rx only.
- sysctls to display pidx/cidx for some queues.
MFC after: 1 week
Reference code that shows how to get a packet's timestamp out of
cxgbe(4). Disabled by default because we don't have a standard way
today to pass this information up the stack.
The timestamp is 60 bits wide and each increment represents 1 tick of
the T4's core clock. As an example, the timestamp granularity is ~4.4ns
for this card:
# sysctl dev.t4nex.0.core_clock
dev.t4nex.0.core_clock: 228125
MFC after: 1 week
- Enable 5-tuple and every-packet lookup.
- Setup the default filter mode to allow filtering/steering based on IP
protocol, ingress port, inner VLAN ID, IP frag, FCoE, and MPS match
type; all combined together. You can also filter based on MAC index,
Ethernet type, IP TOS/IPv6 Traffic Class, and outer VLAN ID but you'll
have to modify the default filter mode and exclude some of the
match-fields in it.
IPv4 and IPv6 SIP/DIP/SPORT/DPORT are always available in all filter
rules.
- Add driver ioctls to get/set the global filter mode.
- Add driver ioctls to program and delete hardware filters. A couple of
the "switch" actions that rewrite Ethernet and VLAN information and
switch the packet out of another port may not work as the L2 code is not
yet in place. Everything else, including all "drop" and "pass" rules
with RSS or absolute qid, should work.
Obtained from: Chelsio Communications
that could have allowed the hardware pidx to reach the cidx even though
the freelist isn't empty. (Haven't actually seen this but it was there
waiting to happen..)
MFC after: 1 week
now a suitable base for all kinds of egress queues.
- Add control queues (sge_ctrlq) and allocate one of these per hardware
channel. They can be used to program filters and steer traffic (and
more).
MFC after: 1 week
down. The ingress queue lock was unused and has been removed as part of
these changes.
- An in-flight egress update from the SGE must be handled before the
queue that requested it is destroyed. Wait for the update to arrive.
- Interrupt handlers must stop processing rx events for a queue before
the queue is destroyed. Events that have not yet been processed
should be ignored once the queue disappears.
MFC after: 1 week
queue has its own interrupt. If the exact number that we need is not a
power of 2 and we're using MSI, then switch to interrupt multiplexing.
While here, replace the magic numbers with something more readable.
MFC after: 3 days