a) Front load as much work as possible in if_transmit, before any driver
lock or software queue has to get involved.
b) Replace buf_ring with a brand new mp_ring (multiproducer ring). This
is specifically for the tx multiqueue model where one of the if_transmit
producer threads becomes the consumer and other producers carry on as
usual. mp_ring is implemented as standalone code and it should be
possible to use it in any driver with tx multiqueue. It also has:
- the ability to enqueue/dequeue multiple items. This might become
significant if packet batching is ever implemented.
- an abdication mechanism to allow a thread to give up writing tx
descriptors and have another if_transmit thread take over. A thread
that's writing tx descriptors can end up doing so for an unbounded
time period if a) there are other if_transmit threads continuously
feeding the sofware queue, and b) the chip keeps up with whatever the
thread is throwing at it.
- accurate statistics about interesting events even when the stats come
at the expense of additional branches/conditional code.
The NIC txq lock is uncontested on the fast path at this point. I've
left it there for synchronization with the control events (interface
up/down, modload/unload).
c) Add support for "type 1" coalescing work request in the normal NIC tx
path. This work request is optimized for frames with a single item in
the DMA gather list. These are very common when forwarding packets.
Note that netmap tx in cxgbe already uses these "type 1" work requests.
d) Do not request automatic cidx updates every 32 descriptors. Instead,
request updates via bits in individual work requests (still every 32
descriptors approximately). Also, request an automatic final update
when the queue idles after activity. This means NIC tx reclaim is still
performed lazily but it will catch up quickly as soon as the queue
idles. This seems to be the best middle ground and I'll probably do
something similar for netmap tx as well.
e) Implement a faster tx path for WRQs (used by TOE tx and control
queues, _not_ by the normal NIC tx). Allow work requests to be written
directly to the hardware descriptor ring if room is available. I will
convert t4_tom and iw_cxgbe modules to this faster style gradually.
MFC after: 2 months
options into kern.opts.mk and change all the places where we use
src.opts.mk to pull in the options. Conditionally define SYSDIR and
use SYSDIR/conf/kern.opts.mk instead of a CURDIR path. Replace all
instances of CURDIR/../../etc with STSDIR, but only in the affected
files.
As a special compatibility hack, include bsd.owm.mk at the top of
kern.opts.mk to allow the bare build of sys/modules to work on older
systems. If the defaults ever change between 9.x, 10.x and current for
these options, however, you'll wind up with the host OS' defaults
rather than the -current defaults. This hack will be removed when
we no longer need to support this build scenario.
Reviewed by: jhb
Differential Revision: https://phabric.freebsd.org/D529
opt_inet6.h into kmod.mk by forcing almost everybody to eat the same
dogfood. While at it, consolidate the opt_bpf.h and opt_mroute.h
targets here too.
Netmap gets its own hardware-assisted virtual interface and won't take
over or disrupt the "normal" interface in any way. You can use both
simultaneously.
For kernels with DEV_NETMAP, cxgbe(4) carves out an ncxl<N> interface
(note the 'n' prefix) in the hardware to accompany each cxl<N>
interface. These two ifnet's per port share the same wire but really
are separate interfaces in the hardware and software. Each gets its own
L2 MAC addresses (unicast and multicast), MTU, checksum caps, etc. You
should run netmap on the 'n' interfaces only, that's what they are for.
With this, pkt-gen is able to transmit > 45Mpps out of a single 40G port
of a T580 card. 2 port tx is at ~56Mpps total (28M + 28M) as of now.
Single port receive is at 33Mpps but this is very much a work in
progress. I expect it to be closer to 40Mpps once done. In any case
the current effort can already saturate multiple 10G ports of a T5 card
at the smallest legal packet size. T4 gear is totally untested.
trantor:~# ./pkt-gen -i ncxl0 -f tx -D 00:07:43🆎cd:ef
881.952141 main [1621] interface is ncxl0
881.952250 extract_ip_range [275] range is 10.0.0.1:0 to 10.0.0.1:0
881.952253 extract_ip_range [275] range is 10.1.0.1:0 to 10.1.0.1:0
881.962540 main [1804] mapped 334980KB at 0x801dff000
Sending on netmap:ncxl0: 4 queues, 1 threads and 1 cpus.
10.0.0.1 -> 10.1.0.1 (00:00:00:00:00:00 -> 00:07:43🆎cd:ef)
881.962562 main [1882] Sending 512 packets every 0.000000000 s
881.962563 main [1884] Wait 2 secs for phy reset
884.088516 main [1886] Ready...
884.088535 nm_open [457] overriding ifname ncxl0 ringid 0x0 flags 0x1
884.088607 sender_body [996] start
884.093246 sender_body [1064] drop copy
885.090435 main_thread [1418] 45206353 pps (45289533 pkts in 1001840 usec)
886.091600 main_thread [1418] 45322792 pps (45375593 pkts in 1001165 usec)
887.092435 main_thread [1418] 45313992 pps (45351784 pkts in 1000834 usec)
888.094434 main_thread [1418] 45315765 pps (45406397 pkts in 1002000 usec)
889.095434 main_thread [1418] 45333218 pps (45378551 pkts in 1001000 usec)
890.097434 main_thread [1418] 45315247 pps (45405877 pkts in 1002000 usec)
891.099434 main_thread [1418] 45326515 pps (45417168 pkts in 1002000 usec)
892.101434 main_thread [1418] 45333039 pps (45423705 pkts in 1002000 usec)
893.103434 main_thread [1418] 45324105 pps (45414708 pkts in 1001999 usec)
894.105434 main_thread [1418] 45318042 pps (45408723 pkts in 1002001 usec)
895.106434 main_thread [1418] 45332430 pps (45377762 pkts in 1001000 usec)
896.107434 main_thread [1418] 45338072 pps (45383410 pkts in 1001000 usec)
...
Relnotes: Yes
Sponsored by: Chelsio Communications.
all T4 and T5 based cards and is useful for analyzing TSO, LRO, TOE, and
for general purpose monitoring without tapping any cxgbe or cxl ifnet
directly.
Tracers on the T4/T5 chips provide access to Ethernet frames exactly as
they were received from or transmitted on the wire. On transmit, a
tracer will capture a frame after TSO segmentation, hw VLAN tag
insertion, hw L3 & L4 checksum insertion, etc. It will also capture
frames generated by the TCP offload engine (TOE traffic is normally
invisible to the kernel). On receive, a tracer will capture a frame
before hw VLAN extraction, runt filtering, other badness filtering,
before the steering/drop/L2-rewrite filters or the TOE have had a go at
it, and of course before sw LRO in the driver.
There are 4 tracers on a chip. A tracer can trace only in one direction
(tx or rx). For now cxgbetool will set up tracers to capture the first
128B of every transmitted or received frame on a given port. This is a
small subset of what the hardware can do. A pseudo ifnet with the same
name as the nexus driver (t4nex0 or t5nex0) will be created for tracing.
The data delivered to this ifnet is an additional copy made inside the
chip. Normal delivery to cxgbe<n> or cxl<n> will be made as usual.
/* watch cxl0, which is the first port hanging off t5nex0. */
# cxgbetool t5nex0 tracer 0 tx0 (watch what cxl0 is transmitting)
# cxgbetool t5nex0 tracer 1 rx0 (watch what cxl0 is receiving)
# cxgbetool t5nex0 tracer list
# tcpdump -i t5nex0 <== all that cxl0 sees and puts on the wire
If you were doing TSO, a tcpdump on cxl0 may have shown you ~64K
"frames" with no L3/L4 checksum but this will show you the frames that
were actually transmitted.
/* all done */
# cxgbetool t5nex0 tracer 0 disable
# cxgbetool t5nex0 tracer 1 disable
# cxgbetool t5nex0 tracer list
# ifconfig t5nex0 destroy
includes support for the NIC and TOE features of the 40G, 10G, and
1G/100M cards based on the T5.
The ASIC is mostly backward compatible with the Terminator 4 so cxgbe(4)
has been updated instead of writing a brand new driver. T5 cards will
show up as cxl (short for cxlgb) ports attached to the t5nex bus driver.
Sponsored by: Chelsio
- Add full support for IPv6 addresses.
- Read the size of the L2 table during attach. Do not assume that PCIe
physical function 4 of the card has all of the table to itself.
- Use FNV instead of Jenkins to hash L3 addresses and drop the private
copy of jhash.h from the driver.
MFC after: 1 week
Basically, this is automatic rx zero copy when feasible. TCP payload is
DMA'd directly into the userspace buffer described by the uio submitted
in soreceive by an application.
- Works with sockets that are being handled by the TCP offload engine
of a T4 chip (you need t4_tom.ko module loaded after cxgbe, and an
"ifconfig +toe" on the cxgbe interface).
- Does not require any modification to the application.
- Not enabled by default. Use hw.t4nex.<X>.toe.ddp="1" to enable it.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.
- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.
Build-tested with make universe.
30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE
Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe
Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe
Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
- Device configuration via plain text config file. Also able to operate
when not attached to the chip as the master driver.
- Generic "work request" queue that serves as the base for both ctrl and
ofld tx queues.
- Generic interrupt handler routine that can process any event on any
kind of ingress queue (via a dispatch table).
- A couple of new driver ioctls. cxgbetool can now install a firmware
to the card ("loadfw" command) and can read the card's memory
("memdump" and "tcb" commands).
- Lots of assorted information within dev.t4nex.X.misc.* This is
primarily for debugging and won't show up in sysctl -a.
- Code to manage the L2 tables on the chip.
- Updates to cxgbe(4) man page to go with the tunables that have changed.
- Updates to the shared code in common/
- Updates to the driver-firmware interface (now at fw 1.4.16.0)
MFC after: 1 month
filters working. (All other filters - switch without L2 info rewrite,
steer, and drop - were already fully-functional).
Some contrived examples of "switch" filters with L2 rewriting:
# cxgbetool t4nex0 iport 0 dport 80 action switch vlan +9 eport 3
Intercept all packets received on physical port 0 with TCP port 80 as
destination, insert a vlan tag with VID 9, and send them out of port 3.
# cxgbetool t4nex0 sip 192.168.1.1/32 ivlan 5 action switch \
vlan =9 smac aa:bb:cc:dd:ee:ff eport 0
Intercept all packets (received on any port) with source IP address
192.168.1.1 and VLAN id 5, rewrite the VLAN id to 9, rewrite source mac
to aa:bb:cc:dd:ee:ff, and send it out of port 0.
MFC after: 1 week
Reference code that shows how to get a packet's timestamp out of
cxgbe(4). Disabled by default because we don't have a standard way
today to pass this information up the stack.
The timestamp is 60 bits wide and each increment represents 1 tick of
the T4's core clock. As an example, the timestamp granularity is ~4.4ns
for this card:
# sysctl dev.t4nex.0.core_clock
dev.t4nex.0.core_clock: 228125
MFC after: 1 week