events.
This is primarily for the TX EDMA and TX EDMA completion. I haven't yet
tied it into the EDMA RX path or the legacy TX/RX path.
Things that I don't quite like:
* Make the pointer type 'void' in ath_softc and have if_ath_alq*()
return a malloc'ed buffer. That would remove the need to include
if_ath_alq.h in if_athvar.h.
* The sysctl setup needs to be cleaned up.
Right now processing a full 512 frame queue takes quite a while (measured
on the order of milliseconds.) Because of this, the TX processing ends up
sometimes preempting the taskqueue:
* userland sends a frame
* it goes in through net80211 and out to ath_start()
* ath_start() will end up either direct dispatching or software queuing a
frame.
If TX had to wait for RX to finish, it would add quite a few ms of
additional latency to the packet transmission. This in the past has
caused issues with TCP throughput.
Now, as part of my attempt to bring sanity to the TX/RX paths, the first
step is to make the RX processing happen in smaller 'parts'. That way
when TX is pushed into the ath taskqueue, there won't be so much latency
in the way of things.
The bigger scale change (which will come much later) is to actually
process the RX descriptors in the ath_intr taskqueue but process _frames_
in the ath driver taskqueue. That would reduce the latency between
processing and requeuing new descriptors. But that'll come later.
The actual work:
* Add ATH_RX_MAX at 128 (static for now);
* break out of the processing loop if npkts reaches ATH_RX_MAX;
* if we processed ATH_RX_MAX or more frames during the processing loop,
immediately reschedule another RX taskqueue run. This will handle
the further frames in the taskqueue.
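A minimal userland sketch of the batching described above. Only ATH_RX_MAX
and its value of 128 come from this commit; the demo_* names and the
counter standing in for the RX descriptor list are hypothetical, and the
do/while loop stands in for the taskqueue rescheduling itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ATH_RX_MAX 128	/* per-pass frame budget; value from the text above */

/* Hypothetical stand-in for the RX descriptor list: a count of completed frames. */
struct demo_rxq {
	size_t pending;
};

/* Consume one completed frame, if any. Real per-frame work would go here. */
static bool
demo_rx_one(struct demo_rxq *q)
{
	if (q->pending == 0)
		return (false);
	q->pending--;
	return (true);
}

/*
 * One taskqueue pass: handle at most ATH_RX_MAX frames.
 * Returns true if the budget was used up, i.e. the caller should
 * immediately schedule another pass.
 */
static bool
demo_rx_pass(struct demo_rxq *q, size_t *npkts)
{
	assert(q != NULL && npkts != NULL);
	*npkts = 0;
	while (*npkts < ATH_RX_MAX && demo_rx_one(q))
		(*npkts)++;
	return (*npkts >= ATH_RX_MAX);
}

/* Stand-in for the taskqueue: rerun the pass while it asks to be rescheduled. */
static unsigned
demo_rx_drain(struct demo_rxq *q)
{
	unsigned passes = 0;
	size_t npkts;
	bool again;

	do {
		again = demo_rx_pass(q, &npkts);
		passes++;
	} while (again);
	return (passes);
}
```

In this sketch a full 512 frame queue drains in five passes: four full
passes of 128 frames, then one empty pass that ends the rescheduling.
Anything else queued on the taskqueue (such as TX) gets a chance to run
between those passes.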
This should have very minimal impact on the general throughput case,
unless the scheduler is being very very strange or the ath taskqueue
ends up spending a lot of time on non-RX operations (such as TX
completion.)
This should eventually be unified with ATH_DEBUG() so I can get both
from one macro; that may take some time.
Add some new probes for TX and TX completion.
The AR9300 and later descriptors are 128 bytes, however I'd like to make
sure that isn't used for earlier chips.
* Populate the TX descriptor length field in the softc with
sizeof(ath_desc)
* Use this field when allocating the TX descriptors
* Pre-AR93xx TX/RX descriptors will use the ath_desc size; newer ones will
query the HAL for these sizes.
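A rough sketch of the size selection above, not the driver code itself.
The demo_* types, the has_edma flag and the tx_desc_len field are all
assumptions made for illustration; the real struct ath_desc layout and
the real HAL query differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical legacy descriptor; the real struct ath_desc layout differs. */
struct demo_desc {
	uint32_t words[8];
};

/* Hypothetical view of what the HAL reports about the chip. */
struct demo_hal {
	bool	has_edma;	/* assumed true for AR93xx and later */
	size_t	tx_desc_len;	/* only meaningful when has_edma is set */
};

struct demo_softc {
	size_t	sc_tx_desclen;	/* per-descriptor size used for allocation */
};

/* Default to the legacy descriptor size; newer chips override it from the HAL. */
static void
demo_set_tx_desclen(struct demo_softc *sc, const struct demo_hal *hal)
{
	assert(sc != NULL && hal != NULL);
	sc->sc_tx_desclen = sizeof(struct demo_desc);
	if (hal->has_edma)
		sc->sc_tx_desclen = hal->tx_desc_len;
}

/* Bytes needed for ndesc TX descriptors, using the softc field and not sizeof(). */
static size_t
demo_tx_desc_bytes(const struct demo_softc *sc, size_t ndesc)
{
	return (sc->sc_tx_desclen * ndesc);
}
```

The point of going through the softc field is that the allocation code
never hardcodes a descriptor size, so earlier chips can't accidentally
pick up the 128 byte descriptors.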
The AR9003 series NICs implement a separate RX error to signal that a
keycache miss occurred. The earlier NICs would not set the key index
valid bit.
I'll dig into the difference between "no key index bit set" and "keycache
miss".
The AR93xx and later chips support two RX FIFO queues - a high and low
priority queue.
For legacy chips, just assume the queues are high priority.
This is inspired by the reference driver but is a reimplementation of
the API and code.
The RX EDMA support requires a modified approach to the RX descriptor
handling.
Specifically:
* There are now two RX queues - high and low priority;
* The RX queues are implemented as FIFOs; they're now an array of pointers
to buffers;
* .. and the RX buffer and descriptor are in the same "buffer", rather than
being separate.
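The three points above can be sketched roughly as follows. This is an
illustration only: every demo_* name, the FIFO depth, the buffer size and
the status area length are made up here and are not the driver's or the
hardware's values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FIFO_DEPTH	4	/* hypothetical FIFO depth */
#define DEMO_BUF_SIZE	2048	/* hypothetical RX buffer size */
#define DEMO_STATUS_LEN	48	/* hypothetical descriptor/status area size */

/* Two RX queues: high and low priority. */
enum demo_rx_qtype { DEMO_RX_HP, DEMO_RX_LP, DEMO_RX_NQUEUES };

/* Status area and frame data share one buffer; status comes first. */
struct demo_rxbuf {
	uint8_t data[DEMO_BUF_SIZE];
};

/* Each queue is a FIFO: an array of pointers to buffers. */
struct demo_rx_fifo {
	struct demo_rxbuf *slot[DEMO_FIFO_DEPTH];
	unsigned head;		/* index of the oldest entry */
	unsigned count;		/* number of entries in use */
};

struct demo_rx_state {
	struct demo_rx_fifo fifo[DEMO_RX_NQUEUES];
};

/* Append a buffer; returns false if the FIFO is full. */
static bool
demo_fifo_push(struct demo_rx_fifo *f, struct demo_rxbuf *bf)
{
	assert(f != NULL && bf != NULL);
	if (f->count == DEMO_FIFO_DEPTH)
		return (false);
	f->slot[(f->head + f->count) % DEMO_FIFO_DEPTH] = bf;
	f->count++;
	return (true);
}

/* Remove and return the oldest buffer, or NULL if the FIFO is empty. */
static struct demo_rxbuf *
demo_fifo_pop(struct demo_rx_fifo *f)
{
	struct demo_rxbuf *bf;

	if (f->count == 0)
		return (NULL);
	bf = f->slot[f->head];
	f->slot[f->head] = NULL;
	f->head = (f->head + 1) % DEMO_FIFO_DEPTH;
	f->count--;
	return (bf);
}

/* The frame payload starts after the status area in the same buffer. */
static uint8_t *
demo_rxbuf_payload(struct demo_rxbuf *bf)
{
	return (bf->data + DEMO_STATUS_LEN);
}

/* Room left for frame data once the status area is accounted for. */
static size_t
demo_rxbuf_payload_max(void)
{
	return (DEMO_BUF_SIZE - DEMO_STATUS_LEN);
}
```

The last two helpers are the reason the buffer length has to be adjusted
for RX EDMA: part of each buffer is taken up by the status area, so less
of it is available for the frame itself.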
So to that end, this commit abstracts out most of the RX related functions
from the bulk of the driver. Notably, the RX DMA/buffer allocation isn't
updated, primarily because I haven't yet fleshed out what it should look
like.
Whilst I'm here, create a set of matching but mostly unimplemented EDMA
stubs.
Tested:
* AR9280, station mode
TODO:
* Thorough AP and other mode testing for non-EDMA chips;
* Figure out how to allocate RX buffers suitable for RX EDMA, including
correctly setting the mbuf length to compensate for the RX descriptor
and completion status area.
a buffer pointer.
For large radar pulses, the AR9130 and later will return a series of
FFT results for software processing. These can overflow a single 2KB
buffer on longer pulses. This would result in undefined buffer behaviour.
This includes a few new fields in each RXed frame:
* per chain RX RSSI (ctl and ext);
* current RX chainmask;
* EVM information;
* PHY error code;
* basic RX status bits (CRC error, PHY error, etc).
This is primarily to allow me to do some userland PHY error processing
for radar and spectral scan data. However since EVM and per-chain RSSI
are provided, others may find them useful for a variety of tasks.
The default is to not compile in the radiotap vendor extensions, primarily
because tcpdump doesn't seem to handle the particular vendor extension
layout I'm using, and I'd rather not break existing code out there that
may be (badly) parsing the radiotap data.
Instead, add the option 'ATH_ENABLE_RADIOTAP_VENDOR_EXT' to your kernel
configuration file to enable these options.
called to "kick" along TX.
For now, schedule a taskqueue call.
Later on I may go back to the direct call of ath_rx_tasklet() - but for
now, this will do.
I've tested UDP and TCP TX. UDP TX still achieves 240MBit, but TCP
TX gets stuck at around 100MBit or so, instead of the 150MBit it should
be at. I'll re-test with no ACPI/power/sleep states enabled at startup
and see what effect it has.
This is in preparation for supporting an if_transmit() path, which will
turn ath_tx_kick() into a null operation (as there won't be an ifnet
queue to service.)
Tested:
* AR9280 STA
TODO:
* test on AR5416, AR9160, AR928x STA/AP modes
PR: kern/168649
implementing parallel TX and TX/RX completion can be done without
simply abusing long-held locks.
Right now, multiple concurrent ath_start() entries can result in
frames being dequeued out of order. Well, they're dequeued in order
fine, but if there's any preemption or race between CPUs between:
* removing the frame from the ifnet, and
* calling and running ath_tx_start(), until the frame is placed on a
software or hardware TXQ
Then although dequeueing the frame is in-order, queueing it to the hardware
may be out of order.
This is solved in a lot of other drivers by just holding a TX lock over
a rather long period of time. This lets them continue to direct dispatch
without races between dequeue and hardware queue.
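To make the ordering problem concrete, here is a small single-threaded
replay of the interleaving. It is a sketch only: there is no real
concurrency or locking in it, the two "CPUs" are just local variables, and
all of the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_HWQ_LEN 8

/* Stand-in for the ifnet queue: hands out frames as increasing sequence numbers. */
struct demo_ifq {
	int next_seq;
};

/* Stand-in for a hardware TXQ: records the order frames were queued in. */
struct demo_hwq {
	int seq[DEMO_HWQ_LEN];
	int n;
};

static int
demo_dequeue(struct demo_ifq *q)
{
	return (q->next_seq++);
}

static void
demo_hw_enqueue(struct demo_hwq *h, int seq)
{
	assert(h->n < DEMO_HWQ_LEN);
	h->seq[h->n++] = seq;
}

/* True if the frames reached the hardware queue in dequeue order. */
static bool
demo_hwq_in_order(const struct demo_hwq *h)
{
	for (int i = 1; i < h->n; i++)
		if (h->seq[i] < h->seq[i - 1])
			return (false);
	return (true);
}

/*
 * No lock: CPU0 dequeues and is preempted, CPU1 dequeues and queues to
 * the hardware, and only then does CPU0 queue its frame.
 */
static void
demo_tx_unlocked(struct demo_ifq *q, struct demo_hwq *h)
{
	int cpu0 = demo_dequeue(q);	/* in order */
	int cpu1 = demo_dequeue(q);	/* in order */

	demo_hw_enqueue(h, cpu1);	/* CPU1 wins the race */
	demo_hw_enqueue(h, cpu0);	/* CPU0 arrives late */
}

/*
 * Long-held TX lock: dequeue and hardware enqueue form one critical
 * section, so the pairs from each CPU cannot interleave.
 */
static void
demo_tx_locked(struct demo_ifq *q, struct demo_hwq *h)
{
	for (int cpu = 0; cpu < 2; cpu++) {
		/* the TX lock would be taken here */
		int seq = demo_dequeue(q);
		demo_hw_enqueue(h, seq);
		/* ... and dropped here */
	}
}
```

In the unlocked replay both dequeues happen in order, but the hardware
queue ends up holding frame 1 ahead of frame 0. The locked variant keeps
the order at the cost of holding the lock across the whole
dequeue-to-hardware path.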
Note to observers: if_transmit() doesn't necessarily solve this.
It removes the ifnet from the main path, but the same issue exists if
there's some intermediary queue (eg a bufring, which as an aside also
may pull in ifnet when you're using ALTQ.)
So, until I can sit down and code up a much better way of doing parallel
TX, I'm going to leave the TX path using a deferred taskqueue task.
What I will likely head towards is doing a direct dispatch to hardware
or software via if_transmit(), but it'll require some driver changes to
allow queues to be made without using the really large ath_buf / ath_desc
entries.
TODO:
* Look at how feasible it'll be to just do direct dispatch to
ath_tx_start() from if_transmit(), avoiding doing _any_ intermediary
serialisation into a global queue. This may break ALTQ for example,
so I have to be delicate.
* It's quite likely that I should break up ath_tx_start() so it
deposits frames onto the software queues first, and then only fill
in the 802.11 fields when it's being queued to the hardware.
That will make the if_transmit() -> software queue path very
quick and lightweight.
* This has some very bad behaviour when using ACPI and Cx states.
I'll do some subsequent analysis using KTR and schedgraph and file
a follow-up PR or two.
PR: kern/168649
There's some TX path TDMA code in if_ath_tx.c which should be migrated
out, but first I should likely try and verify/fix/repair the TDMA support
in 9.x and -HEAD.
* migrate the rx processing out into if_ath_rx.c
* migrate the TSF functions into if_ath_tsf.h, as inlines
This is in preparation for supporting the EDMA RX routines, required to
support the AR93xx series NICs.
TODO:
* ath_start() shouldn't be private, but it's called as part of
the RX path. I should likely migrate ath_rx_tasklet() back into
if_ath.c and then return this to be 'static'. The RX code really
shouldn't need to see TX routines (and vice versa.)
* ath_beacon_* should be in if_ath_beacon.[ch].
* ath_tdma_* should be in if_ath_tdma.[ch] ...