it turns out that it negatively affects performance. I'm stil investigating
exactly why deferring the IO causes such negative TCP performance but
doesn't affect UDP preformance.
Leave the ath_tx_kick() change in there however; it's going to be useful
to have that there for if_transmit() work.
PR: kern/168649
called to "kick" along TX.
For now, schedule a taskqueue call.
Later on I may go back to the direct call of ath_rx_tasklet() - but for
now, this will do.
I've tested UDP and TCP TX. UDP TX still achieves 240MBit, but TCP
TX gets stuck at around 100MBit or so, instead of the 150MBit it should
be at. I'll re-test with no ACPI/power/sleep states enabled at startup
and see what effect it has.
This is in preparation for supporting an if_transmit() path, which will
turn ath_tx_kick() into a NUL operation (as there won't be an ifnet
queue to service.)
Tested:
* AR9280 STA
TODO:
* test on AR5416, AR9160, AR928x STA/AP modes
PR: kern/168649
implementing parallel TX and TX/RX completion can be done without
simply abusing long-held locks.
Right now, multiple concurrent ath_start() entries can result in
frames being dequeued out of order. Well, they're dequeued in order
fine, but if there's any preemption or race between CPUs between:
* removing the frame from the ifnet, and
* calling and runningath_tx_start(), until the frame is placed on a
software or hardware TXQ
Then although dequeueing the frame is in-order, queueing it to the hardware
may be out of order.
This is solved in a lot of other drivers by just holding a TX lock over
a rather long period of time. This lets them continue to direct dispatch
without races between dequeue and hardware queue.
Note to observers: if_transmit() doesn't necessarily solve this.
It removes the ifnet from the main path, but the same issue exists if
there's some intermediary queue (eg a bufring, which as an aside also
may pull in ifnet when you're using ALTQ.)
So, until I can sit down and code up a much better way of doing parallel
TX, I'm going to leave the TX path using a deferred taskqueue task.
What I will likely head towards is doing a direct dispatch to hardware
or software via if_transmit(), but it'll require some driver changes to
allow queues to be made without using the really large ath_buf / ath_desc
entries.
TODO:
* Look at how feasible it'll be to just do direct dispatch to
ath_tx_start() from if_transmit(), avoiding doing _any_ intermediary
serialisation into a global queue. This may break ALTQ for example,
so I have to be delicate.
* It's quite likely that I should break up ath_tx_start() so it
deposits frames onto the software queues first, and then only fill
in the 802.11 fields when it's being queued to the hardware.
That will make the if_transmit() -> software queue path very
quick and lightweight.
* This has some very bad behaviour when using ACPI and Cx states.
I'll do some subsequent analysis using KTR and schedgraph and file
a follow-up PR or two.
PR: kern/168649
There's some TX path TDMA code in if_ath_tx.c which should be migrated
out, but first I should likely try and verify/fix/repair the TDMA support
in 9.x and -HEAD.
* migrate the rx processing out into if_ath_rx.c
* migrate the TSF functions into if_ath_tsf.h, as inlines
This is in prepration for supporting the EDMA RX routines, required to
support the AR93xx series NICs.
TODO:
* ath_start() shouldn't be private, but it's called as part of
the RX path. I should likely migrate ath_rx_tasklet() back into
if_ath.c and then return this to be 'static'. The RX code really
shouldn't need to see TX routines (and vice versa.)
* ath_beacon_* should be in if_ath_beacon.[ch].
* ath_tdma_* should be in if_ath_tdma.[ch] ...
for Atheros AR5416 and later wireless devices.
This is a very large commit - the complete history can be
found in the user/adrian/if_ath_tx branch.
Legacy (ie, pre-AR5416) devices also use the per-software
TXQ support and (in theory) can support non-aggregation
ADDBA sessions. However, the net80211 stack doesn't currently
support this.
In summary:
TX path:
* queued frames normally go onto a per-TID, per-node queue
* some special frames (eg ADDBA control frames) are thrown
directly onto the relevant hardware queue so they can
go out before any software queued frames are queued.
* Add methods to create, suspend, resume and tear down an
aggregation session.
* Add in software retransmission of both normal and aggregate
frames.
* Add in completion handling of aggregate frames, including
parsing the block ack bitmap provided by the hardware.
* Write an aggregation function which can assemble frames into
an aggregate based on the selected rate control and channel
configuration.
* The per-TID queues are locked based on their target hardware
TX queue. This matches what ath9k/atheros does, and thus
simplified porting over some of the aggregation logic.
* When doing TX aggregation, stick the sequence number allocation
in the TX path rather than net80211 TX path, and protect it
by the TXQ lock.
Rate control:
* Delay rate control selection until the frame is about to
be queued to the hardware, so retried frames can have their
rate control choices changed. Frames with a static rate
control selection have that applied before each TX, just
to simplify the TX path (ie, not have "static" and "dynamic"
rate control special cased.)
* Teach ath_rate_sample about aggregates - both completion and
errors.
* Add an EWMA for tracking what the current "good" MCS rate is
based on failure rates.
Misc:
* Introduce a bunch of dirty hacks and workarounds so TID mapping
and net80211 frame inspection can be kept out of the net80211
layer. Because of the way this code works (and it's from Atheros
and Linux ath9k), there is a consistent, 1:1 mapping between
TID and AC. So we need to ensure that frames going to a specific
TID will _always_ end up on the right AC, and vice versa, or the
completion/locking will simply get very confused. I plan on
addressing this mess in the future.
Known issues:
* There is no BAR frame transmission just yet. A whole lot of
tidying up needs to occur before BAR frame TX can occur in the
"correct" place - ie, once the TID TX queue has been drained.
* Interface reset/purge/etc results in frames in the TX and RX
queues being removed. This creates holes in the sequence numbers
being assigned and the TX/RX AMPDU code (on either side) just
hangs.
* There's no filtered frame support at the present moment, so
stations going into power saving mode will simply have a number
of frames dropped - likely resulting in a traffic "hang".
* Raw frame TX is going to just not function with 11n aggregation.
Likely this needs to be modified to always override the sequence
number if the frame is going into an aggregation session.
However, general raw frame injection currently doesn't work in
general in net80211, so let's just ignore this for now until
this is sorted out.
* HT protection is just not implemented and won't be until the above
is sorted out. In addition, the AR5416 has issues RTS protecting
large aggregates (anything >8k), so the work around needs to be
ported and tested. Thus, this will be put on hold until the above
work is complete.
* The rate control module 'sample' is the only currently supported
module; onoe/amrr haven't been tested and have likely bit rotted
a little. I'll follow up with some commits to make them work again
for non-11n rates, but they won't be updated to handle 11n and
aggregation. If someone wishes to do so then they're welcome to
send along patches.
* .. and "sample" doesn't really do a good job of 11n TX. Specifically,
the metrics used (packet TX time and failure/success rates) isn't as
useful for 11n. It's likely that it should be extended to take into
account the aggregate throughput possible and then choose a rate
which maximises that. Ie, it may be acceptable for a higher MCS rate
with a higher failure to be used if it gives a more acceptable
throughput/latency then a lower MCS rate @ a lower error rate.
Again, patches will be gratefully accepted.
Because of this, ATH_ENABLE_11N is still not enabled by default.
Sponsored by: Hobnob, Inc.
Obtained from: Linux, Atheros
preparation for TX aggregation.
* Add in logic which calls ath_buf bf->bf_comp if it's set.
This allows for AMPDU (and RIFS, and FF, if someone desires) code
to handle completion - which includes freeing subframes, retransmitting
subframes, etc.
* Break out the buffer free, buffer busy/unbusy default completion handler
code into separate functions. This allows bf_comp methods to free and
unbusy each subframe ath_buf as required.
* Break out the statistics update code into a separate function, just
to clean up the TX completion path a little.
Sponsored by: Hobnob, Inc.
* Immediately return NULL if a buffer isn't available;
* Track the "buffers not available" count;
* Clear some fields used for tx aggregation;
* Add ath_buf_clone() which clones the majority of buffer state.
This is needed when retransmission of a "busy" buffer is required.
Sponsored by: Hobnob, Inc.
and interface resets to be marked as ATH_RESET_DEFAULT, ATH_RESET_FULL,
ATH_RESET_NOLOSS.
Currently a reset is still a reset - ie, all tx/rx frames in the hardware
queues are purged. This means that those frames will be lost to the 11n TX
and RX aggregation state tracking, breaking AMPDU sessions.
The (eventual) new semantics:
* ATH_RESET_DEFAULT:
full reset, this is the default for reset situations
which I haven't yet figured out what they should be.
* ATH_RESET_FULL:
A full reset - for things such as channel changes.
* ATH_RESET_NOLOSS:
Don't flush TX/RX queues - handle pending RX frames and leave TX
frames where they are; restart TX DMA from where it was.
There's two reasons for this:
* the raw and non-raw TX path shares a lot of duplicate code which should be
refactored;
* the 11n-ready chip TX path needs a little reworking.