array, similar to what filltxdesc() uses.
This removes the last reference to ds_data in the TX path outside of
debugging statements. These need to be adjusted/fixed.
Tested:
* AR9280 STA/AP with iperf TCP traffic
The existing API only exposes 'seglen' (the current buffer (segment) length)
with the data buffer pointer set in 'ds_data'. This is fine for the legacy
DMA engine but it won't work for the EDMA engines.
The EDMA engine has a significantly different TX descriptor layout.
* The legacy DMA engine had a ds_data pointer at the same offset in the
descriptor for both TX and RX buffers;
* The EDMA engine has no ds_data for RX - the data is DMAed after the
descriptor;
* The EDMA engine has support for 4 TX buffer/segment pairs in the TX
DMA descriptor;
* The EDMA TX completion is in a different FIFO, and the driver will
'link' the status completion entry to a QCU by a "QCU ID".
I don't know why it's just not filled in by the hardware, alas.
So given that, here are the changes:
* Instead of directly fondling 'ds_data' in ath_desc, change the
ath_hal_filltxdesc() to take an array of buffer pointers as well
as segment len pointers;
* The EDMA TX completion status wants a descriptor and queue id.
This (for now) uses bf_state.bfs_txq and will extract the hardware QCU
ID from that.
* .. and this is ugly and wasteful; it should change to just store
the QCU in the bf_state and save 3/7 bytes in the process.
Now, the weird crap:
* The aggregate TX path was using bf_state->bfs_txq for the TXQ, rather than
taking a function argument. I've tidied that up.
* The multicast queue frames get put on a software TXQ and then that is
appended to the hardware CABQ when appropriate. So for now, make sure
that bf_state->bfs_txq points at the CABQ when adding frames to the
multicast queue.
* .. but the multicast queue TX path for now doesn't use the software
queue and instead
(a) directly sets up the descriptor contents at that point;
(b) the frames on the vap->avp_mcastq are then just appended wholesale
to the CABQ.
So for now, I don't have to worry about making the multicast path
work with aggregation or the per-TID software queue. Phew.
What's left to do:
* I need to modify the 11n ath_hal_chaintxdesc() API to do the same.
I'll do that in a subsequent commit.
* Remove bf_state.bfs_txq entirely and store the QCU as appropriate.
* .. then do the runtime "is this going on the right HWQ?" checks using
that, rather than comparing pointer values.
Tested on:
* AR9280 STA/AP
* AR5416 STA/AP
When forming aggregates, the last descriptor was now not being
correctly setup - instead, the "setuplasttxdesc" call was being
handed the first descriptor in the last subframe, rather than the
last descriptor in the last subframe.
This showed up as "bad series0 hwrate" messages, as the final
descriptor just didn't have any of the rate control information
squirreled away.
Tested:
* AR9280 STA -> 11n AP, iperf TCP
enabled.
The legacy (pre-802.11n) hardware doesn't support this - although
the AR5212 era hardware supports MRR, it doesn't have all the bits
needed to support MRR + RTS/CTS. The AR5416 and later support
a packet duration and RTS/CTS flags per rate scenario, so we should
support it.
Tested:
* AR9280, STA
PR: kern/170302
code is called and remove it from ath_buf_set_rate().
For the legacy (non-11n API) TX routines, ath_hal_filltxdesc() takes care
of setting up the intermediary and final descriptors right, complete
with copying the rate control info into the final descriptor so the
rate modules can grab it.
The 11n version doesn't do this - ath_hal_chaintxdesc() doesn't
copy the rate control bits over, nor does it clear isaggr/moreaggr/
pad delimiters. So the call to setuplasttxdesc() is needed here.
So:
* legacy NICs - never call the 11n rate control stuff, so filltxdesc
copies the rate control info right;
* 11n NICs transmitting legacy or 11n non-aggregate frames -
ath_hal_set11nratescenario() is called to setup rate control and
then ath_hal_filltxdesc() chains them together - so the rate control
info is right;
* 11n aggregate frames - set11nratescenario() is called, then
ath_hal_chaintxdesc() is called to chain a list of aggregate and subframes
together. This requires a call to ath_hal_setuplasttxdesc() to complete
things.
Tested:
* AR9280 in station mode
TODO:
* I really should make sure that the descriptor contents get blanked
out correctly or garbage left over from aggregate frames may show
up in non-aggregate frames, leading to badness.
functions, for both legacy and 802.11n.
This will simplify supporting the EDMA chipsets as these two descriptor
setup functions can just be overridden in their entirety, hiding all of
the subtle differences in setting things up.
It's not a permanent solution, as eventually the AR5416 HAL should grow
similar versions of the 11n descriptor functions and then those can be
used.
TODO:
* Push the "clr11naggr" call into the legacy setds, just to ensure
that retried frames don't end up with the aggregate bits set
inappropriately;
* Remove the "setlasttxdesc" call from the 11n TX path and push it
into setds_11n.
* Ensure that setds_11n will work correctly for non-aggregate frames;
* .. and then when it does, just unconditionally call "setds_11n" for
11n NICs and "setds" for non-11n NICs.
These (and a few others) will differ based on the underlying DMA
implementation.
For the EDMA NICs, simply stub them out in a fashion which will let
me focus on implementing the necessary descriptor API changes.
The correct ordering for non-aggregate TX is:
* call ath_hal_setuptxdesc() to setup the first TX descriptor complete
with the first TX rate/try count;
* call ath_hal_setupxtxdesc() to setup the multi-rate retry;
* .. or for 802.11n NICs, call ath_hal_set11nratescenario() for MRR and
802.11n flags;
* then call ath_hal_filltxdesc() to setup intermediary descriptors
in a multi-descriptor single frame.
The call to ath_hal_filltxdesc() routines seem to correctly (consistently?)
handle the intermediary descriptor flags, including copying the rate
control information to the final descriptor in the frame. That's used
by the rate control module rather than the hardware.
Tested:
* Only on AR9280 STA mode, however it should work on other chips in
both STA and AP mode.
The AR9300 and later descriptors are 128 bytes, however I'd like to make
sure that isn't used for earlier chips.
* Populate the TX descriptor length field in the softc with
sizeof(ath_desc)
* Use this field when allocating the TX descriptors
* Pre-AR93xx TX/RX descriptors will use the ath_desc size; newer ones will
query the HAL for these sizes.
* Introduce TX DMA setup/teardown methods, mirroring what's done in
the RX path.
Although the TX DMA descriptor is setup via ath_desc_alloc() /
ath_desc_free(), there TX status descriptor ring will be allocated
in this path.
* Remove some of the TX EDMA capability probing from the RX path and
push it into the new TX EDMA path.
sized TX descriptor.
This is required for the AR93xx EDMA support which requires 128 byte
TX descriptors (which is significantly larger than the earlier
hardware.)
TX descriptor link pointers.
This is required for the AR93xx and later chipsets.
The RX path is slightly different - the legacy RX path directly
accesses ath_desc->ds_link for now, however this isn't at all done
for EDMA (FIFO) RX.
Now, for those performing a little software archeology here:
This is all a bit sub-optimal. "struct ath_desc" is only really relevant
for the pre-AR93xx NICs - where ds_link and ds_data is always in the
same location.
The AR93xx and later NICs have different descriptor layouts altogether.
Now, for AR93xx and later NICs, you should never directly reference
ds_link and ds_data, as:
* the RX descriptors don't have either - the data is _after_ the RX
descriptor. They're just one large buffer. There's also no need for
a per-descriptor RX buffer size as they're all fixed sizes.
* the TX descriptors have 4 buffer and 4 length fields _and_ a link
pointer. Each frame takes up one TX FIFO pointer, but it can contain
multiple subframes (either multiple frames in a buffer, and/or
multiple frames in an aggregate/RIFS burst.)
* .. so, when TX frames are queued to a hardware queue, the link
pointer is ONLY for buffers in that frame/aggregate. The next frame
starts in a new FIFO pointer.
* Finally, descriptor completion status is in a different ring.
I'll write something up about that when its time to do so.
This was inspired by Linux ath9k and the reference driver but is a
reimplementation.
Obtained from: Linux ath9k, Qualcomm Atheros
traffic.
* Create sc_mgmt_txbuf and sc_mgmt_txdesc, initialise/free them appropriately.
* Create an enum to represent buffer types in the API.
* Extend ath_getbuf() and _ath_getbuf_locked() to take the above enum.
* Right now anything sent via ic_raw_xmit() allocates via ATH_BUFTYPE_MGMT.
This may not be very useful.
* Add ATH_BUF_MGMT flag (ath_buf.bf_flags) which indicates the current buffer
is a mgmt buffer and should go back onto the mgmt free list.
* Extend 'txagg' to include debugging output for both normal and mgmt txbufs.
* When checking/clearing ATH_BUF_BUSY, do it on both TX pools.
Tested:
* STA mode, with heavy UDP injection via iperf. This filled the TX queue
however BARs were still going out successfully.
TODO:
* Initialise the mgmt buffers with ATH_BUF_MGMT and then ensure the right
type is being allocated and freed on the appropriate list. That'd save
a write operation (to bf->bf_flags) on each buffer alloc/free.
* Test on AP mode, ensure that BAR TX and probe responses go out nicely
when the main TX queue is filled (eg with paused traffic to a TID,
awaiting a BAR to complete.)
PR: kern/168170
(or direct dispatch) behind the TXQ lock (which, remember, is doubling
as the TID lock too for now.)
This ensures that:
(a) the sequence number and the CCMP PN allocation is done together;
(b) overlapping transmit paths don't interleave frames, so we don't
end up with the original issue that triggered kern/166190.
Ie, that we don't end up with seqno A, B in thread 1, C, D in
thread 2, and they being queued to the software queue as "A C D B"
or similar, leading to the BAW stalls.
This has been tested:
* both STA and AP modes with INVARIANTS and WITNESS;
* TCP and UDP TX;
* both STA->AP and AP->STA.
STA is a Routerstation Pro (single CPU MIPS) and the AP is a dual-core
Centrino.
PR: kern/166190
scheduled from the head of the software queue rather than trying to
queue the newly given frame.
This leads to some rather unfortunate out of order (but still valid
as it's inside the BAW) frame TX.
This now:
* Always queues the frame at the end of the software queue;
* Tries to direct dispatch the frame at the head of the software queue,
to try and fill up the hardware queue.
TODO:
* I should likely try to queue as many frames to the hardware as I can
at this point, rather than doing one at a time;
* ath_tx_xmit_aggr() may fail and this code assumes that it'll schedule
the TID. Otherwise TX may stall.
PR: kern/166190
This is an unfortunate byproduct of how the routine is used - it's called
with the head frame on the queue, but if the frame is failed, it's inserted
into the tail of the queue.
Because of this, the sequence numbers would get all shuffled around and
the BAW would be bumped past this sequence number, that's now at the
end of the software queue. Then, whenever it's time for that frame
to be transmitted, it'll be immediately outside of the BAW and TX will
stall until the BAW catches up.
It can also result in all kinds of weird duplicate BAW frames, leading
to hilarious panics.
PR: kern/166190
This showed up when doing heavy UDP throughput on SMP machines.
The problem with this is because the 802.11 sequence number is being
allocated separately to the CCMP PN replay number (which is assigned
during ieee80211_crypto_encap()).
Under significant throughput (200+ MBps) the TX path would be stressed
enough that frame TX/retry would force sequence number and PN allocation
to be out of order. So once the frames were reordered via 802.11 seqnos,
the CCMP PN would be far out of order, causing most frames to be discarded
by the receiver.
I've fixed this in some local work by being forced to:
(a) deal with the issues that lead to the parallel TX causing out of
order sequence numbers in the first place;
(b) fix all the packet queuing issues which lead to strange (but mostly
valid) TX.
I'll begin fixing these in a subsequent commit or five.
PR: kern/166190
I've come across a weird scenario in net80211 where two TX streams will
happily attempt to setup an aggregation session together.
If we're very lucky, it happens concurrently on separate CPUs and the
total lack of locking in the net80211 aggregation code causes this stuff
to race. Badly.
So >1 call would occur to the ath(4) addba start, but only one call would
complete to addba complete or timeout. The TID would thus stay paused.
The real fix is to implement some proper per-node (or maybe per-TID)
locking in net80211, which then could be leveraged by the ath(4) TX
aggregation code.
Whilst I'm at it, shuffle around the debugging messages a bit.
I like to keep people on their toes.
* migrate the rx processing out into if_ath_rx.c
* migrate the TSF functions into if_ath_tsf.h, as inlines
This is in prepration for supporting the EDMA RX routines, required to
support the AR93xx series NICs.
TODO:
* ath_start() shouldn't be private, but it's called as part of
the RX path. I should likely migrate ath_rx_tasklet() back into
if_ath.c and then return this to be 'static'. The RX code really
shouldn't need to see TX routines (and vice versa.)
* ath_beacon_* should be in if_ath_beacon.[ch].
* ath_tdma_* should be in if_ath_tdma.[ch] ...
add some more BAR debugging logic.
* Change the definition of ath_debug and ath_softc.sc_debug from
int to uint64_t;
* Change the relevant sysctls;
* Add a new BAR TX debugging field;
* Use this in if_ath_tx.
This has been tested by using the sysctl program, which happily allows
for fields > 32 bits to be configured.
Although I _should_ handle the other errors in various ways (specifically
errors like FILT), treating them as having transmitted successfully
is completely wrong. Here, they'd be counted as successful and the BAW
would be advanced.. but the RX side wouldn't have received them.
The specific errors I've been seeing here are HAL_TXERR_FILT.
This patch does fix the issue - I've tested it using -i 0.001 pings
(enough to start aggregation) and now the behaviour is correct:
* The RX side never sees a "moved window" error, and
* The TX side sends BARs as needed, with the RX side correctly handling
them.
PR: kern/167902
damage which I committed when I had less clue about such things.
Don't ever put normal data frames on the mcast software queue.
Just put mcast frames there if needed.
Pass the txq decision into ath_tx_normal_setup(), as we've already made
the decision. Don't re-do it.
Whilst i'm here, add another random debugging statement.
call these after rate control selection is done.
The duration/protection code wasn't working - it expected the rix to
be valid. Unfortunately after I moved the rate control selection into
late in the process, the rix value isn't valid and thus the protection/
duration code would get things wrong.
HT frames are now correctly protected with an RTS and for the AR5416,
this involves having the aggregate frames be limited to 8K.
TODO:
* Fix up the DMA sync to occur just before the frame is queued to the
hardware. I'm adjusting the duration here but not doing the DMA
flush.
* Doubly/triply ensure that the aggregate frames are being limited to
the correct size, or the AR5416 will get unhappy when TXing RTS-protected
aggregates.
A BAR frame must be transmitted when an frame in an A-MPDU session fails
to transmit - it's retried too often, or it can't be cloned for
re-transmission. The BAR frame tells the remote side to advance the
left edge of the block-ack window (BAW) to a new value.
In order to do this:
* TX for that particular node/TID must be paused;
* The existing frames in the hardware queue needs to be completed, whether
they're TXed successfully or otherwise;
* The new left edge of the BAW is then communicated to the remote side
via a BAR frame;
* Once the BAR frame has been sucessfully TXed, aggregation can resume;
* If the BAR frame can't be successfully TXed, the aggregation session
is torn down.
This is a first pass that implements the above. What needs to be done/
tested:
* What happens during say, a channel reset / stuck beacon _and_ BAR
TX. It _should_ be correctly buffered and retried once the
reset has completed. But if a bgscan occurs (and they shouldn't,
grr) the BAR frame will be forcibly failed and the aggregation session
will be torn down.
Yes, another reason to disable bgscan until I've figured this out.
* There's way too much locking going on here. I'm going to do a couple
of further passes of sanitising and refactoring so the (re) locking
isn't so heavy. Right now I'm going for correctness, not speed.
* The BAR TX can fail if the hardware TX queue is full. Since there's
no "free" space kept for management frames, a full TX queue (from eg
an iperf test) can race with your ability to allocate ath_buf/mbufs
and cause issues. I'll knock this on the head with a subsequent
commit.
* I need to do some _much_ more thorough testing in hostap mode to ensure
that many concurrent traffic streams to different end nodes are correctly
handled. I'll find and squish whichever bugs show up here.
But, this is an important step to being able to flip on 802.11n by default.
The last issue (besides bug fixes, of course) is HT frame protection and
I'll address that in a subsequent commit.
Linux ath9k doesn't have this issue as it doesn't try queuing multi-
descriptor frames to the hardware.
Before, I was only setting the first and last descriptor in the final
frame correctly - and that was done by accident. The first descriptor in
the last sub-frame was being correctly updated by ath_tx_setds_11n();
the last descriptor in the last sub-frame was being correctly updated
by ath_buf_set_rate(). But both of those are "incorrect".
The correct behaviour is:
* AR_IsAggr is set for all descriptors for all subframes in an aggregate.
* AR_MoreAggr is set for all descriptors for all non-final sub-frames
in an aggregate.
Ie, all descriptors in the last sub-frame of an aggregate must have this
field set to 0.
I still need to do a couple of extra passes to ensure the pad delimiter
field is being correctly handled in all descriptors in the last sub-frame.
Right now ath_txq_sched() is mainly called from the TX ath_tx_processq()
routine, which is (mostly) done as part of the taskqueue. It shouldn't
be called outside the taskqueue.
But now that I'm about to flip back on BAR TX, I'm going to start
stressing the ath_tx_tid_pause() and ath_tx_tid_resume() paths.
What I don't want to have happen is a reschedule of the TID traffic
_during_ the completion of TX frames.
Ideally I'd like to have a way to flag back up to the processing code
that the current hardware queue should be rechecked for software TID
queue frames. But for now, this should suffice for the BAR TX case.
I may eventually delete this code once I've brought some further
sanity to the general TX queue/completion path.
within the BAW.
This regression was introduced in ane earlier commit by me to fix the
BAW seqno allocation-but-not-insertion-into-BAW race. Since it was only
ever using the to-be allocated sequence number, any frame retries
with the first frame in the BAW still in the software queue would
have constantly failed, as ni_txseqs[tid] would always be outside
the BAW.
TODO:
* Extract out the mostly common code here in the agg and non-agg ADDBA
case and stuff it into a single function.
PR: kern/166357
I see traffic stalls.
It turns out that the bug isn't because the first and last frame in the
BAW is in the software queue. It is more likely that it's because
the first frame in the BAW is still in the software queue and thus there's
no more room to allocate and do subsequent TX.
PR: kern/166357