[ath] fix thresholds for deciding to queue to the software queue and populate hardware frames

This is two fixes which, together, establish what I /think/ is pretty
close to the theoretical PHY maximum speed on the AR9380 devices.

* When doing A-MPDU on a TID, don't queue frames to the hardware directly
  if the hardware queue is busy.  This gives us time to get more packets
  queued up (the hardware is busy anyway, so there's no point in pushing
  more at it right now) and potentially form an A-MPDU.

  This fixes the throughput issue I was seeing, where a couple hundred
  single frames per second were being sent interspersed between A-MPDU
  frames.  It just happened that the software queue had exactly one
  frame in it at that point.  Holding it there until the hardware
  finishes transmitting isn't exactly costly.

* When determining whether to dequeue from a software node/TID queue into
  the hardware queue, fix up the checks to work right for EDMA chips
  (AR9380 and later).  Before this, nothing was dispatched until the
  FIFO was empty.  Now we allow another aggregate to be dispatched, up
  to the hardware aggregate limit, like I intended with the earlier
  work.  (Both checks are sketched below.)

This allows a 5GHz HT40, short-GI, "htprotmode off" test at MCS23
to achieve 357 Mbit/sec in a one-way UDP test.  The stars have to be
aligned /just right/ so there are no retries, but it can happen.
Just don't expect it to work in an OTA test if your 2yo is running
around the room - MCS23 is very, very sensitive to channel conditions.

Tested:

* AR9380 STA (test) -> AR9580 hostap

TODO:

* More thorough testing on pre-AR9380 chips (AR5416, AR9160, AR9280)
* (Finally) teach ath_rate_sample about throughput/latency rather than
  air time, so I can get good transmit rates with a 2yo running around.
Adrian Chadd 2017-01-23 04:30:08 +00:00
parent 6eb9f206a6
commit 57af292d36
Notes: svn2git 2020-12-20 02:59:44 +00:00
svn path=/head/; revision=312661


@@ -3129,7 +3129,21 @@ ath_tx_swq(struct ath_softc *sc, struct ieee80211_node *ni,
                 ATH_TID_INSERT_TAIL(atid, bf, bf_list);
                 /* XXX sched? */
         } else if (ath_tx_ampdu_running(sc, an, tid)) {
-                /* AMPDU running, attempt direct dispatch if possible */
+                /*
+                 * AMPDU running, queue single-frame if the hardware queue
+                 * isn't busy.
+                 *
+                 * If the hardware queue is busy, sending an aggregate frame
+                 * then just hold off so we can queue more aggregate frames.
+                 *
+                 * Otherwise we may end up with single frames leaking through
+                 * because we are dispatching them too quickly.
+                 *
+                 * TODO: maybe we should treat this as two policies - minimise
+                 * latency, or maximise throughput. Then for BE/BK we can
+                 * maximise throughput, and VO/VI (if AMPDU is enabled!)
+                 * minimise latency.
+                 */
                 /*
                  * Always queue the frame to the tail of the list.
@@ -3138,18 +3152,18 @@ ath_tx_swq(struct ath_softc *sc, struct ieee80211_node *ni,
                 /*
                  * If the hardware queue isn't busy, direct dispatch
-                 * the head frame in the list. Don't schedule the
-                 * TID - let it build some more frames first?
+                 * the head frame in the list.
                  *
-                 * When running A-MPDU, always just check the hardware
-                 * queue depth against the aggregate frame limit.
-                 * We don't want to burst a large number of single frames
-                 * out to the hardware; we want to aggressively hold back.
+                 * Note: if we're say, configured to do ADDBA but not A-MPDU
+                 * then maybe we want to still queue two non-aggregate frames
+                 * to the hardware. Again with the per-TID policy
+                 * configuration..)
                  *
                  * Otherwise, schedule the TID.
                  */
                 /* XXX TXQ locking */
-                if (txq->axq_depth + txq->fifo.axq_depth < sc->sc_hwq_limit_aggr) {
+                if (txq->axq_depth + txq->fifo.axq_depth == 0) {
                         bf = ATH_TID_FIRST(atid);
                         ATH_TID_REMOVE(atid, bf, bf_list);
@@ -5615,19 +5629,40 @@ ath_txq_sched(struct ath_softc *sc, struct ath_txq *txq)
         ATH_TX_LOCK_ASSERT(sc);
         /*
-         * Don't schedule if the hardware queue is busy.
-         * This (hopefully) gives some more time to aggregate
-         * some packets in the aggregation queue.
+         * For non-EDMA chips, aggr frames that have been built are
+         * in axq_aggr_depth, whether they've been scheduled or not.
+         * There's no FIFO, so txq->axq_depth is what's been scheduled
+         * to the hardware.
          *
-         * XXX It doesn't stop a parallel sender from sneaking
-         * in transmitting a frame!
+         * For EDMA chips, we do it in two stages. The existing code
+         * builds a list of frames to go to the hardware and the EDMA
+         * code turns it into a single entry to push into the FIFO.
+         * That way we don't take up one packet per FIFO slot.
+         * We do push one aggregate per FIFO slot though, just to keep
+         * things simple.
+         *
+         * The FIFO depth is what's in the hardware; the txq->axq_depth
+         * is what's been scheduled to the FIFO.
+         *
+         * fifo.axq_depth is the number of frames (or aggregates) pushed
+         * into the EDMA FIFO. For multi-frame lists, this is the number
+         * of frames pushed in.
+         * axq_fifo_depth is the number of FIFO slots currently busy.
          */
-        /* XXX TXQ locking */
-        if (txq->axq_aggr_depth + txq->fifo.axq_depth >= sc->sc_hwq_limit_aggr) {
+        /* For EDMA and non-EDMA, check built/scheduled against aggr limit */
+        if (txq->axq_aggr_depth >= sc->sc_hwq_limit_aggr) {
                 sc->sc_aggr_stats.aggr_sched_nopkt++;
                 return;
         }
-        if (txq->axq_depth >= sc->sc_hwq_limit_nonaggr) {
+        /*
+         * For non-EDMA chips, axq_depth is the "what's scheduled to
+         * the hardware list". For EDMA it's "What's built for the hardware"
+         * and fifo.axq_depth is how many frames have been dispatched
+         * already to the hardware.
+         */
+        if (txq->axq_depth + txq->fifo.axq_depth >= sc->sc_hwq_limit_nonaggr) {
                 sc->sc_aggr_stats.aggr_sched_nopkt++;
                 return;
         }