[ath] fix thresholds for deciding to queue to the software queue and populate hardware frames

This is two fixes which, together, establish what I /think/ is pretty
close to the theoretical PHY maximum speed on the AR9380 devices.

* When doing A-MPDU on a TID, don't queue frames to the hardware directly
  if the hardware queue is busy.  This gives us time to get more packets
  queued up (the hardware is busy anyway, so there's no point in pushing
  more at it right now) and potentially form an A-MPDU.

  This fixes the throughput issue I was seeing, where a couple hundred
  single frames per second were being sent interspersed between A-MPDU
  frames.  It just happened that the software queue had exactly one
  frame in it at that point.  Holding it there until the hardware
  finishes transmitting isn't exactly costly.

* When determining whether to dequeue from a software node/TID queue into
  the hardware queue, fix up the checks to work right for EDMA chips
  (AR9380 and later).  Before this, nothing was dispatched until the
  FIFO was empty.  Now we allow another aggregate to be dispatched, up
  to the hardware aggregate limit, like I intended with the earlier
  work.  (Both checks are sketched below.)

This allows a 5GHz HT40, short-GI, "htprotmode off" test at MCS23
to achieve 357 Mbit/sec in a one-way UDP test.  The stars have to be
aligned /just right/ so there are no retries, but it can happen.
Just don't expect it to work in an OTA test if your 2yo is running
around the room - MCS23 is very, very sensitive to channel conditions.

Tested:

* AR9380 STA (test) -> AR9580 hostap

TODO:

* More thorough testing on pre-AR9380 chips (AR5416, AR9160, AR9280)
* (Finally) teach ath_rate_sample about throughput/latency rather than
  air time, so I can get good transmit rates with a 2yo running around.
Adrian Chadd 2017-01-23 04:30:08 +00:00
parent 6eb9f206a6
commit 57af292d36
Notes: svn2git 2020-12-20 02:59:44 +00:00
svn path=/head/; revision=312661


@@ -3129,7 +3129,21 @@ ath_tx_swq(struct ath_softc *sc, struct ieee80211_node *ni,
                 ATH_TID_INSERT_TAIL(atid, bf, bf_list);
                 /* XXX sched? */
         } else if (ath_tx_ampdu_running(sc, an, tid)) {
-                /* AMPDU running, attempt direct dispatch if possible */
+                /*
+                 * AMPDU running, queue single-frame if the hardware queue
+                 * isn't busy.
+                 *
+                 * If the hardware queue is busy, sending an aggregate frame
+                 * then just hold off so we can queue more aggregate frames.
+                 *
+                 * Otherwise we may end up with single frames leaking through
+                 * because we are dispatching them too quickly.
+                 *
+                 * TODO: maybe we should treat this as two policies - minimise
+                 * latency, or maximise throughput. Then for BE/BK we can
+                 * maximise throughput, and VO/VI (if AMPDU is enabled!)
+                 * minimise latency.
+                 */
                 /*
                  * Always queue the frame to the tail of the list.
@@ -3138,18 +3152,18 @@ ath_tx_swq(struct ath_softc *sc, struct ieee80211_node *ni,
                 /*
                  * If the hardware queue isn't busy, direct dispatch
-                 * the head frame in the list. Don't schedule the
-                 * TID - let it build some more frames first?
+                 * the head frame in the list.
                  *
-                 * When running A-MPDU, always just check the hardware
-                 * queue depth against the aggregate frame limit.
-                 * We don't want to burst a large number of single frames
-                 * out to the hardware; we want to aggressively hold back.
+                 * Note: if we're say, configured to do ADDBA but not A-MPDU
+                 * then maybe we want to still queue two non-aggregate frames
+                 * to the hardware. Again with the per-TID policy
+                 * configuration..)
                  *
                  * Otherwise, schedule the TID.
                  */
                 /* XXX TXQ locking */
-                if (txq->axq_depth + txq->fifo.axq_depth < sc->sc_hwq_limit_aggr) {
+                if (txq->axq_depth + txq->fifo.axq_depth == 0) {
                         bf = ATH_TID_FIRST(atid);
                         ATH_TID_REMOVE(atid, bf, bf_list);
@@ -5615,19 +5629,40 @@ ath_txq_sched(struct ath_softc *sc, struct ath_txq *txq)
         ATH_TX_LOCK_ASSERT(sc);
         /*
-         * Don't schedule if the hardware queue is busy.
-         * This (hopefully) gives some more time to aggregate
-         * some packets in the aggregation queue.
+         * For non-EDMA chips, aggr frames that have been built are
+         * in axq_aggr_depth, whether they've been scheduled or not.
+         * There's no FIFO, so txq->axq_depth is what's been scheduled
+         * to the hardware.
          *
-         * XXX It doesn't stop a parallel sender from sneaking
-         * in transmitting a frame!
+         * For EDMA chips, we do it in two stages. The existing code
+         * builds a list of frames to go to the hardware and the EDMA
+         * code turns it into a single entry to push into the FIFO.
+         * That way we don't take up one packet per FIFO slot.
+         * We do push one aggregate per FIFO slot though, just to keep
+         * things simple.
+         *
+         * The FIFO depth is what's in the hardware; the txq->axq_depth
+         * is what's been scheduled to the FIFO.
+         *
+         * fifo.axq_depth is the number of frames (or aggregates) pushed
+         * into the EDMA FIFO. For multi-frame lists, this is the number
+         * of frames pushed in.
+         * axq_fifo_depth is the number of FIFO slots currently busy.
          */
-        /* XXX TXQ locking */
-        if (txq->axq_aggr_depth + txq->fifo.axq_depth >= sc->sc_hwq_limit_aggr) {
+        /* For EDMA and non-EDMA, check built/scheduled against aggr limit */
+        if (txq->axq_aggr_depth >= sc->sc_hwq_limit_aggr) {
                 sc->sc_aggr_stats.aggr_sched_nopkt++;
                 return;
         }
-        if (txq->axq_depth >= sc->sc_hwq_limit_nonaggr) {
+        /*
+         * For non-EDMA chips, axq_depth is the "what's scheduled to
+         * the hardware list". For EDMA it's "What's built for the hardware"
+         * and fifo.axq_depth is how many frames have been dispatched
+         * already to the hardware.
+         */
+        if (txq->axq_depth + txq->fifo.axq_depth >= sc->sc_hwq_limit_nonaggr) {
                 sc->sc_aggr_stats.aggr_sched_nopkt++;
                 return;
         }