freebsd-skq/sys
Adrian Chadd cce6344402 [ath] [ath_rate] Extend ath_rate_sample to better handle 11n rates and aggregates.
My initial rate control code was .. suboptimal.  I wanted to at least get MCS
rates sent, but it didn't do anywhere near enough to handle low signal level links
or remotely keep accurate statistics.

So, 8 years later, here's what I should've done back then.

* Firstly, I wasn't at all tracking packet sizes other than the two buckets
  (250 and 1600 bytes.)  So, extend it to include 4096, 8192, 16384, 32768 and
  65536.  I may go add 2048 at some point if I find it's useful.

  This is important for a few reasons.  First, when forming A-MPDU or A-MSDU
  aggregates the frame sizes are larger, and thus the TX time calculation
  is woefully, increasingly wrong.  Secondly, the behaviour of 802.11 channels
  isn't some fixed thing, both due to channel conditions and radios themselves.
  Notably, there were some observations made a few years ago on 11n chipsets
  which showed that longer aggregates had an increase in failed A-MPDU sub-frame
  reception as you got further along in the transmit time.  It could be due to
  a variety of things - transmitter linearity, channel conditions changing,
  frequency/phase drift, etc - but the observation was to potentially form
  shorter aggregates to improve BER.

* .. and then modify the ath TX path to report the length of the aggregate sent,
  so that the statistics kept line up with the correct bucket.

* Then on the rate control look-up side - I was also only using the first frame
  length for an A-MPDU rate control lookup, which isn't good enough here.
  So, add a new method that walks the TID software queue for that node to
  find out what the likely length of data available is.  It isn't ALL of the
  data in the queue because we'll only ever send enough data to fit inside the
  block-ack window, so limit how many bytes we return to roughly what ath_tx_form_aggr()
  would do.

* .. and cache that in the first ath_buf in the aggregate so it and the eventual
  A-MPDU length can be returned to the rate control code.

* THEN, modify the rate control code to look at them both when deciding which bucket
  to attribute the sent frame to.  I'm erring on the side of caution and using the
  size bucket that the lookup is based on.
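
As a rough illustration of the size bucketing described above, here is a
minimal sketch in C.  The bucket bounds come from the list above; the names
(sketch_packet_size_bins, sketch_size_to_bin) and the "first bucket that
fits, clamp to the largest" rule are assumptions made for illustration, not
necessarily what the driver does.

```c
#include <assert.h>
#include <stddef.h>

/* Bucket upper bounds in bytes, as listed in the commit message. */
static const int sketch_packet_size_bins[] = {
	250, 1600, 4096, 8192, 16384, 32768, 65536
};

#define SKETCH_NUM_BINS \
	(sizeof(sketch_packet_size_bins) / sizeof(sketch_packet_size_bins[0]))

/*
 * Hypothetical helper: return the index of the first bucket whose bound
 * is >= length.  Anything larger than the biggest bound is clamped into
 * the last bucket.
 */
static int
sketch_size_to_bin(int length)
{
	size_t i;

	assert(length >= 0);
	for (i = 0; i < SKETCH_NUM_BINS; i++) {
		if (length <= sketch_packet_size_bins[i])
			return ((int)i);
	}
	return ((int)(SKETCH_NUM_BINS - 1));
}
```

With this rule a single 1500 byte frame lands in the 1600 bucket, while a
20000 byte aggregate lands in the 32768 bucket, so their TX time statistics
no longer get mixed together.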

Ok, so now the rate lookups and statistics are "more correct".  However, MCS rates
are not the same as 11abg rates in that they're not a monotonically incrementing
set of faster rates and you can't assume that just because a given MCS rate fails,
the next higher one wouldn't work better or be a lower average tx time.

So, I had to do a bunch of surgery to the best rate and sample rate math.
This is the bit that's a WIP.

* First, simplify the statistics updates (update_stats()) to do a single pass on
  all rates.
* Next, make sure that each rate's average tx time is updated based on /its/ failure/success.
  E.g. if you sent a frame with { MCS15, MCS12, MCS8 } and MCS8 succeeded, MCS15 and
  MCS12 would have their average tx time updated for /their/ part of the transmission,
  not the whole transmission.
* Next, EWMA wasn't being fully calculated based on the /failures/ in each of the
  rate attempts.  So, if MCS15, MCS12 failed above but MCS8 didn't, then ensure
  that the statistics noted that /all/ subframes failed at those rates, rather than
  the eventual set of transmitted/sent frames.   This ensures the EWMA /and/ average
  TX time are updated correctly.
* When picking a sample rate and initial rate, probe rates around the current MCS
  but limit it to MCS0..7 /for all spatial streams/, rather than doing crazy things
  like hitting MCS7 and then probing MCS8 - MCS8 is basically MCS0 but two spatial
  streams.  It's a /lot/ slower than MCS7.  Also, the reverse is true - if we're at
  MCS8 then don't probe MCS7 as part of it, it's not likely to succeed.
* Fix bugs in pick_best_rate() where I was /immediately/ choosing the highest MCS
  rate if there weren't any frames yet transmitted.  I was defaulting to 25% EWMA and
  .. then each comparison would accept the higher rate.  Just skip those; sampling
  will fill in the details.

So, this seems to work a lot better.  It's not perfect; I'm still seeing a lot of
instability around higher MCS rates because there are bursts of loss/retransmissions
that aren't /too/ bad.  But I'll keep iterating over this and tidying up my hacks.

Ok, so why is this still something I'm poking at, rather than porting minstrel_ht?

ath_rate_sample tries to minimise airtime, not maximise throughput.  I have
extended it with an EWMA based on sub-frame success/failures - high MCS rates
that have partially successful receptions still show super short average frame
times, but a /lot/ of retransmits have to happen for that to work.
So for MCS rates I also track this EWMA and ensure that the rates I'm choosing
don't have super crappy packet failures.  I don't mind getting lower
peak throughput than minstrel_ht; instead I want to see if I can make "minimise
airtime" work well.
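
For reference, a sub-frame success EWMA of the kind mentioned above can be
sketched like this.  The function name, the integer percentage arithmetic
and the smoothing_pct parameter are assumptions for illustration; the actual
weighting used by the driver may differ.

```c
#include <assert.h>

/*
 * Hypothetical EWMA update.  nframes sub-frames were attempted at a rate
 * and nbad of them failed.  smoothing_pct is the weight (0..100) given to
 * the old value; the new sample gets the remainder.
 */
static int
sketch_ewma_update(int old_pct, int nframes, int nbad, int smoothing_pct)
{
	int sample_pct;

	assert(nframes > 0 && nbad >= 0 && nbad <= nframes);
	assert(smoothing_pct >= 0 && smoothing_pct <= 100);

	/* Success percentage for this transmission attempt. */
	sample_pct = ((nframes - nbad) * 100) / nframes;

	return ((old_pct * smoothing_pct +
	    sample_pct * (100 - smoothing_pct)) / 100);
}
```

A rate that looks cheap in average tx time but keeps dropping sub-frames
will see this value decay towards zero, which is the signal used to avoid
picking it.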

Tested:

* AR9380, STA mode
* AR9344, STA mode
* AR9580, STA/AP mode
2020-05-15 18:51:20 +00:00
..
amd64 vmm(4), bhyve(8): Expose kernel-emulated special devices to userspace 2020-05-15 15:54:22 +00:00
arm Revert r360944 and r360946 until reported issues can be resolved 2020-05-12 04:34:26 +00:00
arm64 Remove arm64_idcache_wbinv_range as it's unused. 2020-05-15 13:33:48 +00:00
bsm bsm: add AUE_CLOSERANGE 2020-04-24 01:27:25 +00:00
cam Add nvd alias back to nda now that it actually works. 2020-05-13 19:17:35 +00:00
cddl Avoid the GEOM topology lock recursion when we automatically expand a pool. 2020-04-25 21:45:31 +00:00
compat linuxkpi: Add EBADRQC to errno.h 2020-05-13 07:49:12 +00:00
conf Remove tests for obsolete compilers in the build system 2020-05-12 15:22:40 +00:00
contrib [ath_hal_ar9300] Ensure AH_BYTE_ORDER is defined before used. 2020-05-12 02:23:11 +00:00
crypto Remove MD5 HMAC from OCF. 2020-05-11 22:08:08 +00:00
ddb kernel: provide panicky version of __unreachable 2020-05-13 18:07:37 +00:00
dev [ath] [ath_rate] Extend ath_rate_sample to better handle 11n rates and aggregates. 2020-05-15 18:51:20 +00:00
dts allwinner: aw_thermal: Cope with DTS changes 2020-04-14 19:05:17 +00:00
fs Remove unused header for DES. 2020-05-13 18:35:02 +00:00
gdb Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) 2020-02-26 14:26:36 +00:00
geom Reimplement aliases in geom 2020-05-13 19:17:28 +00:00
gnu dts: Import DTS from Linux 5.6 2020-04-14 18:57:00 +00:00
i386 Fix the i386 build after r361033. 2020-05-14 17:56:44 +00:00
isa sc(4) md bits: stop setting sc->kbd entirely 2019-12-30 02:07:55 +00:00
kern Improve comment for compat32 handling of sysctl hw.pagesizes. 2020-05-15 13:53:10 +00:00
kgssapi Remove support for Kernel GSS algorithms deprecated in r348875. 2020-04-10 23:08:41 +00:00
libkern Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) 2020-02-26 14:26:36 +00:00
mips Revert r360944 and r360946 until reported issues can be resolved 2020-05-12 04:34:26 +00:00
modules Remove tests for obsolete compilers in the build system 2020-05-12 15:22:40 +00:00
net kernel: provide panicky version of __unreachable 2020-05-13 18:07:37 +00:00
net80211 [net80211] Use the unicast key when transmitting DWDS AP multicast frames. 2020-05-08 17:01:33 +00:00
netgraph Add space for RSSI in data member. 2020-05-09 14:15:44 +00:00
netinet Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets. 2020-05-15 14:06:37 +00:00
netinet6 IPv6: Fix a panic in the nd6 code with unmapped mbufs. 2020-05-12 17:18:44 +00:00
netipsec Don't pass bogus keys down for NULL algorithms. 2020-05-02 01:00:29 +00:00
netpfil pf: Don't allocate per-table entry counters unless required. 2020-05-11 18:47:38 +00:00
netsmb Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) 2020-02-26 14:26:36 +00:00
nfs Remove rtable dumping code from bootp. 2020-04-28 07:23:41 +00:00
nfsclient
nfsserver
nlm Make nfslockd depend on xdr. 2020-04-23 09:37:22 +00:00
ofed Convert OFED rtable interactions to the new routing KPI. 2020-04-15 13:06:55 +00:00
opencrypto Trim a few more things I missed from xform_enc.h. 2020-05-13 18:36:02 +00:00
powerpc Revert r360944 and r360946 until reported issues can be resolved 2020-05-12 04:34:26 +00:00
riscv riscv: Fix pmap_protect for superpages 2020-05-13 17:20:51 +00:00
rpc Split XDR into separate kernel module. Make krpc depend on xdr. 2020-04-17 06:04:20 +00:00
security audit_canon_path_vp: don't panic if cdir == NULL 2020-04-17 02:09:31 +00:00
sys kernel: provide panicky version of __unreachable 2020-05-13 18:07:37 +00:00
teken
tests Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) 2020-02-26 14:26:36 +00:00
tools vfs: stop null checking routines in vop wrappers 2020-01-26 00:41:38 +00:00
ufs Retire two unused background fsck sysctls. 2020-04-21 17:42:32 +00:00
vm Allocate UMA per-CPU counters earlier. 2020-05-14 16:06:54 +00:00
x86 Call acpi_pxm_set_proximity_info() slightly earlier on x86. 2020-05-14 16:07:27 +00:00
xdr Split XDR into separate kernel module. Make krpc depend on xdr. 2020-04-17 06:04:20 +00:00
xen Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (16 of many) 2020-02-25 19:04:39 +00:00
Makefile Remove sparc64 kernel support 2020-02-03 17:35:11 +00:00