673d798918
implementing parallel TX and TX/RX completion can be done without simply abusing long-held locks. Right now, multiple concurrent ath_start() entries can result in frames being dequeued out of order. Well, they're dequeued in order fine, but if there's any preemption or race between CPUs between: * removing the frame from the ifnet, and * calling and runningath_tx_start(), until the frame is placed on a software or hardware TXQ Then although dequeueing the frame is in-order, queueing it to the hardware may be out of order. This is solved in a lot of other drivers by just holding a TX lock over a rather long period of time. This lets them continue to direct dispatch without races between dequeue and hardware queue. Note to observers: if_transmit() doesn't necessarily solve this. It removes the ifnet from the main path, but the same issue exists if there's some intermediary queue (eg a bufring, which as an aside also may pull in ifnet when you're using ALTQ.) So, until I can sit down and code up a much better way of doing parallel TX, I'm going to leave the TX path using a deferred taskqueue task. What I will likely head towards is doing a direct dispatch to hardware or software via if_transmit(), but it'll require some driver changes to allow queues to be made without using the really large ath_buf / ath_desc entries. TODO: * Look at how feasible it'll be to just do direct dispatch to ath_tx_start() from if_transmit(), avoiding doing _any_ intermediary serialisation into a global queue. This may break ALTQ for example, so I have to be delicate. * It's quite likely that I should break up ath_tx_start() so it deposits frames onto the software queues first, and then only fill in the 802.11 fields when it's being queued to the hardware. That will make the if_transmit() -> software queue path very quick and lightweight. * This has some very bad behaviour when using ACPI and Cx states. I'll do some subsequent analysis using KTR and schedgraph and file a follow-up PR or two. PR: kern/168649