57bb282532
- Only update the rx ring consumer pointer after running through the rx loop,
  not with each iteration through the loop.
- If possible, use a fast interrupt handler instead of an ithread handler. Use
  the interrupt handler to check and squelch the interrupt, then schedule a
  taskqueue to do the actual work. This has three benefits:
  - Eliminates the 'interrupt aliasing' problem found in many chipsets by
    allowing the driver to mask the interrupt in the NIC instead of the OS
    masking the interrupt in the APIC.
  - Allows the driver to control the amount of work done in the interrupt
    handler. This results in what I call 'adaptive polling', where you get
    the latency benefits of a quick response to interrupts with the
    interrupt mitigation and work partitioning of polling. Polling is still
    an option in the driver, but I consider it orthogonal to this work.
  - Don't hold the driver lock in the RX handler. The handler and all data
    associated with it are effectively serialized already. This eliminates
    the cost of dropping and reacquiring the lock for every received packet.
    The result is much lower contention for the driver lock, resulting in
    lower CPU usage and lower latency for interactive workloads.

The amount of work done in the taskqueue is controlled by the sysctl
  dev.em.N.rx_processing_limit
and the tunable
  hw.em.rx_process_limit
Setting these to -1 effectively removes the limit. The fast interrupt and
taskqueue can be disabled by defining NO_EM_FASTINTR.

This work has been shown to increase fast-forwarding from ~570 kpps to
~750 kpps (note that the same NIC hardware seems unable to transmit more
than 800 kpps, so this increase appears to be limited almost solely by the
hardware). Gains have been shown in other workloads, ranging from better
performance to elimination of over-saturation livelocks.

Thanks to Andre Opperman for his time and resources from his network
performance project in performing much of the testing.
Thanks to Gleb Smirnoff and Danny Braniss for their help in testing also.
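
For readers who want the control flow spelled out, here is a minimal
user-space sketch of the pattern described above. It is not the driver code:
struct fake_nic and the functions fast_intr, rx_loop and rx_task are
hypothetical names invented for illustration, and the NIC is modeled as a
plain array. It only shows the three ideas together: mask in the fast
handler and defer, bound the work per task run with a limit where -1 means
no limit, and write the consumer pointer once after the loop.

```c
#define RING_SIZE 8

/* Hypothetical stand-in for per-device state; not the real em(4) softc. */
struct fake_nic {
    int ring[RING_SIZE];   /* 1 = descriptor holds a received packet */
    int next_to_check;     /* software position in the rx ring */
    int hw_consumer;       /* consumer pointer as the "hardware" sees it */
    int hw_writes;         /* number of consumer pointer updates issued */
    int intr_masked;       /* 1 = interrupt masked in the NIC itself */
    int task_pending;      /* 1 = deferred task is scheduled */
    int rx_process_limit;  /* max packets per task run, -1 = no limit */
};

/* Fast handler: squelch the interrupt in the NIC, then defer the work. */
static void fast_intr(struct fake_nic *nic)
{
    nic->intr_masked = 1;
    nic->task_pending = 1;
}

/*
 * Process up to 'count' packets (a negative count never reaches zero, so
 * it acts as "no limit"). The consumer pointer is written once, after the
 * loop, rather than once per packet. Returns nonzero if packets remain.
 */
static int rx_loop(struct fake_nic *nic, int count)
{
    int i = nic->next_to_check;
    int done = 0;

    while (nic->ring[i] && count != 0) {
        nic->ring[i] = 0;              /* "receive" the packet */
        i = (i + 1) % RING_SIZE;
        done++;
        if (count > 0)
            count--;
    }
    nic->next_to_check = i;
    if (done > 0) {
        nic->hw_consumer = i;          /* single update for the whole batch */
        nic->hw_writes++;
    }
    return nic->ring[i];
}

/* Deferred task: do bounded work, reschedule if more remains. */
static void rx_task(struct fake_nic *nic)
{
    nic->task_pending = 0;
    if (rx_loop(nic, nic->rx_process_limit)) {
        nic->task_pending = 1;         /* more work: stay masked, run again */
        return;
    }
    nic->intr_masked = 0;              /* ring drained: unmask in the NIC */
}
```

To try it, mark a few ring entries as full, call fast_intr() once, and call
rx_task() until task_pending is clear. With a limit of 2 and five queued
packets the task runs three times and issues three consumer pointer updates
instead of five; with a limit of -1 it drains the ring in a single run.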