86757c2c3e
Use case scenario:
1) Thread 1 is enqueuing. It reads prod.head and gets stalled for some
reason (running out of CPU time, preempted, ...)
2) Thread 2 is enqueuing. It succeeds in enqueuing and moves prod.head
forward.
3) Thread 3 is dequeuing. It succeeds in dequeuing and moves the cons.tail
beyond the prod.head read by thread 1.
4) Thread 1 is re-scheduled. It reads cons.tail.
cpu1(producer)      cpu2(producer)          cpu3(consumer)
load r->prod.head
     ^              load r->prod.head
     |              load r->cons.tail
     |              store r->prod.head(+n)
  stalled           <-- enqueue ----->
     |              store r->prod.tail(+n)
     |                                      load r->cons.head
     |                                      load r->prod.tail
     |                                      store r->cons.head(+n)
     |                                      <...dequeue.....>
     v                                      store r->cons.tail(+n)
load r->cons.tail
For thread 1, the __atomic_compare_exchange_n detects the outdated
prod.head and retries the flow with the new one. This retry flow works on
strongly ordered platforms (e.g. x86). On weakly ordered platforms (arm,
ppc), however, the loads of cons.tail and prod.head might be reordered:
prod.head is new but cons.tail is too old. The retry flow, which relies on
detecting an outdated head, then does not trigger as expected, and the
outdated cons.tail causes a wrong free_entries.
Similarly, for dequeuing, an outdated prod.tail leads to a wrong
avail_entries.
The fix is to keep a deterministic order of the two loads, allowing the
retry to work.
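
The ordering described above can be sketched as follows. This is a minimal
illustration, not the patched DPDK code: struct sketch_ring, free_entries()
and move_prod_head() are hypothetical names, and the formula
capacity + cons_tail - prod_head is an assumption about how the free space
is computed. The acquire fence sits between the load (or CAS reload) of
prod.head and the load of cons.tail, so the tail can never be older than
the head it is paired with.

```c
#include <stdint.h>

/* Hypothetical, simplified ring; not the real struct rte_ring layout. */
struct headtail {
	uint32_t head;
	uint32_t tail;
};

struct sketch_ring {
	uint32_t capacity;
	struct headtail prod;
	struct headtail cons;
};

/* Free slots seen by a producer (assumed formula).
 * Unsigned arithmetic wraps modulo 2^32. */
static uint32_t
free_entries(uint32_t capacity, uint32_t cons_tail, uint32_t prod_head)
{
	return capacity + cons_tail - prod_head;
}

/* Try to reserve n slots. Returns n on success, 0 if there is no room.
 * On return *old_head holds the last prod.head value that was observed. */
static uint32_t
move_prod_head(struct sketch_ring *r, uint32_t n, uint32_t *old_head)
{
	uint32_t cons_tail;

	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
	do {
		/* Keep the load of cons.tail ordered after the load of
		 * prod.head, including the reload done by a failed CAS. */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);

		if (n > free_entries(r->capacity, cons_tail, *old_head))
			return 0;

		/* A failed CAS writes the current prod.head into *old_head,
		 * and the loop retries with the fresh value. */
	} while (!__atomic_compare_exchange_n(&r->prod.head, old_head,
			*old_head + n, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED));

	return n;
}
```

As an illustration of the failure mode under the same assumed formula: with
a capacity of 8, a prod.head of 10 and an up-to-date cons.tail of 4, there
are 8 + 4 - 10 = 2 free entries. Pairing the same head with a stale
cons.tail of 0 gives 8 + 0 - 10, which wraps to 4294967294 in 32-bit
unsigned arithmetic.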
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core, 4 threads/core, 2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$ sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
  --socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.64
MP/MC bulk enq/dequeue (size: 8): 9.58
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 2.30
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
The results show that the thread fence degrades performance slightly, but
it is required for correctness.
Fixes: