Commit Graph

8 Commits

Author SHA1 Message Date
Gavin Hu
49594a6314 ring/c11: relax ordering for load and store of the head
When calling __atomic_compare_exchange_n, use relaxed ordering for the
success case, as multiple producers/consumers do not release updates to
each other so no need for acquire or release ordering.

Because the thread fence in place, ordering for the first iteration can
be relaxed.

Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core,4 threads/core,2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i

Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34

With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.59
MP/MC bulk enq/dequeue (size: 8): 10.54
SP/SC bulk enq/dequeue (size: 32): 1.73
MP/MC bulk enq/dequeue (size: 32): 2.38

No significant improvement, nor regression was seen, as the optimisation
is not at the critical path.

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-13 17:00:58 +01:00
Gavin Hu
86757c2c3e ring/c11: keep deterministic order allowing retry to work
Use case scenario:
1) Thread 1 is enqueuing. It reads prod.head and gets stalled for some
   reasons (running out of cpu time, preempted,...)
2) Thread 2 is enqueuing. It succeeds in enqueuing and moves prod.head
   forward.
3) Thread 3 is dequeuing. It succeeds in dequeuing and moves the cons.tail
   beyond the prod.head read by thread 1.
4) Thread 1 is re-scheduled. It reads cons.tail.

cpu1(producer)      cpu2(producer)          cpu3(consumer)
load r->prod.head
    ^               load r->prod.head
    |               load r->cons.tail
    |               store r->prod.head(+n)
  stalled           <-- enqueue ----->
    |               store r->prod.tail(+n)
    |                                        load r->cons.head
    |                                        load r->prod.tail
    |                                        store r->cons.head(+n)
    |                                        <...dequeue.....>
    v                                        store r->cons.tail(+n)
load r->cons.tail

For thread 1, the __atomic_compare_exchange_n detects the outdated
prod.head and retry the flow with the new one. This retry flow works ok on
strong ordering platform(eg:x86). But for weak ordering platforms(arm,
ppc), loading cons.tail and prod.head might be re-ordered, prod.head is new
but cons.tail becomes too old, the retry flow, based on the detection of
outdated head, does not trigger as expected, thus the outdate cons.tail
causes wrong free_entries.

Similarly, for dequeuing, outdated prod.tail leads to wrong avail_entries.

The fix is to keep the deterministic order of two loads allowing the retry
to work.

Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core, 4 threads/core, 2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i

Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.64
MP/MC bulk enq/dequeue (size: 8): 9.58
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 2.30

With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34

The results showed the thread fence degrade the performance slightly, but
it is required for correctness.

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-13 16:57:58 +01:00
Gavin Hu
047adc1724 ring/c11: move atomic load of head above the loop
In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
the do {} while loop as upon failure the old_head will be updated,
another load is costly and not necessary.

This helps a little on the latency,about 1~5%.

 Test result with the patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-05 14:34:27 +01:00
Gavin Hu
9ed8770628 ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-05 14:34:19 +01:00
Andy Green
44c41b8577 ring: fix declaration after statement
On gcc 5.4.0 / native aarch64 from Ubuntu 16.04:

In function '__rte_ring_move_prod_head':
rte_ring_c11_mem.h:69:3: warning:
ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
   const uint32_t cons_tail = r->cons.tail;
   ^

In function '__rte_ring_move_cons_head':
rte_ring_c11_mem.h:136:3: warning:
ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
   const uint32_t prod_tail = r->prod.tail;
   ^

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Andy Green <andy@warmcat.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-07-26 16:11:51 +02:00
Andy Green
e8ed5056c8 ring: remove signed type flip-flopping
GCC 8.1 warns:

rte_ring.h:350:46:
warning: conversion to 'uint32_t' {aka 'unsigned int'}
from 'int' may change the sign of the result
[-Wsign-conversion]
  update_tail(&r->prod, prod_head, prod_next, is_sp, 1);

The visible apis take unsigned int, then call a private
api taking an int, which finally calls an api taking an
unsigned int.

Convert the private api to take unsigned int removing
5 x warning similar to that shown above.

Fixes: 0dfc98c507 ("ring: separate out head index manipulation")
Cc: stable@dpdk.org

Signed-off-by: Andy Green <andy@warmcat.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-05-21 00:20:16 +02:00
Jia He
9ffdaaa122 ring: convert license headers to SPDX tags
Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
2018-02-01 01:46:45 +01:00
Jia He
39368ebfc6 ring: introduce C11 memory model barrier option
This patch is to support C11 memory model barrier in librte_ring.

There are 2 barrier implementation options in librte_ring (suggested
by Jerin).
1. use rte_smp_rmb
2. use load_acquire/store_release(refer to [1]).
The reason why providing 2 options is the performance benchmark
difference in different arm machines, refer to [2].

CONFIG_RTE_RING_USE_C11_MEM_MODEL is provided, and by default it is "n"
on any architectures and only "y" on arm64 so far.

[1] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[2] http://dpdk.org/ml/archives/dev/2017-October/080861.html

Suggested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Jianbo Liu <jianbo.liu@arm.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
2018-01-29 16:10:20 +01:00