2018-01-21 23:13:48 -08:00
|
|
|
/* SPDX-License-Identifier: BSD-3-Clause
|
2018-01-21 20:41:28 -08:00
|
|
|
*
|
2018-01-21 23:13:48 -08:00
|
|
|
* Copyright (c) 2017,2018 HXT-semitech Corporation.
|
2018-01-21 20:41:28 -08:00
|
|
|
* Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
|
|
|
|
* All rights reserved.
|
2018-01-21 23:13:48 -08:00
|
|
|
* Derived from FreeBSD's bufring.h
|
|
|
|
* Used as BSD-3 Licensed with permission from Kip Macy.
|
|
|
|
*/
|
2018-01-21 20:41:28 -08:00
|
|
|
|
|
|
|
#ifndef _RTE_RING_C11_MEM_H_
|
|
|
|
#define _RTE_RING_C11_MEM_H_
|
|
|
|
|
|
|
|
static __rte_always_inline void
|
|
|
|
update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
|
|
|
|
uint32_t single, uint32_t enqueue)
|
|
|
|
{
|
|
|
|
RTE_SET_USED(enqueue);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there are other enqueues/dequeues in progress that preceded us,
|
|
|
|
* we need to wait for them to complete
|
|
|
|
*/
|
|
|
|
if (!single)
|
|
|
|
while (unlikely(ht->tail != old_val))
|
|
|
|
rte_pause();
|
|
|
|
|
|
|
|
__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* @internal This function updates the producer head for enqueue
|
|
|
|
*
|
|
|
|
* @param r
|
|
|
|
* A pointer to the ring structure
|
|
|
|
* @param is_sp
|
|
|
|
* Indicates whether multi-producer path is needed or not
|
|
|
|
* @param n
|
|
|
|
* The number of elements we will want to enqueue, i.e. how far should the
|
|
|
|
* head be moved
|
|
|
|
* @param behavior
|
|
|
|
* RTE_RING_QUEUE_FIXED: Enqueue a fixed number of items from a ring
|
|
|
|
* RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
|
|
|
|
* @param old_head
|
|
|
|
* Returns head value as it was before the move, i.e. where enqueue starts
|
|
|
|
* @param new_head
|
|
|
|
* Returns the current/new head value i.e. where enqueue finishes
|
|
|
|
* @param free_entries
|
|
|
|
* Returns the amount of free space in the ring BEFORE head was moved
|
|
|
|
* @return
|
|
|
|
* Actual number of objects enqueued.
|
|
|
|
* If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
|
|
|
|
*/
|
|
|
|
static __rte_always_inline unsigned int
|
2018-05-17 21:49:22 +08:00
|
|
|
__rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
|
2018-01-21 20:41:28 -08:00
|
|
|
unsigned int n, enum rte_ring_queue_behavior behavior,
|
|
|
|
uint32_t *old_head, uint32_t *new_head,
|
|
|
|
uint32_t *free_entries)
|
|
|
|
{
|
|
|
|
const uint32_t capacity = r->capacity;
|
ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.
The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.
The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i
Test result with this patch(two cores):
SP/SC bulk enq/dequeue (size: 8): 5.86
MP/MC bulk enq/dequeue (size: 8): 10.15
SP/SC bulk enq/dequeue (size: 32): 1.94
MP/MC bulk enq/dequeue (size: 32): 2.36
In comparison of the test result without this patch:
SP/SC bulk enq/dequeue (size: 8): 6.67
MP/MC bulk enq/dequeue (size: 8): 13.12
SP/SC bulk enq/dequeue (size: 32): 2.04
MP/MC bulk enq/dequeue (size: 32): 3.26
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-02 19:21:27 +08:00
|
|
|
uint32_t cons_tail;
|
2018-01-21 20:41:28 -08:00
|
|
|
unsigned int max = n;
|
|
|
|
int success;
|
|
|
|
|
ring/c11: relax ordering for load and store of the head
When calling __atomic_compare_exchange_n, use relaxed ordering for the
success case, as multiple producers/consumers do not release updates to
each other so no need for acquire or release ordering.
Because the thread fence in place, ordering for the first iteration can
be relaxed.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core,4 threads/core,2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.59
MP/MC bulk enq/dequeue (size: 8): 10.54
SP/SC bulk enq/dequeue (size: 32): 1.73
MP/MC bulk enq/dequeue (size: 32): 2.38
No significant improvement, nor regression was seen, as the optimisation
is not at the critical path.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-09 19:42:47 +08:00
|
|
|
*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
|
2018-01-21 20:41:28 -08:00
|
|
|
do {
|
|
|
|
/* Reset n to the initial burst count */
|
|
|
|
n = max;
|
|
|
|
|
ring/c11: keep deterministic order allowing retry to work
Use case scenario:
1) Thread 1 is enqueuing. It reads prod.head and gets stalled for some
reasons (running out of cpu time, preempted,...)
2) Thread 2 is enqueuing. It succeeds in enqueuing and moves prod.head
forward.
3) Thread 3 is dequeuing. It succeeds in dequeuing and moves the cons.tail
beyond the prod.head read by thread 1.
4) Thread 1 is re-scheduled. It reads cons.tail.
cpu1(producer) cpu2(producer) cpu3(consumer)
load r->prod.head
^ load r->prod.head
| load r->cons.tail
| store r->prod.head(+n)
stalled <-- enqueue ----->
| store r->prod.tail(+n)
| load r->cons.head
| load r->prod.tail
| store r->cons.head(+n)
| <...dequeue.....>
v store r->cons.tail(+n)
load r->cons.tail
For thread 1, the __atomic_compare_exchange_n detects the outdated
prod.head and retry the flow with the new one. This retry flow works ok on
strong ordering platform(eg:x86). But for weak ordering platforms(arm,
ppc), loading cons.tail and prod.head might be re-ordered, prod.head is new
but cons.tail becomes too old, the retry flow, based on the detection of
outdated head, does not trigger as expected, thus the outdate cons.tail
causes wrong free_entries.
Similarly, for dequeuing, outdated prod.tail leads to wrong avail_entries.
The fix is to keep the deterministic order of two loads allowing the retry
to work.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core, 4 threads/core, 2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.64
MP/MC bulk enq/dequeue (size: 8): 9.58
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 2.30
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
The results showed the thread fence degrade the performance slightly, but
it is required for correctness.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-09 19:42:46 +08:00
|
|
|
/* Ensure the head is read before tail */
|
|
|
|
__atomic_thread_fence(__ATOMIC_ACQUIRE);
|
|
|
|
|
ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.
The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.
The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i
Test result with this patch(two cores):
SP/SC bulk enq/dequeue (size: 8): 5.86
MP/MC bulk enq/dequeue (size: 8): 10.15
SP/SC bulk enq/dequeue (size: 32): 1.94
MP/MC bulk enq/dequeue (size: 32): 2.36
In comparison of the test result without this patch:
SP/SC bulk enq/dequeue (size: 8): 6.67
MP/MC bulk enq/dequeue (size: 8): 13.12
SP/SC bulk enq/dequeue (size: 32): 2.04
MP/MC bulk enq/dequeue (size: 32): 3.26
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-02 19:21:27 +08:00
|
|
|
/* load-acquire synchronize with store-release of ht->tail
|
|
|
|
* in update_tail.
|
|
|
|
*/
|
|
|
|
cons_tail = __atomic_load_n(&r->cons.tail,
|
|
|
|
__ATOMIC_ACQUIRE);
|
|
|
|
|
|
|
|
/* The subtraction is done between two unsigned 32bits value
|
2018-01-21 20:41:28 -08:00
|
|
|
* (the result is always modulo 32 bits even if we have
|
|
|
|
* *old_head > cons_tail). So 'free_entries' is always between 0
|
|
|
|
* and capacity (which is < size).
|
|
|
|
*/
|
ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.
The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.
The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i
Test result with this patch(two cores):
SP/SC bulk enq/dequeue (size: 8): 5.86
MP/MC bulk enq/dequeue (size: 8): 10.15
SP/SC bulk enq/dequeue (size: 32): 1.94
MP/MC bulk enq/dequeue (size: 32): 2.36
In comparison of the test result without this patch:
SP/SC bulk enq/dequeue (size: 8): 6.67
MP/MC bulk enq/dequeue (size: 8): 13.12
SP/SC bulk enq/dequeue (size: 32): 2.04
MP/MC bulk enq/dequeue (size: 32): 3.26
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-02 19:21:27 +08:00
|
|
|
*free_entries = (capacity + cons_tail - *old_head);
|
2018-01-21 20:41:28 -08:00
|
|
|
|
|
|
|
/* check that we have enough room in ring */
|
|
|
|
if (unlikely(n > *free_entries))
|
|
|
|
n = (behavior == RTE_RING_QUEUE_FIXED) ?
|
|
|
|
0 : *free_entries;
|
|
|
|
|
|
|
|
if (n == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
*new_head = *old_head + n;
|
|
|
|
if (is_sp)
|
|
|
|
r->prod.head = *new_head, success = 1;
|
|
|
|
else
|
2018-11-02 19:21:28 +08:00
|
|
|
/* on failure, *old_head is updated */
|
2018-01-21 20:41:28 -08:00
|
|
|
success = __atomic_compare_exchange_n(&r->prod.head,
|
|
|
|
old_head, *new_head,
|
ring/c11: relax ordering for load and store of the head
When calling __atomic_compare_exchange_n, use relaxed ordering for the
success case, as multiple producers/consumers do not release updates to
each other so no need for acquire or release ordering.
Because the thread fence in place, ordering for the first iteration can
be relaxed.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core,4 threads/core,2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.59
MP/MC bulk enq/dequeue (size: 8): 10.54
SP/SC bulk enq/dequeue (size: 32): 1.73
MP/MC bulk enq/dequeue (size: 32): 2.38
No significant improvement, nor regression was seen, as the optimisation
is not at the critical path.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-09 19:42:47 +08:00
|
|
|
0, __ATOMIC_RELAXED,
|
2018-01-21 20:41:28 -08:00
|
|
|
__ATOMIC_RELAXED);
|
|
|
|
} while (unlikely(success == 0));
|
|
|
|
return n;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* @internal This function updates the consumer head for dequeue
|
|
|
|
*
|
|
|
|
* @param r
|
|
|
|
* A pointer to the ring structure
|
|
|
|
* @param is_sc
|
|
|
|
* Indicates whether multi-consumer path is needed or not
|
|
|
|
* @param n
|
|
|
|
* The number of elements we will want to enqueue, i.e. how far should the
|
|
|
|
* head be moved
|
|
|
|
* @param behavior
|
|
|
|
* RTE_RING_QUEUE_FIXED: Dequeue a fixed number of items from a ring
|
|
|
|
* RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
|
|
|
|
* @param old_head
|
|
|
|
* Returns head value as it was before the move, i.e. where dequeue starts
|
|
|
|
* @param new_head
|
|
|
|
* Returns the current/new head value i.e. where dequeue finishes
|
|
|
|
* @param entries
|
|
|
|
* Returns the number of entries in the ring BEFORE head was moved
|
|
|
|
* @return
|
|
|
|
* - Actual number of objects dequeued.
|
|
|
|
* If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
|
|
|
|
*/
|
|
|
|
static __rte_always_inline unsigned int
|
|
|
|
__rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
|
|
|
|
unsigned int n, enum rte_ring_queue_behavior behavior,
|
|
|
|
uint32_t *old_head, uint32_t *new_head,
|
|
|
|
uint32_t *entries)
|
|
|
|
{
|
|
|
|
unsigned int max = n;
|
ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.
The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.
The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i
Test result with this patch(two cores):
SP/SC bulk enq/dequeue (size: 8): 5.86
MP/MC bulk enq/dequeue (size: 8): 10.15
SP/SC bulk enq/dequeue (size: 32): 1.94
MP/MC bulk enq/dequeue (size: 32): 2.36
In comparison of the test result without this patch:
SP/SC bulk enq/dequeue (size: 8): 6.67
MP/MC bulk enq/dequeue (size: 8): 13.12
SP/SC bulk enq/dequeue (size: 32): 2.04
MP/MC bulk enq/dequeue (size: 32): 3.26
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-02 19:21:27 +08:00
|
|
|
uint32_t prod_tail;
|
2018-01-21 20:41:28 -08:00
|
|
|
int success;
|
|
|
|
|
|
|
|
/* move cons.head atomically */
|
ring/c11: relax ordering for load and store of the head
When calling __atomic_compare_exchange_n, use relaxed ordering for the
success case, as multiple producers/consumers do not release updates to
each other so no need for acquire or release ordering.
Because the thread fence in place, ordering for the first iteration can
be relaxed.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core,4 threads/core,2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.59
MP/MC bulk enq/dequeue (size: 8): 10.54
SP/SC bulk enq/dequeue (size: 32): 1.73
MP/MC bulk enq/dequeue (size: 32): 2.38
No significant improvement, nor regression was seen, as the optimisation
is not at the critical path.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-09 19:42:47 +08:00
|
|
|
*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
|
2018-01-21 20:41:28 -08:00
|
|
|
do {
|
|
|
|
/* Restore n as it may change every loop */
|
|
|
|
n = max;
|
ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.
The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.
The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i
Test result with this patch(two cores):
SP/SC bulk enq/dequeue (size: 8): 5.86
MP/MC bulk enq/dequeue (size: 8): 10.15
SP/SC bulk enq/dequeue (size: 32): 1.94
MP/MC bulk enq/dequeue (size: 32): 2.36
In comparison of the test result without this patch:
SP/SC bulk enq/dequeue (size: 8): 6.67
MP/MC bulk enq/dequeue (size: 8): 13.12
SP/SC bulk enq/dequeue (size: 32): 2.04
MP/MC bulk enq/dequeue (size: 32): 3.26
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-02 19:21:27 +08:00
|
|
|
|
ring/c11: keep deterministic order allowing retry to work
Use case scenario:
1) Thread 1 is enqueuing. It reads prod.head and gets stalled for some
reasons (running out of cpu time, preempted,...)
2) Thread 2 is enqueuing. It succeeds in enqueuing and moves prod.head
forward.
3) Thread 3 is dequeuing. It succeeds in dequeuing and moves the cons.tail
beyond the prod.head read by thread 1.
4) Thread 1 is re-scheduled. It reads cons.tail.
cpu1(producer) cpu2(producer) cpu3(consumer)
load r->prod.head
^ load r->prod.head
| load r->cons.tail
| store r->prod.head(+n)
stalled <-- enqueue ----->
| store r->prod.tail(+n)
| load r->cons.head
| load r->prod.tail
| store r->cons.head(+n)
| <...dequeue.....>
v store r->cons.tail(+n)
load r->cons.tail
For thread 1, the __atomic_compare_exchange_n detects the outdated
prod.head and retry the flow with the new one. This retry flow works ok on
strong ordering platform(eg:x86). But for weak ordering platforms(arm,
ppc), loading cons.tail and prod.head might be re-ordered, prod.head is new
but cons.tail becomes too old, the retry flow, based on the detection of
outdated head, does not trigger as expected, thus the outdate cons.tail
causes wrong free_entries.
Similarly, for dequeuing, outdated prod.tail leads to wrong avail_entries.
The fix is to keep the deterministic order of two loads allowing the retry
to work.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core, 4 threads/core, 2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.64
MP/MC bulk enq/dequeue (size: 8): 9.58
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 2.30
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
The results showed the thread fence degrade the performance slightly, but
it is required for correctness.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-09 19:42:46 +08:00
|
|
|
/* Ensure the head is read before tail */
|
|
|
|
__atomic_thread_fence(__ATOMIC_ACQUIRE);
|
|
|
|
|
ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.
The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.
The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i
Test result with this patch(two cores):
SP/SC bulk enq/dequeue (size: 8): 5.86
MP/MC bulk enq/dequeue (size: 8): 10.15
SP/SC bulk enq/dequeue (size: 32): 1.94
MP/MC bulk enq/dequeue (size: 32): 2.36
In comparison of the test result without this patch:
SP/SC bulk enq/dequeue (size: 8): 6.67
MP/MC bulk enq/dequeue (size: 8): 13.12
SP/SC bulk enq/dequeue (size: 32): 2.04
MP/MC bulk enq/dequeue (size: 32): 3.26
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-02 19:21:27 +08:00
|
|
|
/* this load-acquire synchronize with store-release of ht->tail
|
|
|
|
* in update_tail.
|
|
|
|
*/
|
|
|
|
prod_tail = __atomic_load_n(&r->prod.tail,
|
|
|
|
__ATOMIC_ACQUIRE);
|
|
|
|
|
2018-01-21 20:41:28 -08:00
|
|
|
/* The subtraction is done between two unsigned 32bits value
|
|
|
|
* (the result is always modulo 32 bits even if we have
|
|
|
|
* cons_head > prod_tail). So 'entries' is always between 0
|
|
|
|
* and size(ring)-1.
|
|
|
|
*/
|
ring/c11: synchronize load and store of the tail
Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.
The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.
The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i
Test result with this patch(two cores):
SP/SC bulk enq/dequeue (size: 8): 5.86
MP/MC bulk enq/dequeue (size: 8): 10.15
SP/SC bulk enq/dequeue (size: 32): 1.94
MP/MC bulk enq/dequeue (size: 32): 2.36
In comparison of the test result without this patch:
SP/SC bulk enq/dequeue (size: 8): 6.67
MP/MC bulk enq/dequeue (size: 8): 13.12
SP/SC bulk enq/dequeue (size: 32): 2.04
MP/MC bulk enq/dequeue (size: 32): 3.26
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2018-11-02 19:21:27 +08:00
|
|
|
*entries = (prod_tail - *old_head);
|
2018-01-21 20:41:28 -08:00
|
|
|
|
|
|
|
/* Set the actual entries for dequeue */
|
|
|
|
if (n > *entries)
|
|
|
|
n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
|
|
|
|
|
|
|
|
if (unlikely(n == 0))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
*new_head = *old_head + n;
|
|
|
|
if (is_sc)
|
|
|
|
r->cons.head = *new_head, success = 1;
|
|
|
|
else
|
2018-11-02 19:21:28 +08:00
|
|
|
/* on failure, *old_head will be updated */
|
2018-01-21 20:41:28 -08:00
|
|
|
success = __atomic_compare_exchange_n(&r->cons.head,
|
|
|
|
old_head, *new_head,
|
ring/c11: relax ordering for load and store of the head
When calling __atomic_compare_exchange_n, use relaxed ordering for the
success case, as multiple producers/consumers do not release updates to
each other so no need for acquire or release ordering.
Because the thread fence in place, ordering for the first iteration can
be relaxed.
Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core,4 threads/core,2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i
Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34
With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.59
MP/MC bulk enq/dequeue (size: 8): 10.54
SP/SC bulk enq/dequeue (size: 32): 1.73
MP/MC bulk enq/dequeue (size: 32): 2.38
No significant improvement, nor regression was seen, as the optimisation
is not at the critical path.
Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
2018-11-09 19:42:47 +08:00
|
|
|
0, __ATOMIC_RELAXED,
|
2018-01-21 20:41:28 -08:00
|
|
|
__ATOMIC_RELAXED);
|
|
|
|
} while (unlikely(success == 0));
|
|
|
|
return n;
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* _RTE_RING_C11_MEM_H_ */
|