doc: describe optimizations using C11 atomic builtins
Add information about possible optimizations using C11 atomic builtins.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
parent 9683022930
commit 703a62a602
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,57 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Builtins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK generic rte_atomic operations are implemented by __sync builtins. These
+__sync builtins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by __atomic builtins that conform to
+the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic builtins might improve
+performance for aarch64 machines.
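To make the replacement described above concrete, here is a minimal sketch that performs the same add with a legacy `__sync` builtin and with a C11-style `__atomic` builtin. The counter variables and function names are hypothetical and exist only for this illustration; the builtins themselves are the GCC/Clang ones. The `__ATOMIC_SEQ_CST` argument is simply the strongest order, chosen here as a conservative starting point.

```c
#include <stdint.h>

/* Hypothetical counters, used only for this illustration. */
static uint64_t sync_cnt;
static uint64_t atomic_cnt;

/* Legacy style: the __sync builtin takes no memory order argument
 * and always acts as a full barrier. */
static inline uint64_t add_with_sync(uint64_t v)
{
	return __sync_add_and_fetch(&sync_cnt, v);
}

/* C11 style: the last argument names the memory order explicitly.
 * A weaker order can be substituted where the use case allows it. */
static inline uint64_t add_with_atomic(uint64_t v)
{
	return __atomic_add_fetch(&atomic_cnt, v, __ATOMIC_SEQ_CST);
}
```

Both functions return the updated value. Only the `__atomic` form lets the caller relax the ordering, which is where the potential aarch64 gain comes from.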
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example, the packet statistics counters need to be
+incremented atomically but do not need any particular memory ordering.
+So, RELAXED memory ordering is sufficient.
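The statistics-counter case can be sketched as follows. `struct port_stats` and both helper names are assumptions made for this example and are not DPDK definitions; the point is only that the increment and the read use `__ATOMIC_RELAXED`.

```c
#include <stdint.h>

/* Hypothetical per-port statistics; not a DPDK structure. */
struct port_stats {
	uint64_t rx_packets;
};

/* Atomic increment with no ordering constraint on other accesses. */
static inline void stats_count_rx(struct port_stats *s, uint64_t n)
{
	__atomic_fetch_add(&s->rx_packets, n, __ATOMIC_RELAXED);
}

/* Atomic read of the counter, again with no ordering constraint. */
static inline uint64_t stats_read_rx(struct port_stats *s)
{
	return __atomic_load_n(&s->rx_packets, __ATOMIC_RELAXED);
}
```

Concurrent increments are never lost, but a reader gets no guarantee about how the counter value relates in time to other memory locations, which is acceptable for statistics.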
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the spinlock lock are allowed to
+move to the critical section, but the memory operations in the critical section
+are not allowed to move above the lock. In this case, the full memory barrier
+in the compare-and-swap operation can be replaced with ACQUIRE memory order.
+On the other hand, the memory operations after the spinlock unlock are allowed
+to move to the critical section, but the memory operations in the critical
+section are not allowed to move below the unlock. So the full barrier in the
+store operation can use RELEASE memory order.
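A minimal spinlock sketch of the acquire/release pairing described above. `demo_spinlock_t` and its two functions are hypothetical names chosen for this illustration; this is not the rte_spinlock implementation, only one way the ordering could be expressed with the `__atomic` builtins.

```c
/* Hypothetical spinlock type: 0 = unlocked, 1 = locked. */
typedef struct {
	int locked;
} demo_spinlock_t;

static inline void demo_spinlock_lock(demo_spinlock_t *sl)
{
	int exp = 0;

	/* ACQUIRE on success: accesses in the critical section cannot
	 * move above the lock. RELAXED on failure: nothing was acquired. */
	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
		/* Wait with plain relaxed loads until the lock looks free. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
			;
		/* A failed compare-exchange overwrote exp; reset it. */
		exp = 0;
	}
}

static inline void demo_spinlock_unlock(demo_spinlock_t *sl)
{
	/* RELEASE: accesses in the critical section cannot move below
	 * the unlock. */
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

Neither operation is a full barrier, so accesses outside the critical section remain free to move into it, as the text allows.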
+
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+
+The payload or the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to guard
+variable is observable only after the store to payload is observable.
+
+Correspondingly, on the reader side, the guard variable should be read
+with ACQUIRE memory order. The payload or the data the writer communicated,
+can be read with RELAXED memory order. This ensures that, if the store to
+guard variable is observable, the store to payload is also observable.
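The guard/payload protocol above can be sketched for a single writer and a single reader as shown here. `struct msg_slot`, `slot_publish` and `slot_try_read` are hypothetical names introduced for this example, and the one-shot slot is a deliberate simplification.

```c
#include <stdint.h>

/* Hypothetical message slot shared by one writer and one reader. */
struct msg_slot {
	uint32_t payload; /* data being communicated */
	uint32_t guard;   /* 0 = empty, 1 = payload is valid */
};

/* Writer: relaxed store to the payload, then release store to the
 * guard, so the guard never becomes visible before the payload. */
static inline void slot_publish(struct msg_slot *m, uint32_t value)
{
	__atomic_store_n(&m->payload, value, __ATOMIC_RELAXED);
	__atomic_store_n(&m->guard, 1, __ATOMIC_RELEASE);
}

/* Reader: acquire load of the guard, then relaxed load of the payload.
 * Returns 1 and fills *value if a payload was published, else 0. */
static inline int slot_try_read(struct msg_slot *m, uint32_t *value)
{
	if (__atomic_load_n(&m->guard, __ATOMIC_ACQUIRE) == 0)
		return 0;
	*value = __atomic_load_n(&m->payload, __ATOMIC_RELAXED);
	return 1;
}
```

The release store and the acquire load on `guard` form the synchronization pair; the payload accesses can stay relaxed because they are ordered by that pair.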
+
 Coding Considerations
 ---------------------
 