doc: describe optimizations using C11 atomic builtins

Add information about possible optimizations using C11 atomic builtins.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
2020-07-17 18:14:35 +08:00 · 2020-07-17 18:14:35 +08:00 · 703a62a602
commit 703a62a602
parent 9683022930
1 changed files with 58 additions and 1 deletions
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+Locks
+~~~~~
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
@@ -176,6 +182,57 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
+Atomic Operations: Use C11 Atomic Builtins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+DPDK generic rte_atomic operations are implemented by __sync builtins. These
+__sync builtins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by __atomic builtins that conform to
+the C11 memory model and provide finer memory order control.
+So replacing the rte_atomic operations with __atomic builtins might improve
+performance for aarch64 machines.
+Some typical optimization cases are listed below:
+Atomicity
+^^^^^^^^^
+Some use cases require atomicity alone; the ordering of the memory operations
+does not matter. For example, the packet statistics counters need to be
+incremented atomically but do not need any particular memory ordering.
+So, RELAXED memory ordering is sufficient.
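As a minimal sketch of the statistics-counter case, assuming a compiler that provides the GCC/Clang ``__atomic`` builtins: the struct and function names below (``pkt_stats``, ``stats_rx_inc``, ``stats_rx_read``) are hypothetical and are not part of the DPDK API.

```c
#include <stdint.h>

/* Hypothetical statistics block; illustrative only, not a DPDK structure. */
struct pkt_stats {
	uint64_t rx_packets;
};

/* Atomically add n to the counter. RELAXED guarantees atomicity of the
 * read-modify-write but imposes no ordering on surrounding memory accesses. */
static inline void
stats_rx_inc(struct pkt_stats *s, uint64_t n)
{
	__atomic_fetch_add(&s->rx_packets, n, __ATOMIC_RELAXED);
}

/* Read the counter without tearing; again no ordering is requested. */
static inline uint64_t
stats_rx_read(struct pkt_stats *s)
{
	return __atomic_load_n(&s->rx_packets, __ATOMIC_RELAXED);
}
```

A reader of such a counter sees some valid value of the count, which is usually all that statistics reporting needs.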
+One-way Barrier
+^^^^^^^^^^^^^^^
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+For example, the memory operations before the spinlock lock are allowed to
+move into the critical section, but the memory operations in the critical section
+are not allowed to move above the lock. In this case, the full memory barrier
+in the compare-and-swap operation can be replaced with ACQUIRE memory order.
+On the other hand, the memory operations after the spinlock unlock are allowed
+to move into the critical section, but the memory operations in the critical
+section are not allowed to move below the unlock. So the full barrier in the
+store operation can be replaced with RELEASE memory order.
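The lock and unlock paths described above could be sketched as follows, assuming the GCC/Clang ``__atomic`` builtins. The ``demo_spinlock`` type and its functions are hypothetical names for illustration and are not claimed to be the DPDK spinlock implementation.

```c
#include <stdbool.h>

/* Hypothetical minimal spinlock: 0 = unlocked, 1 = locked. */
struct demo_spinlock {
	int locked;
};

/* Single acquisition attempt. ACQUIRE on success keeps critical-section
 * accesses from moving above the lock; RELAXED suffices on failure. */
static inline bool
demo_spinlock_trylock(struct demo_spinlock *sl)
{
	int expected = 0;

	return __atomic_compare_exchange_n(&sl->locked, &expected, 1,
			false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void
demo_spinlock_lock(struct demo_spinlock *sl)
{
	while (!demo_spinlock_trylock(sl)) {
		/* Spin on a plain relaxed load until the lock looks free,
		 * then retry the compare-and-swap. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) != 0)
			;
	}
}

/* RELEASE keeps critical-section accesses from moving below the unlock. */
static inline void
demo_spinlock_unlock(struct demo_spinlock *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

Accesses outside the critical section may still move into it on either side, which is exactly the one-way behaviour the text permits.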
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+The payload, or the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to the guard
+variable is observable only after the store to the payload is observable.
+Correspondingly, on the reader side, the guard variable should be read
+with ACQUIRE memory order. The payload, or the data the writer communicated,
+can be read with RELAXED memory order. This ensures that, if the store to the
+guard variable is observable, the store to the payload is also observable.
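A minimal sketch of the guard-variable pattern, assuming the GCC/Clang ``__atomic`` builtins and a single writer. The ``demo_msg`` structure and the ``demo_publish`` / ``demo_try_consume`` helpers are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-slot message: guard is 0 until the payload is valid. */
struct demo_msg {
	uint32_t payload;
	uint32_t guard;
};

/* Writer: the payload store may be RELAXED because the RELEASE store to
 * the guard orders it before the guard becomes visible. */
static inline void
demo_publish(struct demo_msg *m, uint32_t value)
{
	__atomic_store_n(&m->payload, value, __ATOMIC_RELAXED);
	__atomic_store_n(&m->guard, 1, __ATOMIC_RELEASE);
}

/* Reader: the ACQUIRE load of the guard pairs with the writer's RELEASE
 * store, so a reader that sees guard == 1 also sees the payload. */
static inline bool
demo_try_consume(struct demo_msg *m, uint32_t *out)
{
	if (__atomic_load_n(&m->guard, __ATOMIC_ACQUIRE) == 0)
		return false;
	*out = __atomic_load_n(&m->payload, __ATOMIC_RELAXED);
	return true;
}
```

The reader returns ``false`` rather than spinning, so the caller decides whether to poll again or do other work.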
 Coding Considerations
 ---------------------