numam-dpdk/lib
Zhihong Wang f5472703c0 eal: optimize aligned memcpy on x86
This patch optimizes rte_memcpy for well-aligned cases, where both
the dst and src addresses are aligned to the maximum MOV width. It
introduces a dedicated function, rte_memcpy_aligned, to handle the
aligned cases with a simplified instruction stream. The existing
rte_memcpy is renamed to rte_memcpy_generic. The selection between
the two is done at the entry of rte_memcpy.
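
The entry-point selection described above can be sketched roughly as
follows. This is a minimal illustration, not the patch itself: the
sketch_* names, the path marker, and the 32-byte mask are assumptions
made for this example (the real maximum MOV width depends on the
target instruction set), and plain memcpy stands in for both copy
routines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte maximum MOV width; the real value is build-dependent. */
#define SKETCH_ALIGNMENT_MASK 0x1F

/* Illustration only: records which path ran (1 = aligned, 2 = generic). */
static int sketch_last_path;

/* Stand-in for the aligned path: no alignment fix-up would be needed here. */
static void *sketch_memcpy_aligned(void *dst, const void *src, size_t n)
{
    sketch_last_path = 1;
    return memcpy(dst, src, n);
}

/* Stand-in for the generic path, which would handle unaligned addresses. */
static void *sketch_memcpy_generic(void *dst, const void *src, size_t n)
{
    sketch_last_path = 2;
    return memcpy(dst, src, n);
}

/* Entry point: OR the two addresses so one test covers both dst and src. */
static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & SKETCH_ALIGNMENT_MASK) == 0)
        return sketch_memcpy_aligned(dst, src, n);
    return sketch_memcpy_generic(dst, src, n);
}
```

ORing the addresses before masking keeps the check to a single
branch, so the cost added to the generic path stays small.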

The existing rte_memcpy is for generic cases: it handles unaligned
copies and makes stores aligned, and it even makes loads aligned for
microarchitectures like Ivy Bridge. However, alignment handling comes
at a price: it adds extra load/store instructions, which can
sometimes cause complications.

Take DPDK Vhost memcpy with the Mergeable Rx Buffer feature as an
example: the copy is aligned and remote, and it is accompanied by a
header write that is also remote. In this case the memcpy instruction
stream should be simplified to reduce extra loads/stores, and
therefore reduce the probability of pipeline stalls caused by a full
load/store buffer, so that the actual memcpy instructions are issued
and the H/W prefetcher goes to work as early as possible.

This patch is tested on Ivy Bridge, Haswell and Skylake. It provides
up to 20% gain for Virtio Vhost PVP traffic, with packet sizes ranging
from 64 to 1500 bytes.

The test can also be conducted without a NIC, by setting up loopback
traffic between Virtio and Vhost. For example, set the macro
TXONLY_DEF_PACKET_LEN in testpmd.h to the requested packet size,
rebuild and start testpmd in both host and guest, then run "start" on
one side and "start tx_first 32" on the other.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
2017-01-17 16:40:05 +01:00
..
librte_acl lib: use C99 syntax for zero-size arrays 2016-09-13 15:35:28 +02:00
librte_cfgfile cfgfile: fix API comments 2016-11-06 23:34:40 +01:00
librte_cmdline cmdline: add alignment constraint 2016-12-23 10:19:11 +01:00
librte_compat compat: remove unneeded macro 2015-06-29 16:41:23 +02:00
librte_cryptodev cryptodev: clarify how operations affect buffers 2016-11-07 00:22:58 +01:00
librte_distributor distributor: remove inclusion of mbuf header 2015-05-11 15:36:37 +02:00
librte_eal eal: optimize aligned memcpy on x86 2017-01-17 16:40:05 +01:00
librte_ether ethdev: fix port data mismatched in multiple process model 2017-01-17 09:20:18 +01:00
librte_hash hash: fix bucket size usage 2016-10-12 18:40:52 +02:00
librte_ip_frag ip_frag: fix IP reassembly regression 2016-11-07 21:27:50 +01:00
librte_jobstats jobstats: fix typo in a comment 2016-06-30 18:51:20 +02:00
librte_kni doc: fix typos 2016-11-07 21:50:27 +01:00
librte_kvargs kvargs: make pointers in string arrays const 2017-01-13 19:28:26 +01:00
librte_lpm lpm: fix freeing memory 2016-11-06 23:46:03 +01:00
librte_mbuf mbuf: add a function to linearize a packet 2017-01-15 19:30:00 +01:00
librte_mempool mempool: use cache in single producer or consumer mode 2017-01-13 16:38:09 +01:00
librte_meter meter: fix excess token bucket update in srtcm 2016-09-21 22:56:03 +02:00
librte_net ethdev: add Tx preparation 2017-01-04 20:40:15 +01:00
librte_pdump doc: add pdump library to API doxygen 2016-12-06 15:43:13 +01:00
librte_pipeline lib: work around unnamed structs/unions 2016-09-13 15:35:28 +02:00
librte_port port: support file descriptor 2016-10-13 11:42:37 +02:00
librte_power examples/vm_power_manager: remove dependency on internal header 2016-07-11 17:23:32 +02:00
librte_reorder lib: add missing include dependencies 2016-09-13 15:35:28 +02:00
librte_ring doc: fix typos in code comments 2016-12-06 15:25:01 +01:00
librte_sched sched: fix releasing enqueued packets 2016-09-23 21:14:54 +02:00
librte_table table: add cuckoo hash 2016-10-12 22:08:36 +02:00
librte_timer timer: fix lag delay 2016-10-05 12:02:53 +02:00
librte_vhost vhost: allow many vhost-user ports 2017-01-17 09:20:18 +01:00
Makefile ivshmem: remove library and its EAL integration 2016-08-23 12:23:58 +02:00