numam-dpdk

Author	SHA1	Message	Date
Yongseok Koh	974f1e7ef1	net/mlx5: add new memory region support This is the new design of Memory Region (MR) for mlx PMD, in order to: - Accommodate the new memory hotplug model. - Support non-contiguous Mempool. There are multiple layers for MR search. L0 looks up the last-hit entry, which is pointed to by mr_ctrl->mru (Most Recently Used). If L0 misses, L1 looks up the address in a fixed-size array by linear search. L0/L1 is in an inline function - mlx5_mr_lookup_cache(). If L1 misses, the bottom-half function is called to look up the address in the bigger local cache of the queue. This is L2 - mlx5_mr_addr2mr_bh() and it is not an inline function. The data structure for L2 is a B-tree. If L2 misses, the search falls into the slowest path, which takes locks in order to access the global device cache (priv->mr.cache); this is also a B-tree and caches the original MR list (priv->mr.mr_list) of the device. Unless the global cache overflows, it is all-inclusive of the MR list. This is L3 - mlx5_mr_lookup_dev(). The size of the L3 cache table is limited and can't be expanded on the fly due to the risk of deadlock. Refer to the comments in the code for the details - mr_lookup_dev(). If L3 overflows, the list will have to be searched directly, bypassing the cache, although this is slower. If L3 misses, a new MR for the address should be created - mlx5_mr_create(). When it creates a new MR, it tries to register as many adjacent memsegs as possible that are virtually contiguous around the address. This must take two locks - memory_hotplug_lock and priv->mr.rwlock. Due to memory_hotplug_lock, there can't be any allocation/free of memory inside. In the free callback of the memory hotplug event, the freed space is searched for in the MR list and the corresponding bits are cleared from the bitmap of MRs. This can fragment an MR, and the MR will then have multiple search entries in the caches. 
Once the event causes a change, the global cache must be rebuilt and all the per-queue caches will be flushed as well. If memory is frequently freed at run time, this may, in the worst case, cause jitter in dataplane processing by incurring MR cache flushes and rebuilds. However, this should be the least probable scenario. To guarantee optimal performance, it is highly recommended to use the EAL option '--socket-mem'. Then, the reserved memory will be pinned and won't be freed dynamically. It is also recommended to configure a per-lcore Mempool cache. Even if a device has many MRs or the MRs are highly fragmented, the Mempool cache will help reduce misses on the per-queue caches. '--legacy-mem' is also supported. Signed-off-by: Yongseok Koh <yskoh@mellanox.com>	2018-05-14 22:31:51 +01:00
Yongseok Koh	d561b5dc13	net/mlx5: remove memory region support This patch removes current support of Memory Region (MR) in order to accommodate the dynamic memory hotplug patch. This patch can be compiled but traffic can't flow and HW will raise faults. Subsequent patches will add new MR support. Signed-off-by: Yongseok Koh <yskoh@mellanox.com>	2018-05-14 22:31:51 +01:00
Yongseok Koh	df428ceef4	net/mlx5: change device reference for secondary process rte_eth_devices[] is not shared between primary and secondary processes; it is a static array local to each process. The reverse pointer to the device (priv->dev) is therefore invalid. Instead, priv holds a pointer to the shared data of the device, struct rte_eth_dev_data *dev_data. Two macros are added: #define PORT_ID(priv) ((priv)->dev_data->port_id) #define ETH_DEV(priv) (&rte_eth_devices[PORT_ID(priv)]) Signed-off-by: Yongseok Koh <yskoh@mellanox.com>	2018-05-14 22:31:51 +01:00
Yongseok Koh	a2ceae5940	net/mlx5: fix alignment of memory region The memory region is [start, end), so if the memseg of 'end' isn't allocated yet, the returned memseg will have zero entries and this will make 'end' zero (nil). Fixes: `718e35999c` ("net/mlx5: use virt2memseg instead of iteration") Signed-off-by: Yongseok Koh <yskoh@mellanox.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>	2018-04-27 15:54:56 +01:00
Anatoly Burakov	66cc45e293	mem: replace memseg with memseg lists Before, we were aggregating multiple pages into one memseg, so the number of memsegs was small. Now, each page gets its own memseg, so the list of memsegs is huge. To accommodate the new memseg list size and to keep the under-the-hood workings sane, the memseg list is now not just a single list, but multiple lists. To be precise, each hugepage size available on the system gets one or more memseg lists, per socket. In order to support dynamic memory allocation, we reserve all memory in advance (unless we're in 32-bit legacy mode, in which case we do not preallocate memory). That is, we do an anonymous mmap() of the entire maximum size of memory per hugepage size, per socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or RTE_MAX_MEM_MB_PER_TYPE megabytes worth of memory, whichever is smaller), split over multiple lists (which are limited to either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_MB_PER_LIST megabytes per list, whichever is smaller). There is also a global limit of CONFIG_RTE_MAX_MEM_MB megabytes, which is mainly used for 32-bit targets to limit the amount of preallocated memory, but can be used to place an upper limit on the total amount of VA memory that can be allocated by a DPDK application. So, for each hugepage size, we get (by default) up to 128G worth of memory, per socket, split into chunks of up to 32G in size. The address space is claimed at the start, in eal_common_memory.c. The actual page allocation code is in eal_memalloc.c (Linux-only), and largely consists of copied EAL memory init code. Pages in the list are also indexed by address. That is, in order to figure out where a page belongs, one can simply look at the base address of a memseg list. Similarly, figuring out the IOVA address of a memzone is a matter of finding the right memseg list, getting the offset and dividing by the page size to get the appropriate memseg. 
This commit also removes the rte_eal_dump_physmem_layout() call, according to deprecation notice [1], and removes that deprecation notice as well. On 32-bit targets, due to limited VA space, DPDK will no longer spread memory to different sockets like before. Instead, it will (by default) allocate all of the memory on the socket where the master lcore is. To override this behavior, --socket-mem must be used. The rest of the changes are really ripple effects from the memseg change - heap changes, compile fixes, and rewrites to support fbarray-backed memseg lists. Due to the earlier switch to _walk() functions, most of the changes are simple fixes; however, some of the _walk() calls were switched to memseg list walk, where it made sense to do so. Additionally, we are also switching locks from flock() to fcntl(). Down the line, we will be introducing a single-file segments option, and we cannot use flock() locks to lock parts of a file. Therefore, we will use fcntl() locks for legacy mem as well, in case someone is unfortunate enough to accidentally start a legacy mem primary process alongside an already working non-legacy mem-based primary process. [1] http://dpdk.org/dev/patchwork/patch/34002/ Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:39 +02:00
Anatoly Burakov	718e35999c	net/mlx5: use virt2memseg instead of iteration Reduce dependency on internal details of EAL memory subsystem, and simplify code. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:02 +02:00
Shahaf Shuler	5feecc57d9	align SPDX Mellanox copyrights Aligning Mellanox SPDX copyrights to a single format. In addition, switch files that were missed over to SPDX licence tags. Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-04-11 01:47:47 +02:00
Nélio Laranjeiro	a170a30d22	net/mlx5: use dynamic logging Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	0f99970b4a	net/mlx5: use port id in PMD log Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	a6d83b6a92	net/mlx5: standardize on negative errno values Set rte_errno systematically as well. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	925061b58b	net/mlx5: change non failing function return values These functions return int although they are not supposed to fail, resulting in unnecessary checks in their callers. Some return an error where they should return a boolean. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	af4f09f282	net/mlx5: prefix all functions with mlx5 This change removes the need to distinguish unlocked priv_() functions, which are therefore renamed using a mlx5_() prefix for consistency. At the same time, all functions from mlx5 use a pointer to the ETH device instead of the one to the PMD private data. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	7b2423cd2e	net/mlx5: remove control path locks In the priv struct, only the memory region needs to be protected against concurrent access between the control plane and the data plane. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	0b3456e391	net/mlx5: remove useless empty lines Some empty lines have been added in the middle of the code without any reason. This commit removes them. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	fb732b0a49	net/mlx5: add missing function documentation Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	c9e88d35da	net/mlx5: normalize function prototypes Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Nélio Laranjeiro	56f08e1671	net/mlx5: mark parameters with unused attribute Replaces all (void)foo; statements with the __rte_unused macro except when variables are under #if statements. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-03-30 14:08:44 +02:00
Olivier Matz	8fd92a66c6	net/mlx5: use SPDX tags in 6WIND copyrighted files Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-02-01 02:32:52 +01:00
Nelio Laranjeiro	0e83b8e536	net/mlx5: move rdma-core calls to separate file This lays the groundwork for externalizing rdma-core as an optional run-time dependency instead of a mandatory one. No functional change. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-01-31 20:57:29 +01:00
Shahaf Shuler	a482a41a63	net/mlx5: fix secondary process mempool registration Secondary processes are not allowed to register mempools on the fly. The code will return an invalid memory key in such a case. Fixes: `87ec44ce16` ("net/mlx5: add operations for secondary process") Cc: stable@dpdk.org Signed-off-by: Shahaf Shuler <shahafs@mellanox.com> Signed-off-by: Xueming Li <xuemingl@mellanox.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>	2018-01-29 10:04:28 +01:00
Yongseok Koh	f81ec74843	net/mlx5: fix memory region lookup This patch reverts: commit `3a6f2eb8c5` ("net/mlx5: fix Memory Region registration") Although the granularity of chunks in a mempool is a cache line, addresses are extended to page boundaries when registering an MR (Memory Region), for performance reasons in the device. This can make some regions overlap, which can then cause Tx completion errors due to an incorrect LKEY search. If the error occurs, the Tx queue will get stuck. This is because the buffer address is compared against the aligned addresses of the Memory Region. Saving the original mempool addresses for comparison doesn't create any overlap. Fixes: `b0b0938457` ("net/mlx5: use buffer address for LKEY search") Fixes: `3a6f2eb8c5` ("net/mlx5: fix Memory Region registration") Cc: stable@dpdk.org Reported-by: Xueming Li <xuemingl@mellanox.com> Signed-off-by: Xueming Li <xuemingl@mellanox.com> Signed-off-by: Yongseok Koh <yskoh@mellanox.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>	2018-01-21 15:51:52 +01:00
Nélio Laranjeiro	6e78005a9b	net/mlx5: add reference counter on DPDK Tx queues Use the same design for the DPDK queue as for the Verbs queue, for symmetry. This also helps fix some issues, such as the DPDK queue release API, which is not expected to fail. With this design, the queue is released when its reference counter reaches 0. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Yongseok Koh <yskoh@mellanox.com>	2017-10-12 01:36:58 +01:00
Nélio Laranjeiro	f8fb87d51f	net/mlx5: add reference counter on memory region This patch introduces the Memory Region as a shared object: users get a reference to it by calling priv_mr_get(), or call priv_mr_new() to create the memory region. The latter registers the memory pool in the kernel driver and retrieves the associated memory region. This should help reduce the memory consumption caused by registering the same memory pool multiple times. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Yongseok Koh <yskoh@mellanox.com>	2017-10-12 01:36:58 +01:00
Nélio Laranjeiro	991b04f682	net/mlx5: prefix Tx structures and functions Prefix struct txq_ctrl and associated functions with mlx5. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Yongseok Koh <yskoh@mellanox.com>	2017-10-12 01:36:58 +01:00
Shachar Beiser	6b30a6a855	net/mlx5: replace network to host macros Signed-off-by: Shachar Beiser <shacharbe@mellanox.com> Acked-by: Yongseok Koh <yskoh@mellanox.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>	2017-10-06 02:49:48 +02:00
Nélio Laranjeiro	d052f5358b	net/mlx5: remove pedantic pragma Those are useless since DPDK headers have been cleaned up. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Acked-by: Shahaf Shuler <shahafs@mellanox.com>	2017-10-06 02:49:47 +02:00
Yongseok Koh	b0b0938457	net/mlx5: use buffer address for LKEY search When searching for the LKEY, if the search key is the mempool pointer, the 2nd cache line has to be accessed, and every search even requires checking whether a buffer is indirect. Using the buffer address as the search key instead reduces the cycles taken. Caching the last hit entry is beneficial as well. Signed-off-by: Yongseok Koh <yskoh@mellanox.com> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>	2017-07-07 11:50:02 +02:00
Bruce Richardson	fc5b160f3c	net/mlx: fix debug build with gcc 6.1 With recent gcc versions, e.g. gcc 6.1, compilation of mlx drivers with debug enabled produces lots of errors complaining that "pedantic" is not a warning level that can be ignored. error: ‘-pedantic’ is not an option that controls warnings [-Werror=pragmas] #pragma GCC diagnostic ignored "-pedantic" ^~~~~~~~~~~ These errors can be removed by changing the "-pedantic" to "-Wpedantic". Fixes: `7fae69eeff` ("mlx4: new poll mode driver") Fixes: `771fa900b7` ("mlx5: introduce new driver for Mellanox ConnectX-4 adapters") Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2016-09-30 12:27:18 +02:00
Nélio Laranjeiro	1d88ba1719	net/mlx5: refactor Tx data path Bypass Verbs to improve Tx performance. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Signed-off-by: Yaacov Hazan <yaacovh@mellanox.com> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2016-06-27 16:17:52 +02:00
Nélio Laranjeiro	21c8bb4928	net/mlx5: split Tx queue structure To keep the data path as efficient as possible, move fields only useful to the control path into new structure txq_ctrl. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2016-06-27 16:17:52 +02:00
Nélio Laranjeiro	491770fafc	net/mlx5: split memory registration function Except for the first time when memory registration occurs, the lkey is always cached. Since memory registration is slow and performs system calls, performance can be improved by moving that code to its own function outside of the data path so only the lookup code is left in the original inlined function. Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2016-06-27 16:17:51 +02:00

31 Commits
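The L0/L1 lookup described in commit 974f1e7ef1 (check the most recently used entry, then linearly scan a fixed-size array of [start, end) ranges) can be sketched in C as follows. This is a minimal illustration, not the driver's mlx5_mr_lookup_cache(); the names mr_cache_entry, mr_ctrl_sketch, mr_lookup_cache_sketch, MR_CACHE_N and LKEY_MISS are hypothetical, and a miss simply returns LKEY_MISS where the driver would fall through to the L2 bottom half.

```c
#include <assert.h>
#include <stdint.h>

#define MR_CACHE_N 8          /* hypothetical size of the fixed L1 array */
#define LKEY_MISS UINT32_MAX  /* hypothetical "not found" marker */

/* One search entry: the MR covers the range [start, end). */
struct mr_cache_entry {
	uintptr_t start;
	uintptr_t end;
	uint32_t lkey;
};

/* Hypothetical per-queue control block holding only L0 and L1. */
struct mr_ctrl_sketch {
	uint16_t mru;                            /* L0: index of the last hit */
	struct mr_cache_entry cache[MR_CACHE_N]; /* L1: fixed-size array */
};

static inline uint32_t
mr_lookup_cache_sketch(struct mr_ctrl_sketch *ctrl, uintptr_t addr)
{
	struct mr_cache_entry *e;
	uint16_t i;

	assert(ctrl->mru < MR_CACHE_N);
	/* L0: check the most recently used entry first. */
	e = &ctrl->cache[ctrl->mru];
	if (addr >= e->start && addr < e->end)
		return e->lkey;
	/* L1: linear search over the fixed-size array. */
	for (i = 0; i < MR_CACHE_N; ++i) {
		e = &ctrl->cache[i];
		if (addr >= e->start && addr < e->end) {
			ctrl->mru = i; /* remember the hit for the next L0 check */
			return e->lkey;
		}
	}
	/* Miss: a real implementation would now search the L2 cache. */
	return LKEY_MISS;
}
```

Keeping this path small and inline matches the intent stated in the commit: a hit costs one range compare or a short scan, and only misses pay for the out-of-line bottom-half call. Empty entries with start == end == 0 never match, because the range is half-open.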
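The two macros quoted in commit df428ceef4 can be exercised with stand-in types. In this sketch, eth_dev_data_sketch, eth_dev_sketch, eth_devices_sketch, priv_sketch, priv_to_dev and MAX_PORTS_SKETCH are hypothetical placeholders for DPDK's struct rte_eth_dev_data, struct rte_eth_dev, rte_eth_devices[] and RTE_MAX_ETHPORTS, so ETH_DEV() here indexes eth_devices_sketch rather than rte_eth_devices. It shows the idea: priv stores only a pointer to the shared device data, and each process resolves its own device array entry through the port ID.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PORTS_SKETCH 4 /* hypothetical stand-in for RTE_MAX_ETHPORTS */

/* Stand-in for struct rte_eth_dev_data, which lives in shared memory. */
struct eth_dev_data_sketch {
	uint16_t port_id;
};

/* Stand-in for struct rte_eth_dev, an entry of a per-process array. */
struct eth_dev_sketch {
	struct eth_dev_data_sketch *data;
};

/* Per-process array, like rte_eth_devices[] (not shared between processes). */
static struct eth_dev_sketch eth_devices_sketch[MAX_PORTS_SKETCH];

/* PMD private data keeps a pointer to the shared data, not to the device. */
struct priv_sketch {
	struct eth_dev_data_sketch *dev_data;
};

/* Same shape as the macros in the commit, adapted to the stand-in array. */
#define PORT_ID(priv) ((priv)->dev_data->port_id)
#define ETH_DEV(priv) (&eth_devices_sketch[PORT_ID(priv)])

/* Hypothetical bounds-checked wrapper around ETH_DEV(). */
static inline struct eth_dev_sketch *
priv_to_dev(struct priv_sketch *priv)
{
	assert(PORT_ID(priv) < MAX_PORTS_SKETCH);
	return ETH_DEV(priv);
}
```

Because the port ID lives in the shared data, a primary and a secondary process that each hold their own copy of the device array still agree on which entry belongs to a given priv.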
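Commit 66cc45e293 notes that pages in a memseg list are indexed by address: find the list whose VA range contains the address, take the offset from the list's base address, and divide by the page size. A minimal sketch of that arithmetic follows; memseg_list_sketch and the msl_* helpers are hypothetical simplifications (one contiguous VA reservation per list, one fixed page size), not the EAL's struct rte_memseg_list or its API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memseg list: one contiguous VA reservation. */
struct memseg_list_sketch {
	uintptr_t base_va; /* start of the reserved VA area */
	size_t page_sz;    /* hugepage size backing this list */
	size_t n_segs;     /* number of page-sized slots in the list */
};

/* Return nonzero if addr falls inside this list's VA range. */
static int
msl_contains(const struct memseg_list_sketch *msl, uintptr_t addr)
{
	return addr >= msl->base_va &&
	       addr - msl->base_va < msl->n_segs * msl->page_sz;
}

/* Find the list that owns addr, or NULL if none does. */
static const struct memseg_list_sketch *
msl_find(const struct memseg_list_sketch *lists, size_t n, uintptr_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (msl_contains(&lists[i], addr))
			return &lists[i];
	return NULL;
}

/* Index of the memseg (page) holding addr: offset divided by page size. */
static size_t
msl_seg_idx(const struct memseg_list_sketch *msl, uintptr_t addr)
{
	assert(msl->page_sz != 0 && msl_contains(msl, addr));
	return (addr - msl->base_va) / msl->page_sz;
}
```

With 2 MB pages starting at 0x10000000, for example, address 0x10600010 lies at offset 0x600010 and therefore in memseg 3. No per-page search is needed once the owning list is known.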
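Commits f8fb87d51f and 6e78005a9b describe shared objects that are either looked up with an extra reference (priv_mr_get()) or created (priv_mr_new()), and released once their reference counter reaches 0. The sketch below illustrates that get-or-create pattern for an MR keyed by its mempool. mr_sketch, mr_get_sketch(), mr_new_sketch() and mr_release_sketch() are hypothetical names, and the actual Verbs registration is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared MR object keyed by the mempool it covers. */
struct mr_sketch {
	const void *mp;         /* mempool this MR was registered for */
	unsigned int refcnt;    /* number of users holding a reference */
	struct mr_sketch *next; /* per-device list linkage */
};

/* Return an existing MR for mp with one more reference, or NULL. */
static struct mr_sketch *
mr_get_sketch(struct mr_sketch *head, const void *mp)
{
	for (struct mr_sketch *mr = head; mr != NULL; mr = mr->next)
		if (mr->mp == mp) {
			mr->refcnt++;
			return mr;
		}
	return NULL;
}

/* Create a new MR holding one reference (registration itself omitted). */
static struct mr_sketch *
mr_new_sketch(struct mr_sketch **head, const void *mp)
{
	struct mr_sketch *mr = calloc(1, sizeof(*mr));

	if (mr == NULL)
		return NULL;
	mr->mp = mp;
	mr->refcnt = 1;
	mr->next = *head;
	*head = mr;
	return mr;
}

/* Drop one reference; unlink and free on the last one. Returns the count left. */
static unsigned int
mr_release_sketch(struct mr_sketch **head, struct mr_sketch *mr)
{
	assert(mr->refcnt > 0);
	if (--mr->refcnt > 0)
		return mr->refcnt;
	for (struct mr_sketch **pp = head; *pp != NULL; pp = &(*pp)->next)
		if (*pp == mr) {
			*pp = mr->next;
			break;
		}
	free(mr);
	return 0;
}
```

A caller would try mr_get_sketch() first and fall back to mr_new_sketch() only on NULL, so a mempool shared by several queues is registered once.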