numam-dpdk

Author	SHA1	Message	Date
Gagandeep Singh	47caefc163	eal: increase maximum different hugepage sizes on Arm ARM is supporting maximum 4 hugepage sizes (64K, 2M, 32M and 1G) when granule is 4KB since very long and DPDK support maximum 3 hugepage sizes. With all 4 hugepage sizes enabled, applications and some stacks like VPP which are working over DPDK and using "in-memory" eal option, or using separate mount points on ARM based platform, fails at huge page initialization, reporting error messages from eal: EAL: FATAL: Cannot get hugepage information. EAL: Cannot get hugepage information. EAL: Error - exiting with code: 1 This issue is originated from Linux 5.0 (a21b0b78eaf7 "arm64: hugetlb: Register hugepages during arch init") where kernel is by default creating directories for each supported hugepage size in /sys/kernel/mm/hugepages/ On earlier Stable Kernel LTR's, the directories visible in /sys/kernel/mm/hugepages/ were dependent upon what hugepage sizes are configured at boot time. This change increases the maximum supported hugepage sizes to 4 for ARM based platforms. Cc: stable@dpdk.org Signed-off-by: Gagandeep Singh <g.singh@nxp.com> Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>	2019-08-08 17:25:14 +02:00
David Marchand	c3568ea376	eal: restrict control threads to startup CPU affinity Spawning the ctrl threads on anything that is not part of the eal coremask is not that polite to the rest of the system, especially when you took good care to pin your processes on cpu resources with tools like taskset (linux) / cpuset (freebsd). Rather than introduce yet another eal options to control on which cpu those ctrl threads are created, let's take the startup cpu affinity as a reference and remove the eal coremask from it. If no cpu is left, then we default to the master core. The cpuset is computed once at init before the original cpu affinity is lost. Introduced a RTE_CPU_AND macro to abstract the differences between linux and freebsd respective macros. Examples in a 4 cores FreeBSD vm: $ ./build/app/testpmd -l 2,3 --no-huge --no-pci -m 512 \ -- -i --total-num-mbufs=2048 $ procstat -S 1057 PID TID COMM TDNAME CPU CSID CPU MASK 1057 100131 testpmd - 2 1 2 1057 100140 testpmd eal-intr-thread 1 1 0-1 1057 100141 testpmd rte_mp_handle 1 1 0-1 1057 100142 testpmd lcore-slave-3 3 1 3 $ cpuset -l 1,2,3 ./build/app/testpmd -l 2,3 --no-huge --no-pci -m 512 \ -- -i --total-num-mbufs=2048 $ procstat -S 1061 PID TID COMM TDNAME CPU CSID CPU MASK 1061 100131 testpmd - 2 2 2 1061 100144 testpmd eal-intr-thread 1 2 1 1061 100145 testpmd rte_mp_handle 1 2 1 1061 100147 testpmd lcore-slave-3 3 2 3 $ cpuset -l 2,3 ./build/app/testpmd -l 2,3 --no-huge --no-pci -m 512 \ -- -i --total-num-mbufs=2048 $ procstat -S 1065 PID TID COMM TDNAME CPU CSID CPU MASK 1065 100131 testpmd - 2 2 2 1065 100148 testpmd eal-intr-thread 2 2 2 1065 100149 testpmd rte_mp_handle 2 2 2 1065 100150 testpmd lcore-slave-3 3 2 3 Fixes: `d651ee4919` ("eal: set affinity for control threads") Cc: stable@dpdk.org Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com>	2019-03-07 19:21:28 +01:00
Anatoly Burakov	66d9f61de0	eal: fix strdup usages in internal config Currently, we use strdup in a few places to store command-line parameter values for certain internal config values. There are several issues with that. First of all, they're never freed, so memory ends up leaking either after EAL exit, or when these command-line options are supplied multiple times. Second of all, they're defined as `const char *`, so they cannot be freed even if we wanted to. Finally, strdup may return NULL, which will be stored in the config. For most fields, NULL is a valid value, but for the default prefix, the value is always expected to be valid. To fix all of this, three things are done. First, we change the definitions of these values to `char *` as opposed to `const char *`. This does not break the ABI, and previous code assumes constness (which is more restrictive), so it's safe to do so. Then, fix all usages of strdup to check return value, and add a cleanup function that will free the memory occupied by these strings, as well as freeing them before assigning a new value to prevent leaks when parameter is specified multiple times. And finally, add an internal API to query hugefile prefix, so that, absent of a valid value, a default value will be returned, and also fix up all usages of hugefile prefix to use this API instead of accessing hugefile prefix directly. Bugzilla ID: 108 Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-01-14 15:05:19 +01:00
Jim Harris	476c847ab6	malloc: add option --match-allocations SPDK uses the rte_mem_event_callback_register API to create RDMA memory regions (MRs) for newly allocated regions of memory. This is used in both the SPDK NVMe-oF target and the NVMe-oF host driver. DPDK creates internal malloc_elem structures for these allocated regions. As users malloc and free memory, DPDK will sometimes merge malloc_elems that originated from different allocations that were notified through the registered mem_event callback routine. This results in subsequent allocations that can span across multiple RDMA MRs. This requires SPDK to check each DPDK buffer to see if it crosses an MR boundary, and if so, would have to add considerable logic and complexity to describe that buffer before it can be accessed by the RNIC. It is somewhat analogous to rte_malloc returning a buffer that is not IOVA-contiguous. As a malloc_elem gets split and some of these elements get freed, it can also result in DPDK sending an RTE_MEM_EVENT_FREE notification for a subset of the original RTE_MEM_EVENT_ALLOC notification. This is also problematic for RDMA memory regions, since unregistering the memory region is all-or-nothing. It is not possible to unregister part of a memory region. To support these types of applications, this patch adds a new --match-allocations EAL init flag. When this flag is specified, malloc elements from different hugepage allocations will never be merged. Memory will also only be freed back to the system (with the requisite memory event callback) exactly as it was originally allocated. Since part of this patch is extending the size of struct malloc_elem, we also fix up the malloc autotests so they do not assume its size exactly fits in one cacheline. Signed-off-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-12-20 13:01:08 +01:00
Santosh Shukla	783667c9f9	eal: add --iova-mode option If the user does not want to use the bus IOVA scheme and wants to override it, the new EAL option --iova-mode=<string> can be used, where the valid input string is 'pa' or 'va'. Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Signed-off-by: Eric Zhang <eric.zhang@windriver.com> Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-10-28 23:41:26 +01:00
Anatoly Burakov	14de8734c4	eal: add --in-memory option This command-line option will cause DPDK to operate entirely in memory and not create any shared files at runtime, including any shared configuration or hugetlbfs files. This is useful for debug purposes, as well as for certain use cases like containers or automatic memory cleanup. Currently, this option acts as a strict superset of --no-shconf and --huge-unlink commands. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 15:35:26 +02:00
Anatoly Burakov	e4348122a4	eal: add option to limit memory allocation on sockets Previously, it was possible to limit maximum amount of memory allowed for allocation by creating validator callbacks. Although a powerful tool, it's a bit of a hassle and requires modifying the application for it to work with DPDK example applications. Fix this by adding a new parameter "--socket-limit", with syntax similar to "--socket-mem", which would set per-socket memory allocation limits, and set up a default validator callback to deny all allocations above the limit. This option is incompatible with legacy mode, as validator callbacks are not supported there. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:44:15 +02:00
Anatoly Burakov	cb97d93e9d	mem: share hugepage info primary and secondary Since we are going to need to map hugepages in both primary and secondary processes, we need to know where we should look for hugetlbfs mountpoints. So, share those with secondary processes, and map them on init. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	2a04139f66	eal: add single file segments option Currently, DPDK stores all pages as separate files in hugetlbfs. This option will allow storing all pages in one file (one file per memseg list). We do this by using fallocate() calls on Linux, however this is only supported on fairly recent (4.3+) kernels, so ftruncate() fallback is provided to grow (but not shrink) hugepage files. Naming scheme is deterministic, so both primary and secondary processes will be able to easily map needed files and offsets. For multi-file segments, we can close fd's right away. For single-file segments, we can reuse the same fd and reduce the amount of fd's needed to map/use hugepages. However, we need to store the fd's somewhere, so we add a tailq. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	66cc45e293	mem: replace memseg with memseg lists Before, we were aggregating multiple pages into one memseg, so the number of memsegs was small. Now, each page gets its own memseg, so the list of memsegs is huge. To accommodate the new memseg list size and to keep the under-the-hood workings sane, the memseg list is now not just a single list, but multiple lists. To be precise, each hugepage size available on the system gets one or more memseg lists, per socket. In order to support dynamic memory allocation, we reserve all memory in advance (unless we're in 32-bit legacy mode, in which case we do not preallocate memory). As in, we do an anonymous mmap() of the entire maximum size of memory per hugepage size, per socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or RTE_MAX_MEM_MB_PER_TYPE megabytes worth of memory, whichever is the smaller one), split over multiple lists (which are limited to either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_MB_PER_LIST megabytes per list, whichever is the smaller one). There is also a global limit of CONFIG_RTE_MAX_MEM_MB megabytes, which is mainly used for 32-bit targets to limit amounts of preallocated memory, but can be used to place an upper limit on total amount of VA memory that can be allocated by DPDK application. So, for each hugepage size, we get (by default) up to 128G worth of memory, per socket, split into chunks of up to 32G in size. The address space is claimed at the start, in eal_common_memory.c. The actual page allocation code is in eal_memalloc.c (Linux-only), and largely consists of copied EAL memory init code. Pages in the list are also indexed by address. That is, in order to figure out where the page belongs, one can simply look at base address for a memseg list. Similarly, figuring out IOVA address of a memzone is a matter of finding the right memseg list, getting offset and dividing by page size to get the appropriate memseg. 
This commit also removes rte_eal_dump_physmem_layout() call, according to deprecation notice [1], and removes that deprecation notice as well. On 32-bit targets due to limited VA space, DPDK will no longer spread memory to different sockets like before. Instead, it will (by default) allocate all of the memory on socket where master lcore is. To override this behavior, --socket-mem must be used. The rest of the changes are really ripple effects from the memseg change - heap changes, compile fixes, and rewrites to support fbarray-backed memseg lists. Due to earlier switch to _walk() functions, most of the changes are simple fixes, however some of the _walk() calls were switched to memseg list walk, where it made sense to do so. Additionally, we are also switching locks from flock() to fcntl(). Down the line, we will be introducing single-file segments option, and we cannot use flock() locks to lock parts of the file. Therefore, we will use fcntl() locks for legacy mem as well, in case someone is unfortunate enough to accidentally start legacy mem primary process alongside an already working non-legacy mem-based primary process. [1] http://dpdk.org/dev/patchwork/patch/34002/ Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:39 +02:00
Anatoly Burakov	182cf0c28d	eal: add legacy memory option This adds a "--legacy-mem" command-line switch. It will be used to go back to the old memory behavior, one where we can't dynamically allocate/free memory (the downside), but one where the user can get physically contiguous memory, like before (the upside). For now, nothing but the legacy behavior exists, non-legacy memory init sequence will be added later. For FreeBSD, non-legacy memory init will never be enabled, while for Linux, it is disabled in this patch to avoid breaking bisect, but will be enabled once non-legacy mode will be fully operational. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:13 +02:00
Anatoly Burakov	a99c96e96a	eal: add internal flag of init completed Currently, primary process initialization is finalized by setting the RTE_MAGIC value in the shared config. However, it is not possible to check whether secondary process initialization has completed. Add such a value to internal config. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-03-21 18:42:34 +01:00
Hemant Agrawal	96fd032ba8	eal: prefix mbuf pool ops name with user defined This patch prefixes the mbuf pool ops name with "user" to indicate that it is user defined. Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com> Acked-by: Olivier Matz <olivier.matz@6wind.com> Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>	2018-01-29 18:52:07 +01:00
Bruce Richardson	369991d997	lib: use SPDX tag for Intel copyright files Replace the BSD license header with the SPDX tag for files with only an Intel copyright on them. Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>	2018-01-04 22:41:39 +01:00
Jianfeng Tan	f26ab687a7	eal: remove Xen dom0 support We remove xen-specific code in EAL, including the option --xen-dom0, memory initialization code, compiling dependency, etc. Related documents are removed or updated, and bump the eal library version. Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>	2017-10-09 01:54:29 +02:00
Santosh Shukla	a103a97e71	eal: allow user to override default mempool driver DPDK has support for both sw and hw mempool and currently user is limited to use ring_mp_mc pool. In case user want to use other pool handle, need to update config RTE_MEMPOOL_OPS_DEFAULT, then build and run with desired pool handle. Introducing eal option to override default pool handle. Now user can override the RTE_MEMPOOL_OPS_DEFAULT by passing pool handle to eal `--mbuf-pool-ops-name=""`. Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com> Acked-by: Olivier Matz <olivier.matz@6wind.com>	2017-10-06 20:48:22 +02:00
Olivier Matz	9348ca1602	eal: remove log level from internal config This field is only used in the initialization phase. Remove it since the global log level can also be retrieved using a public API: rte_log_get_global_level(). Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>	2017-04-20 01:29:11 +02:00
Thomas Monjalon	e6e6440d33	doc: fix doxygen syntax of some comments Some comments have a wrong space between /** and <. Seen with git grep '\\ <' Reported-by: David Marchand <david.marchand@6wind.com> Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>	2015-11-04 11:56:37 +01:00
Shesha Sreenivasamurthy	9e21671599	eal: add option to delete hugepage backing files When an application using huge-pages crashes or exits, the hugetlbfs backing files are not cleaned up. This is a patch to clean those files. There are multi-process DPDK applications that may benefit from those backing files. Therefore, I have made that configurable so that the application that does not need those backing files can remove them, thus not changing the current default behavior. The application itself can clean it up, however the rationale behind DPDK cleaning it up is, DPDK created it and therefore, it is better it unlinks it. Signed-off-by: Shesha Sreenivasamurthy <shesha@cisco.com> Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>	2015-11-04 02:00:28 +01:00
Michael Qiu	3736db4f95	eal: fix build for 32-bit system lib/librte_eal/linuxapp/eal/eal_memory.c:324:4: error: comparison is always false due to limited range of data type [-Werror=type-limits] || (hugepage_sz == RTE_PGSIZE_16G)) { ^ This was introduced by commit `b77b5639`: mem: add huge page sizes for IBM Power The root cause is that size_t is 32-bit in i686 platform, but RTE_PGSIZE_16M and RTE_PGSIZE_16G are always 64-bit. Force hugepage_sz to always 64-bit to avoid this issue. Signed-off-by: Michael Qiu <michael.qiu@intel.com> Suggested-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Thomas Monjalon <thomas.monjalon@6wind.com>	2014-12-11 01:42:02 +01:00
Thomas Monjalon	341befa2a2	eal: factorize internal config reset Now that internal config structure is common to Linux and BSD, we can have a common function to initialize it. Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2014-11-25 13:33:31 +01:00
Thomas Monjalon	33e25b3394	eal: fix header guards Some guards are missing or have a wrong name. Others have LINUXAPP in their name but are now common. Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2014-11-25 13:30:23 +01:00
Thomas Monjalon	8828a3210c	eal: factorize common headers No need to have different headers for Linux and BSD. These files are identical, with the exception of the internal config, which has uio and vfio fields only useful for Linux. Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2014-11-25 13:16:24 +01:00
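
Commit 66cc45e293 in the list above notes that pages in a memseg list are indexed by address: take the offset from the list's base address and divide by the page size. A minimal sketch of that lookup follows. The struct and function names (`msl_sketch`, `msl_sketch_page_idx`) are hypothetical stand-ins for illustration and are not DPDK's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a memseg list: one contiguous
 * virtual address region carved into equally sized pages. */
struct msl_sketch {
	uintptr_t base_va; /* start of the reserved VA region */
	uint64_t page_sz;  /* page size backing this list, in bytes */
	size_t n_pages;    /* number of page slots in the list */
};

/* Return the index of the page slot that contains addr,
 * or -1 if addr lies outside the list. */
static long
msl_sketch_page_idx(const struct msl_sketch *msl, uintptr_t addr)
{
	uint64_t off, idx;

	assert(msl->page_sz != 0);
	if (addr < msl->base_va)
		return -1;
	off = (uint64_t)(addr - msl->base_va); /* offset into the list */
	idx = off / msl->page_sz;              /* page slot index */
	if (idx >= msl->n_pages)
		return -1;
	return (long)idx;
}
```

With a list based at 0x40000000 holding four 2 MB pages, an address 5 bytes into the second page maps to index 1, and any address at or beyond the end of the fourth page is rejected.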

23 Commits
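
The rule described in commit c3568ea376 (control threads get the startup CPU affinity minus the EAL coremask, falling back to the master core when nothing is left) can be sketched as below. This is an illustration only: it assumes CPU sets fit in a 64-bit mask (bit n stands for CPU n) rather than using the platform cpuset types, and the function name `ctrl_cpuset_sketch` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the CPU mask for control threads.
 * startup_affinity: CPUs the process was allowed to run on at startup.
 * eal_coremask:     CPUs claimed by EAL lcores.
 * master_lcore:     CPU used as fallback when no other CPU remains. */
static uint64_t
ctrl_cpuset_sketch(uint64_t startup_affinity, uint64_t eal_coremask,
		unsigned int master_lcore)
{
	uint64_t set;

	assert(master_lcore < 64);
	/* Remove the EAL cores from the startup affinity. */
	set = startup_affinity & ~eal_coremask;
	/* If no CPU is left, default to the master core. */
	if (set == 0)
		set = UINT64_C(1) << master_lcore;
	return set;
}
```

Applied to the three FreeBSD runs quoted in that commit (4 cores, `-l 2,3`), the sketch yields CPUs 0-1 with no cpuset restriction, CPU 1 under `cpuset -l 1,2,3`, and CPU 2 under `cpuset -l 2,3`, which agrees with the CPU MASK column shown for eal-intr-thread and rte_mp_handle.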