numam-dpdk

Author	SHA1	Message	Date
Anatoly Burakov	c63a42535a	vfio: fix uninitialized variable Some static analyzers complain about it, even though value is never used if not initialized. To avoid additional false positives about a potential null-pointer dereferences, also add a null-check. Bugzilla ID: 58 Fixes: `ea2dc10668` ("vfio: add multi container support") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:44:56 +02:00
Anatoly Burakov	96712b33af	eal/linux: fix uninitialized value The value is not used, but some static analyzers may give out a warning. Fix it by assigning default value of zero. Bugzilla ID: 58 Fixes: `cdc242f260` ("eal/linux: support running as unprivileged user") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:44:43 +02:00
Anatoly Burakov	462dd3722e	eal/linux: fix invalid syntax in interrupts Parentheses were missing. It worked because macro is enclosed in parentheses, so syntax was valid after macro expansion. Bugzilla ID: 58 Fixes: `0a45657a67` ("pci: rework interrupt handling") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:44:17 +02:00
Anatoly Burakov	e4348122a4	eal: add option to limit memory allocation on sockets Previously, it was possible to limit maximum amount of memory allowed for allocation by creating validator callbacks. Although a powerful tool, it's a bit of a hassle and requires modifying the application for it to work with DPDK example applications. Fix this by adding a new parameter "--socket-limit", with syntax similar to "--socket-mem", which would set per-socket memory allocation limits, and set up a default validator callback to deny all allocations above the limit. This option is incompatible with legacy mode, as validator callbacks are not supported there. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:44:15 +02:00
Anatoly Burakov	0b82bd7b24	memzone: improve zero-length reserve Currently, reserving zero-length memzones is done by looking at malloc statistics, and reserving biggest sized element found in those statistics. This has two issues. First, there is a race condition. The heap is unlocked between the time we check stats, and the time we reserve malloc element for memzone. This may lead to inability to reserve the memzone we wanted to reserve, because another allocation might have taken place and biggest sized element may no longer be available. Second, the size returned by malloc statistics does not include any alignment information, which is worked around by being conservative and subtracting alignment length from the final result. This leads to fragmentation and reserving memzones that could have been bigger but aren't. Fix all of this by using earlier-introduced operation to reserve biggest possible malloc element. This, however, comes with a trade-off, because we can only lock one heap at a time. So, if we check the first available heap and find any element at all, that element will be considered "the biggest", even though other heaps might have bigger elements. We cannot know what other heaps have before we try and allocate it, and it is not a good idea to lock all of the heaps at the same time, so, we will just document this limitation and encourage users to reserve memzones with socket id properly set. Also, fixup unit tests to account for the new behavior. Fixes: `fafcc11985` ("mem: rework memzone to be allocated by malloc") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:27:30 +02:00
Anatoly Burakov	68b6092bd3	malloc: allow reserving biggest element Add an internal-only function to allocate biggest element from the heap. Nominally, it supports SOCKET_ID_ANY as its socket argument, but it's essentially useless because other sockets will only be allocated from if the entire heap on current or specified socket is busy. Still, asking to reserve a biggest element will allow fixing race condition in memzone reserve that has been there for a long time. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Remy Horton <remy.horton@intel.com>	2018-07-13 11:27:27 +02:00
Anatoly Burakov	9fe6bceafd	malloc: add finding biggest free IOVA-contiguous element Adding internal-only function to find biggest free IOVA-contiguous malloc element. This is not exposed to external API. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Remy Horton <remy.horton@intel.com> Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>	2018-07-13 11:23:07 +02:00
Anatoly Burakov	e43a9f52b7	malloc: fix pad erasing Previously, when joining adjacent free elements, we were erasing trailer and header, but did not erase the padding. Fix this by accounting for padding on erase, and do not erase padding twice by adjusting data pointer and data len to not include padding. Fixes: `bb372060da` ("malloc: make heap a doubly-linked list") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:21:30 +02:00
Anatoly Burakov	e26415428f	mem: provide thread-unsafe memseg list walk variant Sometimes, user code needs to walk memseg list while being inside a memory-related callback. Rather than making everyone copy around the same iteration code and depending on DPDK internals, provide an official way to do memseg_list_walk() inside callbacks. Also, remove existing reimplementation from memalloc code and use the new API instead. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:21:25 +02:00
Anatoly Burakov	7c790af08f	mem: provide thread-unsafe memseg walk variant Sometimes, user code needs to walk memseg list while being inside a memory-related callback. Rather than making everyone copy around the same iteration code and depending on DPDK internals, provide an official way to do memseg_walk() inside callbacks. Also, remove existing reimplementation from sPAPR VFIO code and use the new API instead. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:21:15 +02:00
Anatoly Burakov	b917147601	mem: provide thread-unsafe contig walk variant Sometimes, user code needs to walk memseg list while being inside a memory-related callback. Rather than making everyone copy around the same iteration code and depending on DPDK internals, provide an official way to do memseg_contig_walk() inside callbacks. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:20:06 +02:00
Anatoly Burakov	76480e3885	mem: mark pages as freeable on exit When rte_eal_cleanup() is called, it is expected that DPDK will be able to release all of its memory back to the system. However, if pages are marked as unfreeable, the pages will not be released back. Fix this to mark all pages as freeable on calling rte_eal_cleanup(), but only do it for primary process, as secondaries can come and go. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:06:14 +02:00
Anatoly Burakov	179f916e88	mem: allocate in reverse to reduce fragmentation Currently, all hugepages are allocated from lower VA address to higher VA address, while malloc heap allocates from higher VA address to lower VA address. This results in heap fragmentation over time due to multiple reserves leaving small space below the allocated elements. Fix this by allocating VA memory from the top, thereby reducing fragmentation and lowering overall memory usage. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:04:53 +02:00
Anatoly Burakov	4d2dde26aa	fbarray: add reverse finding of contiguous Add a function to return starting point of current contiguous block, going backwards. All semantics are kept the same as the existing function, with the only difference being that given the same input, results will be returned in reverse order. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:03:44 +02:00
Anatoly Burakov	e1ca5dc862	fbarray: add reverse finding of chunk Add a function to look for N used/free slots, but going backwards instead of forwards. All semantics are kept similar to the existing function, with the difference being that given the same input, the same results will be returned in reverse order. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:03:16 +02:00
Anatoly Burakov	b8d07c5252	fbarray: add reverse finding Add function to look up used/free indexes starting from specified index, but going backwards instead of forward. Semantics are kept similar to the existing function, except for the fact that, given the same input, the results returned will be in reverse order. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:02:39 +02:00
Anatoly Burakov	9777a143ca	fbarray: reduce duplication in element finding Just code move to put all checks and calls in one place. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 11:02:31 +02:00
Anatoly Burakov	66656d0bf9	fbarray: reduce duplication in chunk finding Mostly code move, aside from more quick checks done to avoid doing computations in obviously hopeless cases. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 10:52:04 +02:00
Anatoly Burakov	594adef0f4	fbarray: reduce duplication in contiguous finding Mostly a code move, to have all code related to find_contig in one place. This slightly changes the API in that previously, calling find_contig_free() on a full fbarray would've been an error, but equivalent call to find_contig_used() on an empty array does not return an error, leading to an inconsistency in the API. The decision was made to not treat this condition as an error, because it is equivalent to calling find_contig() on an index that just happens to be used/free, which is not an error and will return 0. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 10:51:23 +02:00
Anatoly Burakov	a148861aa8	fbarray: fix errno values returned from functions Errno values are supposed to be positive, yet they were negative. This changes API, so not backporting. Fixes: `c44d09811b` ("eal: add shared indexed file-backed array") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 10:48:41 +02:00
Anatoly Burakov	1d406458db	mem: make segment preallocation OS-specific In the perfect world, it wouldn't matter how much memory was preallocated because most of it was always going to be private anonymous zero-page mappings for the duration of the program. However, in practice, due to peculiarities of FreeBSD, we need to additionally limit memory allocation there. This patch moves the segment preallocation to EAL private functions that will be implemented by an OS-specific EAL rather than being in the common memory-related code. Since there is no support for growing/shrinking memory use at runtime on FreeBSD anyway, this does not inhibit any functionality but makes core dumps faster even on default settings. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:59:18 +02:00
Anatoly Burakov	e1589061cc	eal/bsd: concatenate adjacent memory segments Previously, memory allocator always left holes between mapped contigmem segments, even if they were IOVA-contiguous. Fix this by remembering last IOVA address and memseg index, and checking against those when mapping new contigmem segments. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:58:56 +02:00
Anatoly Burakov	953e6913c1	eal/bsd: fix memory segment index display Segment index was set to 0 at start but was never incremented. This has no consequences other than displayed number of segments allocated at initialization. Fix this by incrementing it after displaying. Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:58:26 +02:00
Dariusz Stojaczyk	6770a5f8a2	eal: fix return codes on control thread failure This function returned positive error numbers instead of negative ones as described in the doc. What's worse, multiple of its callers only check for (rc < 0) to detect failure. It was incorrectly assumed that pthread_create and pthread_setaffinity_np return negative errnos. They always return positive ones, so this patch negates their return values before returning. Fixes: `9e5afc72c9` ("eal: add function to create control threads") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com>	2018-07-13 00:27:15 +02:00
Dariusz Stojaczyk	82dcc8b4bc	eal: fix return codes on thread naming failure The doc says this function returns negative errno on error, but it currently returns either -1 or positive errno. It was incorrectly assumed that pthread_setname_np() returns negative error numbers. It always returns positive ones, so this patch negates its return value before returning. Fixes: `3901ed99c2` ("eal: fix thread naming on FreeBSD") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com>	2018-07-13 00:26:22 +02:00
Dariusz Stojaczyk	368a91d6bd	eal: ignore failure of naming a control thread The error is not fatal and we can physically continue creating the thread. It simply won't have a name. If rte_thread_setname() fails, we will just print a debug log now. EAL does the same for lcore threads. Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com>	2018-07-13 00:25:17 +02:00
Dariusz Stojaczyk	6c0fb7547b	mem: do not use --base-virtaddr in secondary processes Since secondary process' address space is highly dictated by the primary process' mappings, it doesn't make much sense to use base-virtaddr for secondary processes. This patch is intended to fix PCI resource mapping in secondary processes using the same base-virtaddr as their primary processes. PCI uses the end of the hugepage memory area to map all resources. [pci_find_max_end_va()] It works for primary processes, but can't be mapped 1:1 by secondary ones, as the same addresses are currently always occupied by shadow memseg lists, which were created with eal_get_virtual_area(NULL, ...). ``` PRIMARY PROCESS 0x6e00e00000 388K rw-s- fbarray_memseg-2048k-1-3 0x6e01000000 16777216K r---- [ anon ] 0x7201000000 16K rw-s- resource0 SECONDARY PROCESS 0x6e00e00000 388K rw-s- fbarray_memseg-2048k-1-3 0x6e01000000 16777216K r---- [ anon ] 0x7201000000 4K rw-s- fbarray_memseg-1048576k-0-0_203213 ``` Fixes: `524e43c2ad` ("mem: prepare memseg lists for multiprocess sync") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:25:13 +02:00
Dariusz Stojaczyk	9dac150f98	mem: fix alignment requested with --base-virtaddr Whenever a calculated base-virtaddr offset had to be manually aligned to requested page_sz, we did not take account of that alignment in incrementing the base-virtaddr offset further. The next requested virtual area could print a warning "hint [...] not respected!" and let the system pick an address instead. As a result, this breaks secondary process support on many system configurations. Fixes: `b7cc54187e` ("mem: move virtual area function in common directory") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:25:10 +02:00
Dariusz Stojaczyk	7fa7216ed4	mem: fix alignment of requested virtual areas Although the alignment mechanism works as intended, the `no_align` bool flag was set incorrectly. We were aligning buffers that didn't need extra alignment, and weren't aligning ones that really needed it. Fixes: `b7cc54187e` ("mem: move virtual area function in common directory") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:25:09 +02:00
Dariusz Stojaczyk	09037cf36c	mem: avoid crash on memseg query with invalid address When trying to use it with an address that's not managed by DPDK it would segfault due to a missing check. The doc says this function returns either a pointer or NULL, so let it do so. Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:25:08 +02:00
Dariusz Stojaczyk	0762c438b8	mem: do not unmap overlapping region on mmap failure This isn't documented in the manuals, but a failed mmap(..., MAP_FIXED) may still unmap overlapping regions. In such case, we need to remap these regions back into our address space to ensure mem contiguity. We do it unconditionally now on mmap failure just to be safe. Verified on Linux 4.9.0-4-amd64. I was getting ENOMEM when trying to map hugetlbfs with no space left, and the previous anonymous mapping was still being removed. Fixes: `582bed1e1d` ("mem: support mapping hugepages at runtime") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:25:07 +02:00
Dariusz Stojaczyk	637175ab95	mem: do not leave unmapped holes in EAL memory area EAL reserves a huge area in virtual address space to provide virtual address contiguity for e.g. future memory extensions (memory hotplug). During memory hotplug, if the hugepage mmap succeeds but doesn't satisfy EAL's requirements, the EAL would unmap this mapping straight away, leaving a hole in its virtual memory area and making it available to everyone. As EAL still thinks it owns the entire region, it may try to mmap it later with MAP_FIXED, possibly overriding a user's mapping that was made in the meantime. This patch ensures each hole is mapped back by EAL, so that it won't be available to anyone else. Fixes: `582bed1e1d` ("mem: support mapping hugepages at runtime") Cc: stable@dpdk.org Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-07-13 00:25:05 +02:00
Honnappa Nagarahalli	7c872b9698	hash: validate hash bucket entries while compiling Validate RTE_HASH_BUCKET_ENTRIES during compilation instead of run time. Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>	2018-07-12 12:43:10 +02:00
Gage Eads	e30dd31847	service: add mechanism for quiescing Existing service functions allow us to stop a service, but doing so doesn't guarantee that the service has finished running on a service core. This commit introduces rte_service_may_be_active(), which returns whether the service may be executing on one or more lcores currently, or definitely is not. The service core layer supports this function by setting a flag when a service core is going to execute a service, and unsetting the flag when the core is no longer able to run the service (its runstate becomes stopped or the lcore is no longer mapped). With this new function, applications can set a service's runstate to stopped, then poll rte_service_may_be_active() until it returns false. At that point, the service is quiesced. Signed-off-by: Gage Eads <gage.eads@intel.com> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>	2018-07-06 06:54:49 +02:00
Thomas Monjalon	f8e9989606	remove useless constructor headers A constructor is usually declared with RTE_INIT* macros. As it is a static function, no need to declare before its definition. The macro is used directly in the function definition. Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-07-12 00:00:35 +02:00
Erik Gabriel Carrillo	f28f3594de	service: add attribute API Add APIs that allow an application to query and reset the attributes of a service lcore. Add one such new attribute, "loops", which is a counter that tracks the number of times the service core has looped in the service runner function. This is useful to applications that desire a "liveness" check to make sure a service core is not stuck. Signed-off-by: Erik Gabriel Carrillo <erik.g.carrillo@intel.com> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>	2018-07-11 23:43:23 +02:00
Jananee Parthasarathy	b40ae2be73	cryptodev: remove debug compilation option For cryptodev dynamic logging, conditional compilation of debug logs is not actually required. Signed-off-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com> Reviewed-by: Reshma Pattan <reshma.pattan@intel.com> Reviewed-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>	2018-07-11 00:57:51 +02:00
Ferruh Yigit	64bd384be4	eal: do not enable static log macro for ethdev Static logging macro RTE_PMD_DEBUG_TRACE is enabled with a few DEBUG config options, including RTE_LIBRTE_ETHDEV_DEBUG. RTE_LIBRTE_ETHDEV_DEBUG is still used for data path logging, but all ethdev logging has switched to dynamic logging, so there is no need to enable the static logging macro for ethdev. Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com> Acked-by: Shahaf Shuler <shahafs@mellanox.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-07-03 01:35:58 +02:00
Anatoly Burakov	53fd532e39	ipc: fix locking while sending messages Previously, we were putting an exclusive lock to prevent secondary processes spinning up while we are sending our messages. However, using exclusive locks had an effect of disallowing multiple simultaneous unrelated messages/requests being sent, which was not the intention behind locking. Fix it to put a shared lock on the directory. That way, we still prevent secondary process initializations while sending data over IPC, but allow multiple unrelated transmissions to proceed. Fixes: `89f1fe7e6d` ("eal: lock IPC directory on init and send") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Qi Zhang <qi.z.zhang@intel.com>	2018-06-29 02:01:37 +02:00
David Marchand	0c41aab8e2	log: remove useless intermediate buffer Rather than copy the log message, we can use a precision in the format string given to syslog. Signed-off-by: David Marchand <david.marchand@6wind.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com>	2018-06-27 18:17:56 +02:00
Thomas Monjalon	b259da7b24	version: 18.08-rc0 Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-06-01 12:58:36 +02:00
Thomas Monjalon	a5dce55556	version: 18.05.0 Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-30 22:55:57 +02:00
David Marchand	32cb7aee94	mem: add missing newline in callback log Fixes: `56efb4c117` ("malloc: support callbacks on memory events") Signed-off-by: David Marchand <david.marchand@6wind.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-30 21:16:43 +02:00
Thomas Monjalon	830410b265	version: 18.05-rc6 Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-28 03:29:40 +02:00
Thomas Monjalon	cc9bedbba6	vfio: fix export of renamed symbols The functions - vfio_get_container_fd - vfio_get_group_fd - vfio_get_group_no have been renamed to - rte_vfio_get_container_fd - rte_vfio_get_group_fd - rte_vfio_get_group_num The old names are removed from the map file. Fixes: `964b2f3bfb` ("vfio: export some internal functions") Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-28 03:20:42 +02:00
Olivier Matz	d8cda718e1	eal: deprecate function to set default mbuf pool Deprecate rte_eal_mbuf_default_mempool_ops(), it shall be replaced by rte_mbuf_best_mempool_ops(). Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-28 02:31:35 +02:00
Anatoly Burakov	cc789005fc	memzone: clarify support for zero-length memzones Currently, memzone allocation with length set to 0 that are also IOVA-contiguous is not supported. Document this limitation. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-28 02:10:58 +02:00
Anatoly Burakov	0d0ad70175	mem: document callbacks not being supported in some cases Mem event and validator callbacks may not be supported under all circumstances (such as when running in legacy memory mode, or on FreeBSD), and this case needs to be handled by any code that will use these callbacks. Spell this out more clearly, because it's not immediately obvious that this is an expected use case. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-28 02:10:35 +02:00
Pablo de Lara	c18ceab36b	eal: convert dual-license to SPDX Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com> Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>	2018-05-25 10:31:50 +02:00
Hemant Agrawal	0764be65e5	eal: add missing SPDX identifiers Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com>	2018-05-25 10:30:44 +02:00
Thomas Monjalon	c9db09ff0e	version: 18.05-rc5 Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-23 01:44:00 +02:00
Andy Green	1587d36e22	eal: explicit cast in rwlock functions GCC 8.1 warned: In function 'rte_rwlock_read_lock': rte_rwlock.h:74:12: warning: conversion to 'uint32_t' {aka 'unsigned int'} from 'int32_t' {aka 'int'} may change the sign of the result [-Wsign-conversion] x, x + 1); ^ rte_rwlock.h:74:17: warning: conversion to 'uint32_t' {aka 'unsigned int'} from 'int' may change the sign of the result [-Wsign-conversion] x, x + 1); ~~^~~ In function 'rte_rwlock_write_lock': rte_rwlock.h:110:15: warning: unsigned conversion from 'int' to 'uint32_t' {aka 'unsigned int'} changes value from '-1' to '4294967295' [-Wsign-conversion] 0, -1); ^~ Again in this case we are making explicit the exact cast that was always happening implicitly. The patch does not change the generated code. The int32_t temp "x" is required to be signed to detect a < 0 error condition from the lock status. Afterwards, it has always been implicitly cast to uint32_t when it is used in the arguments to rte_atomic32_cmpset()... gcc8.1 objects to the implicit cast now and requires us to cast it explicitly. Fixes: `af75078fec` ("first public release") Cc: stable@dpdk.org Signed-off-by: Andy Green <andy@warmcat.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-05-22 16:27:01 +02:00
Andy Green	14035e5fad	eal/x86: fix type of variable in memcpy function GCC 8.1 warned: rte_memcpy.h:793:2: note: in expansion of macro 'MOVEUNALIGNED_LEFT47' MOVEUNALIGNED_LEFT47(dst, src, n, srcofs); ^~~~~~~~~~~~~~~~~~~~ rte_memcpy.h:649:51: warning: conversion from 'size_t' {aka 'long unsigned int'} to 'int' may change value [-Wconversion] case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; ^ rte_memcpy.h:616:15: note: in definition of macro 'MOVEUNALIGNED_LEFT47_IMM' tmp = len; ^~~ rte_memcpy.h:793:2: note: in expansion of macro 'MOVEUNALIGNED_LEFT47' MOVEUNALIGNED_LEFT47(dst, src, n, srcofs); ^~~~~~~~~~~~~~~~~~~~ rte_memcpy.h:618:13: warning: conversion to 'size_t' {aka 'long unsigned int'} from 'int' may change the sign of the result [-Wsign-conversion] tmp -= len; ^~ rte_memcpy.h:649:16: note: in expansion of macro 'MOVEUNALIGNED_LEFT47_IMM' case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; ^~~~~~~~~~~~~~~~~~~~~~~~ rte_memcpy.h:793:2: note: in expansion of macro 'MOVEUNALIGNED_LEFT47' MOVEUNALIGNED_LEFT47(dst, src, n, srcofs); ^~~~~~~~~~~~~~~~~~~~ rte_memcpy.h:618:13: warning: conversion to 'size_t' {aka 'long unsigned int'} from 'int' may change the sign of the result [-Wsign-conversion] tmp -= len; ^~ We can eliminate the problems by setting the type of tmp to size_t in the first place. Fixes: `d35cc1fe6a` ("eal/x86: revert select optimized memcpy at run-time") Cc: stable@dpdk.org Suggested-by: Bruce Richardson <bruce.richardson@intel.com> Signed-off-by: Andy Green <andy@warmcat.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-05-22 16:26:03 +02:00
Ferruh Yigit	6ff0f81d0e	log: fix pattern matching loglevel set wrong when ":" is used as separator, like --log-type="user:debug" This is because fnmatch returns zero on success. Fixed fnmatch return value check. Fixes: `7f0bb634a1` ("log: add ability to match log type with globbing") Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>	2018-05-21 15:49:27 +02:00
Adrien Mazarguil	97c228a0aa	eal: fix runtime directory permissions Executable bit must be set on directories for normal users to enter them. This patch addresses the inability to start DPDK applications as non-root due to errors such as: EAL: failed to bind /tmp/dpdk/rte/mp_socket: Permission denied Fixes: `56236363b4` ("eal: add directory for runtime data") Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-05-21 01:08:26 +02:00
Andy Green	c7bf809382	eal: explicit cast in constant byte swap GCC 8.1 warns: rte_byteorder.h: In function 'rte_constant_bswap16': rte_byteorder.h:54:45: warning: conversion from 'int' to 'uint16_t' {aka 'short unsigned int'} may change value [-Wconversion] ((((uint16_t)(v) & UINT16_C(0x00ff)) << 8) \| \ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~ (((uint16_t)(v) & UINT16_C(0xff00)) >> 8)) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ rte_byteorder.h:126:9: note: in expansion of macro 'RTE_STATIC_BSWAP16' return RTE_STATIC_BSWAP16(x); ^~~~~~~~~~~~~~~~~~ The other two sizes are going to be afflicted the same, so get the same fix. Fixes: `b75667ef9f` ("eal: add static endianness conversion macros") Cc: stable@dpdk.org Signed-off-by: Andy Green <andy@warmcat.com>	2018-05-21 00:20:05 +02:00
Andy Green	d3db77d7d8	eal: fix casts in random functions GCC 8.1 warns: In function 'rte_srand': rte_random.h:34:10: warning: conversion to 'long int' from 'long unsigned int' may change the sign of the result [-Wsign-conversion] srand48((long unsigned int)seedval); rte_random.h:51:8: warning: conversion to 'uint64_t' {aka 'long unsigned int'} from 'long int' may change the sign of the result [-Wsign-conversion] val = lrand48(); ^~~~~~~ rte_random.h:53:6: warning: conversion to 'long unsigned int' from 'long int' may change the sign of the result [-Wsign-conversion] val += lrand48(); Fixes: `af75078fec` ("first public release") Cc: stable@dpdk.org Signed-off-by: Andy Green <andy@warmcat.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-05-21 00:19:30 +02:00
Andy Green	622a7305d1	eal: explicit cast of strlcpy return GCC 8.1 warns: rte_string_fns.h: In function 'rte_strlcpy': rte_string_fns.h:58:9: warning: conversion to 'size_t' {aka 'long unsigned int'} from 'int' may change the sign of the result [-Wsign-conversion] return snprintf(dst, size, "%s", src); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: `5364de644a` ("eal: support strlcpy function") Signed-off-by: Andy Green <andy@warmcat.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-05-21 00:19:08 +02:00
Thomas Monjalon	08b7521c35	version: 18.05-rc4 Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-15 22:38:39 +02:00
Anatoly Burakov	3f697d2ee5	eal: move runtime directory creation after args parsing The intention of the original code was to create runtime data directory as early as possible, however it was moved too early, before the arguments were parsed, resulting in --file-prefix option essentially not working. Fix this by moving eal_create_runtime_dir() to after command line arguments parsing. Fixes: `56236363b4` ("eal: add directory for runtime data") Reported-by: Andrew Rybchenko <arybchenko@solarflare.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>	2018-05-15 15:22:40 +02:00
Ferruh Yigit	ff75dd7d65	version: 18.05-rc3 Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>	2018-05-15 00:03:42 +01:00
Anatoly Burakov	5b18d86dec	eal: move runtime data into dedicated directory Fix all calls to functions in eal_filesystem to produce paths residing inside dedicated DPDK runtime directory. Leaving DPDK runtime config in place as 3rd-party applications within the DPDK ecosystem might rely on this path to determine whether DPDK is running, so moving that will be postponed to the next release cycle. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-15 00:35:12 +02:00
Anatoly Burakov	56236363b4	eal: add directory for runtime data Currently, during runtime, DPDK will store a bunch of files here and there (in /var/run, /tmp or in $HOME). Fix it by creating a DPDK-specific runtime directory, under which all runtime data will be placed. The template for creating this runtime directory is the following: <base path>/dpdk/<DPDK prefix>/ Where <base path> is set to either "/var/run" if run as root, or $XDG_RUNTIME_DIR if run as non-root, with a fallback to /tmp if $XDG_RUNTIME_DIR is not defined. So, for example, if run as root, by default all runtime data will be stored at /var/run/dpdk/rte/. There is no equivalent of "mkdir -p", so we will be creating the path step by step. Nothing uses this new path yet, changes for that will come in next commit. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>	2018-05-15 00:35:08 +02:00
Anatoly Burakov	a2a2e499e5	mem: rename function returning hugepage data path The original name for this path was confusing and not very descriptive. Rename it to a more appropriate and descriptive name: it stores data about hugepages, so name it eal_hugepage_data_path(). Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>	2018-05-15 00:35:02 +02:00
Anatoly Burakov	dcbfbe3c80	eal: remove unused path pattern The define was a leftover from IVSHMEM library. Fixes: `c711ccb309` ("ivshmem: remove library and its EAL integration") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: David Marchand <david.marchand@6wind.com>	2018-05-15 00:34:58 +02:00
Ferruh Yigit	04db1d0da7	lib: clear experimental version tag in linker scripts Remove version tag from experimental block in linker version scripts (.map files). That label is not used by the linker and is informational only. It is useful for version blocks, but for the experimental block it is not useful and is confusing, so remove those labels. Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com> Acked-by: Neil Horman <nhorman@tuxdriver.com>	2018-05-14 03:37:28 +02:00
Anatoly Burakov	f90e4fcc13	ipc: fix duplicate string copy in async request Coverity issue: 272582 Fixes: `2147c09505` ("ipc: clean up code") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>	2018-05-14 03:19:57 +02:00
Anatoly Burakov	cd3da7cb0e	mem: fix unmapping and marking segments as free Currently, page deallocation might fail if allocator cannot get page fd, which will leave VA space still mapped, and will also not mark page as free. Fix page deallocation function to always unmap space before trying to get rid of the page itself, and always mark page as free even if page deallocation failed. Fixes: `a5ff05d60f` ("mem: support unmapping pages at runtime") Fixes: `1a7dc2252f` ("mem: revert to using flock and add per-segment lockfiles") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>	2018-05-14 03:17:48 +02:00
Anatoly Burakov	3d2d9861a6	mem: fix return code of freeing segment on failure Return value should be zero for success, but if unlock and unlink have succeeded, return value was 1, which triggered failure message in calling code. Fixes: `a5ff05d60f` ("mem: support unmapping pages at runtime") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>	2018-05-14 03:15:36 +02:00
Anatoly Burakov	b3b1b83bad	mem: fix index for unmapping segments on failure Segment index was calculated incorrectly, causing free_seg to attempt to free segments that do not exist. Fixes: `a5ff05d60f` ("mem: support unmapping pages at runtime") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Yong Liu <yong.liu@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>	2018-05-14 03:15:33 +02:00
Anatoly Burakov	d2bd796d7b	mem: fix potential underflow on mem size calculation If total memory is already bigger than max memory, an underflow will occur on subtraction. Fix it by simply stopping whenever we already have amount of memory that is bigger than maximum. Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-14 03:15:31 +02:00
Anatoly Burakov	7785c5588d	memzone: document reserving zero-length memzones Currently, reserving a memzone with length set to 0 will not trigger any memory allocations, and memzone will instead be looking through already allocated memory only. Document this limitation. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-14 02:10:36 +02:00
Anatoly Burakov	a82377cab8	memzone: fix size on reserving biggest memzone Size of malloc heap elements include overhead, which should not be counted as part of memzone. Fixes: `fafcc11985` ("mem: rework memzone to be allocated by malloc") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>	2018-05-14 02:10:36 +02:00
Anatoly Burakov	3ff39e25e6	memzone: fix race condition on alloc failure Deallocation used the wrong function, which could have resulted in race conditions because the function does not use locks internally. Fixes: `1403f87d4f` ("malloc: enable memory hotplug support") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>	2018-05-14 01:37:47 +02:00
Anatoly Burakov	68c3603867	mem: unmap unneeded space When we ask to reserve virtual areas, we usually include alignment in the mapping size, and that memory ends up being wasted. Wasting a gigabyte of VA space while trying to reserve one gigabyte is pretty expensive on 32-bit, so after we're done mapping, unmap unneeded space. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-05-14 01:32:38 +02:00
Anatoly Burakov	91fe57ac00	mem: check if allocation size is too big Mapping size is a 64-bit integer, but mmap() will accept size_t for size mappings. A user could request a mapping with an alignment, which would have overflown size_t, so check if (size + alignment) will overflow size_t. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-05-14 01:32:21 +02:00
Thomas Monjalon	c9d034d873	mem: fix typo in local function name Fixes: `582bed1e1d` ("mem: support mapping hugepages at runtime") Signed-off-by: Thomas Monjalon <thomas@monjalon.net> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-14 01:32:14 +02:00
Thomas Monjalon	2b1c4388dd	eal: fix typo in doc of pointer offset macro Fixes: `af75078fec` ("first public release") Cc: stable@dpdk.org Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-14 01:32:09 +02:00
Ivan Malov	33b3181791	eal: fix mempool ops name parsing The code aimed to pick and remember the value of mempool ops name from EAL command line arguments does not copy the string and remembers the pointer provided by getopt_long() directly. The latter could be clobbered later and result in reading wrong mbuf pool ops name by rte_mempool library. Typically, this flaw could be avoided by using strdup() to remember the string value of the option. Fixes: `a103a97e71` ("eal: allow user to override default mempool driver") Cc: stable@dpdk.org Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru> Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com> Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>	2018-05-14 01:32:07 +02:00
Andy Green	f7f18e92a5	spinlock/x86: move stack declaration before code In function 'rte_try_tm': rte_spinlock.h:82:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] int retries = RTE_RTM_MAX_RETRIES; Fixes: `ba7468997e` ("spinlock: add HTM lock elision for x86") Cc: stable@dpdk.org Signed-off-by: Andy Green <andy@warmcat.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>	2018-05-13 22:45:21 +02:00
Andy Green	e3908132b7	eal: declare trace buffer at top of own block rte_dev.h:54:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] char buffer[vsnprintf(NULL, 0, fmt, ap) + 1]; Fixes: `b974e4a40c` ("ethdev: make error checking macros public") Cc: stable@dpdk.org Signed-off-by: Andy Green <andy@warmcat.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>	2018-05-13 22:45:16 +02:00
Andy Green	ef3c7b50ff	eal: explicit cast of core id when getting index rte_lcore.h: In function 'rte_lcore_index': rte_lcore.h:122:14: warning: conversion to 'int' from 'unsigned int' may change the sign of the result [-Wsign-conversion] lcore_id = rte_lcore_id(); Fixes: `5583037a79` ("eal: get relative core index") Cc: stable@dpdk.org Signed-off-by: Andy Green <andy@warmcat.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>	2018-05-13 22:45:12 +02:00
Andy Green	54a93341cc	eal: explicit cast of builtin for bsf32 rte_common.h:416:9: warning: conversion to 'uint32_t' {aka 'unsigned int'} from 'int' may change the sign of the result [-Wsign-conversion] return __builtin_ctz(v); ^~~~~~~~~~~~~~~~ The builtin is defined to return int, but we want to return it as uint32_t. Its only defined valid return values are positive integers or zero, which is OK for uint32_t. So just add an explicit cast. Fixes: `03f6bced5b` ("eal: use intrinsic function") Cc: stable@dpdk.org Signed-off-by: Andy Green <andy@warmcat.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>	2018-05-13 22:45:05 +02:00
Anatoly Burakov	0256386dc4	mem: add argument to memory event callback It may be useful to pass arbitrary data to the callback (such as device pointers), so add this to the mem event callback API. Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-05-08 22:28:58 +02:00
Thomas Monjalon	7baac77594	version: 18.05-rc2 Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-05-02 23:12:16 +02:00
Konstantin Ananyev	51c7de38e2	eal/x86: fix atomic exchange for 32-bit Should break out of loop when rte_atomic64_cmpset() returns non-zero. Fixes: `ff2863570f` ("eal: introduce atomic exchange operation") Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com> Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>	2018-05-02 19:23:06 +02:00
Anatoly Burakov	0db6d2782c	malloc: avoid padding elements on page deallocation Currently, when deallocating pages, malloc will fixup other elements' headers if there is not enough space to store a full element in leftover space. This leads to race conditions because there are some functions that check for pad size with an unlocked heap, expecting pad size to be constant. Fix it by being more conservative and only freeing pages when there is enough space before and after the page to store a free element. Fixes: `1403f87d4f` ("malloc: enable memory hotplug support") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-02 18:35:19 +02:00
Anatoly Burakov	dc14d4f026	malloc: set pad to 0 on free The pad value is not used unless element is in pad state, but it will show up in heap dumps and may be confusing. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-05-02 18:35:19 +02:00
Jianfeng Tan	3a0d465d4c	eal: fix use-after-free on control thread creation After the commit below, we encountered some strange issues: 1) A deadlock, as described here: http://dpdk.org/ml/archives/dev/2018-April/099806.html 2) A SIGSEGV when starting testpmd in a VM. Considering that the commit below changed the barrier to use dynamic memory instead of the stack, we suspect the issues are caused by a use-after-free. Fixes: `3d09a6e26d` ("eal: fix threads block on barrier") Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reported-by: Lei Yao <lei.a.yao@intel.com> Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Suggested-by: Olivier Matz <olivier.matz@6wind.com> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com>	2018-05-02 17:23:37 +02:00
Jianfeng Tan	e87923a9be	eal: fix memory leak on control thread failure params is not freed if pthread_create() fails. The fix is straight-forward. Fixes: `3d09a6e26d` ("eal: fix threads block on barrier") Reported-by: Olivier Matz <olivier.matz@6wind.com> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Reviewed-by: Olivier Matz <olivier.matz@6wind.com>	2018-05-02 17:15:02 +02:00
Anatoly Burakov	3eb9af3416	malloc: fix heap size not set on init When heap initializes, we need to add already allocated segments onto the heap. However, in doing that, we never increased total heap size. Fix it by adding segment length to total heap length when initializing the heap. Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-30 15:33:49 +02:00
Anatoly Burakov	eb8d29f825	mem/linux: fix hugedir write deadlock At hugepage info initialization, EAL takes out a write lock on hugetlbfs directories, and drops it after the memory init is finished. However, in non-legacy mode, if "-m" or "--socket-mem" switches are passed, this leads to a deadlock because EAL tries to allocate pages (and thus take out a write lock on hugedir) while still holding a separate hugedir write lock in EAL. Fix it by checking if write lock in hugepage info is active, and not trying to lock the directory if the hugedir fd is valid. Fixes: `1a7dc2252f` ("mem: revert to using flock and add per-segment lockfiles") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Shahaf Shuler <shahafs@mellanox.com> Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>	2018-04-30 15:23:17 +02:00
Thomas Monjalon	fcde84b5f8	version: 18.05-rc1 Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-28 00:26:04 +02:00
Anatoly Burakov	1a7dc2252f	mem: revert to using flock and add per-segment lockfiles The original implementation used flock() locks, but was later switched to using fcntl() locks for page locking, because fcntl() locks allow locking parts of a file, which is useful for single-file segments mode, where locking the entire file isn't as useful because we still need to grow and shrink it. However, according to fcntl()'s Ubuntu manpage [1], semantics of fcntl() locks have a giant oversight: This interface follows the completely stupid semantics of System V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all locks associated with a file for a given process are removed when any file descriptor for that file is closed by that process. This semantic means that applications must be aware of any files that a subroutine library may access. Basically, closing any fd with an fcntl() lock (which we do because we don't want to leak fd's) will drop the lock completely. So, in this commit, we will be reverting back to using flock() locks everywhere. However, that still leaves the problem of locking parts of a memseg list file in single file segments mode, and we will be solving it with creating separate lock files per each page, and tracking those with flock(). We will also be removing all of this tailq business and replacing it with a simple array - saving a few bytes is not worth the extra hassle of dealing with pointers and potential memory allocation failures. Also, remove the tailq lock since it is not needed - these fd lists are per-process, and within a given process, it is always only one thread handling access to hugetlbfs. So, first one to allocate a segment will create a lockfile, and put a shared lock on it. When we're shrinking the page file, we will be trying to take out a write lock on that lockfile, which would fail if any other process is holding onto the lockfile as well. This way, we can know if we can shrink the segment file. 
Also, if no other locks are found in the lock list for a given memseg list, the memseg list fd is automatically closed. One other thing to note is, according to flock() Ubuntu manpage [2], upgrading the lock from shared to exclusive is implemented by dropping and reacquiring the lock, which is not atomic and thus would have created race conditions. So, on attempting to perform operations in hugetlbfs, we will take out a writelock on hugetlbfs directory, so that only one process could perform hugetlbfs operations concurrently. [1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html [2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Fixes: `582bed1e1d` ("mem: support mapping hugepages at runtime") Fixes: `a5ff05d60f` ("mem: support unmapping pages at runtime") Fixes: `2a04139f66` ("eal: add single file segments option") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	046aa5c447	mem: add memalloc init stage Currently, memseg lists for secondary process are allocated on sync (triggered by init), when they are accessed for the first time. Move this initialization to a separate init stage for memalloc. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	1be7644986	mem: improve autodetection of hugepage counts on 32-bit For non-legacy mode, we are preallocating space for hugepages, so we know in advance which pages we will be able to allocate, and which we won't. However, the init procedure was using hugepage counts gathered from sysfs and paid no attention to hugepage sizes that were actually available for reservation, and failed on attempts to reserve unavailable pages. Fix this by limiting total page counts by number of pages actually preallocated. Also, VA preallocate procedure only looks at mountpoints that are available, and expects pages to exist if a mountpoint exists. That might not necessarily be the case, so also check if there are hugepages available for a particular page size on a particular NUMA node. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	e82ca1a75e	mem: improve preallocation on 32-bit Previously, if we couldn't preallocate VA space on 32-bit for one page size, we simply bailed out, even though we could've tried allocating VA space with other page sizes. For example, if user had both 1G and 2M pages enabled, and has asked DPDK to allocate memory on both sockets, DPDK would've tried to allocate VA space for 1x1G page on both sockets, failed and never tried again, even though it could've allocated the same 1G of VA space for 512x2M pages. Fix this by retrying with different page sizes if VA space reservation failed. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	a99e8df63f	mem: fix 32-bit memory upper limit for non-legacy mode 32-bit mode has an upper limit on amount of VA space it can preallocate, but the original implementation used the wrong constant, resulting in failure to initialize due to integer overflow. Fix it by using the correct constant. Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	64b6fcb161	malloc: check for heap corruption Previous code checked for both first/last elements being NULL, but if they weren't, the expectation was that they're both non-NULL, which will be the case under normal conditions, but may not be the case due to heap structure corruption. Coverity issue: 272566 Fixes: `bb372060da` ("malloc: make heap a doubly-linked list") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	0af8db3172	malloc: fix out-of-bounds segment array access Technically, while the pointer would've been invalid if msl_idx were invalid, we wouldn't have actually attempted to access the pointer until verifying the index. Fix it by moving array access to after we've verified validity of the index. Coverity issue: 272574 Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	627e80e4f6	malloc: replace snprintf with strlcpy Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>	2018-04-27 23:52:51 +02:00
Anatoly Burakov	8f91f368a1	mem: log page address before unmapping If user has specified a flag to unmap the area right after mapping it, we were passing an already-unmapped pointer to RTE_LOG. This is not an issue since RTE_LOG doesn't actually dereference the pointer, but fix it anyway by moving call to RTE_LOG to before unmap. Coverity issue: 272584 Fixes: `b7cc54187e` ("mem: move virtual area function in common directory") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	0f1631be24	mem: fix page fault trigger Coverity reports these lines as having no effect. Technically, we do want for those lines to have no effect, however they would've likely been optimized out. Add volatile qualifiers to ensure the code has effects. Coverity issue: 272608 Fixes: `582bed1e1d` ("mem: support mapping hugepages at runtime") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	e27ffec169	mem: fix potential bad unmap on map failure Previously, if mmap failed to map page address at requested address, we were attempting to unmap the wrong address. Fix it by unmapping our actual mapped address, and jump further to avoid unmapping memory that is not allocated. Coverity issue: 272602 Fixes: `582bed1e1d` ("mem: support mapping hugepages at runtime") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	8ee25c7e81	mem: fix comparison of old policy Previous code had an old rebase leftover from the time when oldpolicy was an actual int, instead of a pointer. Fix it to do comparison with dereferencing the pointer. Coverity issue: 272589 Fixes: `582bed1e1d` ("mem: support mapping hugepages at runtime") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	8dfb09dee4	mem: fix potential resource leak on alloc Normally, tailq entry should have a valid fd by the time we attempt to map the segment. However, in case it doesn't, we're leaking fd, so fix it. Coverity issue: 272570 Fixes: `2a04139f66` ("eal: add single file segments option") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	b48f859a03	mem: fix potential resource leak on freeing We close fd if we managed to find it in the list of allocated segment lists (which should always be the case under normal conditions), but if we didn't, the fd was leaking. Close it if we couldn't find it in the segment list. This is not an issue as if the segment is zero length, we're getting rid of it anyway, so there's no harm in not storing the fd anywhere. Coverity issue: 272568 Fixes: `2a04139f66` ("eal: add single file segments option") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	6f0fa9f238	mem: fix potential double close on map failure We were closing descriptor before checking if mapping has failed, but if it did, we did a second close afterwards. Fix it by moving closing descriptor to after we've done all error checks. Coverity issue: 272560 Fixes: `2a04139f66` ("eal: add single file segments option") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	5441fcfd87	mem: fix resource leak on map failure Coverity issue: 272601 Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:42:40 +02:00
Anatoly Burakov	42c2a6a819	mem: use strlcpy instead of snprintf Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-27 23:42:40 +02:00
Jianfeng Tan	5e6df556f6	mem: fix resize return handling for --single-file-segments resize_hugefile() returns either 0 (which indicates success) or -1 (which indicates failure). We failed to check for success when the --single-file-segments option is used. Fixes: `2a04139f66` ("eal: add single file segments option") Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-27 23:42:40 +02:00
Jianfeng Tan	3d09a6e26d	eal: fix threads block on barrier Below commit introduced pthread barrier for synchronization. But two IPC threads block on the barrier, and never wake up. (gdb) bt #0 futex_wait (private=0, expected=0, futex_word=0x7fffffffcff4) at ../sysdeps/unix/sysv/linux/futex-internal.h:61 #1 futex_wait_simple (private=0, expected=0, futex_word=0x7fffffffcff4) at ../sysdeps/nptl/futex-internal.h:135 #2 __pthread_barrier_wait (barrier=0x7fffffffcff0) at pthread_barrier_wait.c:184 #3 rte_thread_init (arg=0x7fffffffcfe0) at ../dpdk/lib/librte_eal/common/eal_common_thread.c:160 #4 start_thread (arg=0x7ffff6ecf700) at pthread_create.c:333 #5 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 Through analysis, we find the barrier defined on the stack could be the root cause. This patch will change to use heap memory as the barrier. Fixes: `d651ee4919` ("eal: set affinity for control threads") Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>	2018-04-27 21:47:43 +02:00
Xiao Wang	ea2dc10668	vfio: add multi container support This patch adds APIs to support container create/destroy and device bind/unbind with a container. It also provides an API for IOMMU programming on a specified container. A driver could use the "rte_vfio_container_create" helper to create a new container from EAL, and use "rte_vfio_container_group_bind" to bind a device to the newly created container. During rte_vfio_setup_device the container bound with the device will be used for IOMMU setup. Signed-off-by: Junjie Chen <junjie.j.chen@intel.com> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-27 15:54:55 +01:00
Xiao Wang	340b7bb8d5	vfio: extend data structure for multi container Currently the EAL vfio framework binds the vfio group fd to the default container fd during rte_vfio_setup_device, while in some cases, e.g. vDPA (vhost data path acceleration), we want to put the vfio group into a separate container and program the IOMMU via this container. This patch extends the vfio_config structure to contain per-container user_mem_maps and defines an array of vfio_config. The next patch will build on this to add the container API. Signed-off-by: Junjie Chen <junjie.j.chen@intel.com> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-27 15:54:55 +01:00
Thomas Monjalon	a5c9b9278c	eal: fix build on FreeBSD The auxiliary vector read is implemented only for Linux. It could be done with procstat_getauxv() for FreeBSD. Since the commit below, the auxiliary vector functions are compiled for every architectures, including x86 which is tested with FreeBSD. This patch is moving the Linux implementation in Linux directory, and adding a fake/empty implementation for FreeBSD. Fixes: `2ed9bf3307` ("eal: abstract away the auxiliary vector") Signed-off-by: Thomas Monjalon <thomas@monjalon.net> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-27 11:13:59 +02:00
Thomas Monjalon	8ddd6a90ea	eal: fix build with glibc < 2.16 The fake getauxval function does not use its parameter. So the compiler raised this error: lib/librte_eal/common/eal_common_cpuflags.c:25:25: error: unused parameter 'type' Fixes: `2ed9bf3307` ("eal: abstract away the auxiliary vector") Signed-off-by: Thomas Monjalon <thomas@monjalon.net> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-27 11:12:53 +02:00
Stephen Hemminger	06bfd1cbb4	eal: shut up warning about master lcore This message looks suspicious and is seen on a healthy testpmd: EAL: WARNING: Master core has no memory on local socket! The message is wrong: the master lcore is 0 and its socket is 0 and there are multiple available memory segments on socket 0. At that point in the startup process, the count value is zero, meaning they are not used yet, so check_socket gets confused. Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-26 17:40:21 +02:00
Anatoly Burakov	b0a1502a27	eal: make semantics of lcore role function more intuitive rte_lcore_has_role() returns 0 if role of lcore matches requested role. The return value of the API is confusing, and this is a known problem with a deprecation notice announcing the change to more intuitive semantics: Commit `064518f68d` ("doc: announce EAL API change to lcore role function") Implement changes announced in the deprecation notice, and remove it. Also, fix usages of this API to reflect the change. Control thread patches expected new behavior and were broken before, now they are fixed as well. Fixes: `d651ee4919` ("eal: set affinity for control threads") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Erik Gabriel Carrillo <erik.g.carrillo@intel.com> Signed-off-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-26 16:58:18 +02:00
Harry van Haaren	60df571197	service: remove experimental tags This commit removes the experimental tags from the service cores functions, they now become part of the main DPDK API/ABI. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-25 14:57:37 +02:00
Anatoly Burakov	4754ceaa09	eal/linux: remove useless unlock of hugepage when clearing Coverity was complaining about not checking result of call to fcntl() for unlocking the file. Disregarding the fact that error value returned from fcntl() unlock call is highly unlikely in the first place, we are subsequently calling close() on that same fd, which will drop the lock, which makes call to fcntl() unnecessary. Fix this by removing a call to fcntl() altogether. Coverity issue: 272607 Fixes: `66cc45e293` ("mem: replace memseg with memseg lists") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-25 12:41:55 +02:00
Stephen Hemminger	7f0bb634a1	log: add ability to match log type with globbing Regular expressions are not the best way to match a hierarchical pattern like dynamic log levels. And the separator for dynamic log levels is period which is the regex wildcard character. A better solution is to use filename matching 'globbing' so that log levels match like file paths. For compatibility, use colon to separate pattern match style arguments. For example: --log-level 'pmd.net.virtio.*:debug' This also makes the documentation match what really happens internally. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2018-04-25 12:14:37 +02:00
Stephen Hemminger	fa20768905	eal: make log level save private We don't want format of eal log level saved values to be visible in ABI. Move to private storage in eal_common_log. Includes minor optimization. Compile the regular expression for each log match once, rather than each time it is used. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2018-04-25 12:12:19 +02:00
Stephen Hemminger	5690d37ddf	eal: allow symbolic log levels Much easier to remember names than numbers. Allows --log-level=pmd.net.ixgbe.*,debug Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2018-04-25 12:11:47 +02:00
Stephen Hemminger	34a5c2db56	eal: make syslog facility table const The mapping for facility name to value can be const. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2018-04-25 12:10:52 +02:00
Aaron Conole	2ed9bf3307	eal: abstract away the auxiliary vector Rather than attempting to load the contents of the auxv directly, prefer to use an exposed API - and if that doesn't exist then attempt to load the vector. This is because on some systems, when a user is downgraded, the /proc/self/auxv file retains the old ownership and permissions. The original method of /proc/self/auxv is retained. This also removes a potential abort() in the code when compiled with NDEBUG. A quick parse of the code shows that many (if not all) of the CPU flag parsing isn't used internally, so it should be okay. Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>	2018-04-25 04:29:00 +02:00
Gaetan Rivet	8b50041c0b	eal: add last init priority Add the priority RTE_PRIORITY_LAST, used for initialization routines meant to be run after all other constructors. This priority becomes the default priority for all DPDK constructors. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>	2018-04-25 04:18:11 +02:00
Gaetan Rivet	f779053ab3	eal: list acceptable init priorities Build a central list to quickly see each priority used for constructors, allowing verification that they are both above 100 and in the proper order. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>	2018-04-25 04:18:09 +02:00
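The constructor priorities discussed in 8b50041c0b and f779053ab3 rely on the GCC/clang `constructor(priority)` attribute, where lower numbers run first and values up to 100 are reserved for the implementation. A minimal sketch follows; the `PRIO_*` names and their values are assumptions for illustration, not the EAL definitions.

```c
/* Hypothetical priority values; they must be above 100. */
#define PRIO_LOG 101
#define PRIO_BUS 110
#define PRIO_LAST 65535

/* Records the order in which the constructors below actually ran. */
static int init_order[3];
static int init_count;

/* Declared first, but runs last because it has the highest priority number. */
__attribute__((constructor(PRIO_LAST)))
static void
init_last(void)
{
	init_order[init_count++] = PRIO_LAST;
}

__attribute__((constructor(PRIO_BUS)))
static void
init_bus(void)
{
	init_order[init_count++] = PRIO_BUS;
}

__attribute__((constructor(PRIO_LOG)))
static void
init_log(void)
{
	init_order[init_count++] = PRIO_LOG;
}
```

By the time main() starts, init_order should hold the priorities in ascending order regardless of declaration order.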
Gaetan Rivet	b65ecf1993	devargs: rename legacy API The previous symbols were deprecated for two releases. They are now marked as such and cannot be used anymore. They are replaced by ones respecting the new namespace that are marked experimental. As a result, eth_dev attach and detach are slightly reworked to follow the changes. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-25 04:00:37 +02:00
Gaetan Rivet	8e6c3b795e	devargs: use proper namespace prefix rte_eal_devargs is useless, rte_devargs is sufficient. Only experimental functions are changed for now. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-25 04:00:22 +02:00
Gaetan Rivet	b629ab790c	devargs: update syntax documentation Device syntax documentation is out of date. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-25 03:58:49 +02:00
Gaetan Rivet	9e6b5ea992	devargs: make parsing variadic rte_eal_devargs_parse can be used by EAL subsystems, drivers, applications alike. Device parameters may be presented with different structure each time; as a single declaration string or several strings each describing different parts of the declaration. To simplify the use of this parsing facility, its parameters are made variadic. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-25 03:58:45 +02:00
Gaetan Rivet	c7b424c03d	devargs: make devargs list private Initially, rte_devargs was meant to be populated once and sometimes accessed, then never emptied. With the new hotplug functionality having better standing, new usage appeared with repeated addition of devices and their subsequent removal. Exposing devargs_list pushed bus drivers and libraries to be careless and inconsistent in their memory management. Making it private will allow to rationalize this part of the EAL and ensure that fewer memory leaks occur during operations. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-25 03:58:24 +02:00
Gaetan Rivet	e53e0fe0c2	devargs: introduce iterator In preparation to making devargs_list private. Bus drivers generally need to access rte_devargs pertaining to their operations. This match is a common operation for bus drivers. Add a new accessor for the rte_devargs list. Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2018-04-25 03:57:51 +02:00
Olivier Matz	d651ee4919	eal: set affinity for control threads The management threads must not bother the dataplane or service cores. Set the affinity of these threads accordingly. Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-25 00:51:31 +02:00
Olivier Matz	6383d2642b	eal: set name when creating a control thread To avoid code duplication, add a parameter to rte_ctrl_thread_create() to specify the name of the thread. This requires adding a wrapper for the thread start routine in rte_thread_init(), which first waits until the thread is configured. Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-25 00:51:31 +02:00
Olivier Matz	9e5afc72c9	eal: add function to create control threads Many parts of dpdk use their own management threads. Introduce a new wrapper for thread creation that will be extended in next commits to set the name and affinity. To be consistent with other DPDK APIs, the return value is negative in case of error, which was not the case for pthread_create(). Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-25 00:51:31 +02:00
Olivier Matz	dec7b1884a	use sizeof to avoid double use of a length define Only a cosmetic change: the *_LEN defines are already used when defining the buffer. Using sizeof() ensures that the length stays consistent, even if the definition is modified. Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-25 00:51:31 +02:00
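The sizeof() pattern from dec7b1884a can be sketched as below. The struct, the `WORKER_NAME_LEN` define and the helper name are hypothetical and only serve to show that the length define is used once, at the buffer definition.

```c
#include <stdio.h>

#define WORKER_NAME_LEN 16 /* hypothetical length define */

struct worker {
	char name[WORKER_NAME_LEN]; /* the only place the define is used */
};

/* Hypothetical helper: format a worker name into the fixed-size buffer. */
static int
worker_set_name(struct worker *w, unsigned int id)
{
	/* sizeof() tracks the array size even if the definition above changes. */
	return snprintf(w->name, sizeof(w->name), "worker-%u", id);
}
```

Note that sizeof() only gives the buffer size when applied to the array itself, not to a pointer to it.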
Jianfeng Tan	79967252c3	eal: bring forward multi-process channel init Adjust the init sequence: put mp channel init before bus scan so that we can init the vdev bus through mp channel in the secondary process before the bus scan. Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>	2018-04-24 12:31:26 +02:00
Jianfeng Tan	b8c835909e	ipc: fix timeout handling in async In original implementation, timeout event for an async request will be ignored. As a result, an async request will never trigger the action if it cannot receive any reply any more. We fix this by counting timeout as a processed reply. Fixes: `f05e26051c` ("eal: add IPC asynchronous request") Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-23 22:45:05 +02:00
Jianfeng Tan	2147c09505	ipc: clean up code Following below commit, we change some internal function and variable names: commit `ce3a731235` ("eal: rename IPC request as synchronous one") Also use calloc to supersede malloc + memset for code clean up. Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-23 22:44:26 +02:00
Anatoly Burakov	441d676777	ipc: fix resource leak in init failure Coverity issue: 272609 Fixes: `f05e26051c` ("eal: add IPC asynchronous request") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-23 22:44:25 +02:00
Anatoly Burakov	dd7b7f9a52	ipc: fix return without mutex unlock gettimeofday() returning a negative value is highly unlikely, but if it ever happens, we will exit without unlocking the mutex. Arguably at that point we'll have bigger problems, but fix this issue anyway. Coverity issue: 272595 Fixes: `f05e26051c` ("eal: add IPC asynchronous request") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-23 22:44:24 +02:00
Anatoly Burakov	505721e170	ipc: use strlcpy where applicable This also silences (or should silence) a few Coverity false positives where we used strcpy before (Coverity complained about not checking buffer size, but source buffers were always known to be sized correctly). Coverity issue: 260407, 272565, 272582 Fixes: `bacaa27540` ("eal: add channel for multi-process communication") Fixes: `f05e26051c` ("eal: add IPC asynchronous request") Fixes: `783b6e5497` ("eal: add synchronous multi-process communication") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-23 22:44:23 +02:00
Anatoly Burakov	7508be4cce	fbarray: check sysconf failure sysconf() may return a negative value, check for it. Coverity issue: 272586 Fixes: `c44d09811b` ("eal: add shared indexed file-backed array") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-04-23 22:44:22 +02:00
Anatoly Burakov	f9a4f1b462	fbarray: fix potential null-dereference We get pointer to mask before we check if fbarray is NULL. Fix by getting the mask pointer only after the NULL check. Coverity issue: 272579 Fixes: `c44d09811b` ("eal: add shared indexed file-backed array") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-04-23 22:44:21 +02:00
Anatoly Burakov	2bcbc4d12c	fbarray: check for open failure Coverity issue: 272564 Fixes: `c44d09811b` ("eal: add shared indexed file-backed array") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-04-23 22:44:21 +02:00
Anatoly Burakov	9d3ba1e0ad	fbarray: use strlcpy instead of snprintf Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>	2018-04-23 22:44:20 +02:00
Anatoly Burakov	2c8663f9d0	fbarray: make all fbarrays hidden files fbarray stores its data in a shared file, which is not hidden. This leads to polluting user's HOME directory with visible files when running DPDK as non-root. Change fbarray to always create hidden files by default. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-23 22:44:17 +02:00
Xiao Wang	b3a022b17c	vfio: fix boundary check in region search A previously mapped region is skipped during the search, leading to DMA unmap failures. This patch fixes it and rewords the comment. Fixes: `73a6390859` ("vfio: allow to map other memory regions") Signed-off-by: Xiao Wang <xiao.w.wang@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-23 21:24:22 +02:00
Thomas Monjalon	91c6de7eb7	eal/linux: use strlcpy in uevent parsing Support of strlcpy has recently been added to DPDK. This replacement has been generated by the coccinelle script: devtools/cocci.sh devtools/cocci/strlcpy.cocci Fixes: `0d0f478d04` ("eal/linux: add uevent parse and process") Signed-off-by: Thomas Monjalon <thomas@monjalon.net> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-23 16:23:15 +02:00
Yangchao Zhou	fb338b80e5	mem: fix leaks of hugedir and replace snprintf The hugedir returned by get_hugepage_dir is allocated by strdup but not released. Replace snprintf with a more suitable strlcpy. Coverity issue: 272585 Fixes: `cb97d93e9d` ("mem: share hugepage info primary and secondary") Signed-off-by: Yangchao Zhou <zhouyates@gmail.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-18 10:58:10 +02:00
Junjie Chen	1c9467a6ef	eal/x86: force inlining of memcpy sub-functions Sometimes gcc does not inline the function despite keyword inline, we observe rte_movX is not inline when doing performance profiling, so use always_inline keyword to force gcc to inline the function. Signed-off-by: Junjie Chen <junjie.j.chen@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2018-04-18 09:22:56 +02:00
Jianfeng Tan	83a73c5fef	vfio: use generic multi-process channel Previously, vfio uses its own private channel for the secondary process to get container fd and group fd from the primary process. This patch changes to use the generic mp channel. Test: 1. Bind two NICs to vfio-pci. 2. Start the primary and secondary process.
    $ (symmetric_mp) -c 2 -- -p 3 --num-procs=2 --proc-id=0
    $ (symmetric_mp) -c 4 --proc-type=auto -- -p 3 \
        --num-procs=2 --proc-id=1
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-18 01:26:06 +02:00
Adrien Mazarguil	6b298c6285	eal: fix signed integers in fbarray While debugging startup issues encountered with Clang (see "eal: fix undefined behavior in fbarray"), I noticed that fbarray stores indices, sizes and masks on signed integers involved in bitwise operations. Such operations almost invariably cause undefined behavior with values that cannot be represented by the result type, as is often the case with bit-masks and left-shifts. This patch replaces them with unsigned integers as a safety measure and promotes a few internal variables to larger types for consistency. Coverity issue: 272598, 272599 Fixes: `c44d09811b` ("eal: add shared indexed file-backed array") Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-17 14:38:16 +02:00
Adrien Mazarguil	f2e5e85824	eal: fix undefined behavior in fbarray According to GCC documentation [1], the __builtin_clz() family of functions yield undefined behavior when fed a zero value. There is one instance in the fbarray code where this can occur. Clang (at least version 3.8.0-2ubuntu4) seems much more sensitive to this than GCC and yields random results when compiling optimized code, as shown below:

    #include <stdio.h>

    int main(void)
    {
        volatile unsigned long long moo;
        int x;

        moo = 0;
        x = __builtin_clzll(moo);
        printf("%d\n", x);
        return 0;
    }

    $ gcc -O3 -o test test.c && ./test
    63
    $ clang -O3 -o test test.c && ./test
    1742715559
    $ clang -O0 -o test test.c && ./test
    63

Even 63 can be considered an unexpected result given the number of leading zeroes should be the full width of the underlying type, i.e. 64. In practice it causes find_next_n() to sometimes return negative values interpreted as errors by caller functions, which prevents DPDK applications from starting due to inability to find free memory segments:

    # testpmd [...]
    EAL: Detected 32 lcore(s)
    EAL: Detected 2 NUMA nodes
    EAL: No free hugepages reported in hugepages-1048576kB
    EAL: Multi-process socket /var/run/.rte_unix
    EAL: eal_memalloc_alloc_seg_bulk(): couldn't find suitable memseg_list
    EAL: FATAL: Cannot init memory
    EAL: Cannot init memory
    PANIC in main(): Cannot init EAL
    4: [./build/app/testpmd(_start+0x29) [0x462289]]
    3: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f19d54fc830]]
    2: [./build/app/testpmd(main+0x8a3) [0x466193]]
    1: [./build/app/testpmd(__rte_panic+0xd6) [0x4efaa6]]
    Aborted

This problem appears with commit `66cc45e293` ("mem: replace memseg with memseg lists") however the root cause is introduced by a prior patch.
[1] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html Fixes: `c44d09811b` ("eal: add shared indexed file-backed array") Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-17 14:37:27 +02:00
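The usual way to avoid the undefined behavior described in f2e5e85824 is to handle the zero input before calling the builtin. The sketch below is not the fbarray fix itself; `clz64_safe` is a hypothetical helper name, and returning 64 for zero is an assumption based on the commit's remark that the full type width is the expected answer.

```c
#include <stdint.h>

/* Hypothetical helper: count leading zeroes without undefined behavior. */
static inline int
clz64_safe(uint64_t v)
{
	/* __builtin_clzll(0) is undefined, so handle zero explicitly. */
	if (v == 0)
		return 64;
	return __builtin_clzll((unsigned long long)v);
}
```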
Anatoly Burakov	079527f069	malloc: fix not unlocking hotplug on fail to init We lock the hotplug during init, but do not unlock it if we couldn't register multiprocess callbacks. Add the missing unlock. Fixes: `07dcbfe010` ("malloc: support multiprocess memory hotplug") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-17 12:36:40 +02:00
Anatoly Burakov	48e9728898	ipc: fix missing mutex unlocks on failed send Earlier fix for race condition introduced a bug where mutex wasn't unlocked if message failed to be sent. Fix all of this by moving locking out of mp_request_sync() altogether. Fixes: `da5957821b` ("eal: fix race condition in IPC request") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-17 10:23:05 +02:00
Anatoly Burakov	7d863e253e	ipc: fix missing ignore message name We are trying to notify sender that response from current process should be ignored, but we didn't specify which request this response was for. Fix by copying request name from the original message. Fixes: `579a4ccc34` ("eal: ignore IPC messages until init is complete") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-17 01:27:45 +02:00
Anatoly Burakov	35ae44d1e2	ipc: fix use-after-free in asynchronous requests Previously, we were removing request from the list only if we have succeeded to send it. This resulted in leaving an invalid pointer in the request list. Fix this by only adding new requests to the request list if we have succeeded in sending them. Fixes: `f05e26051c` ("eal: add IPC asynchronous request") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-17 01:27:27 +02:00
Anatoly Burakov	fe98e52a52	ipc: fix use-after-free in synchronous requests Previously, we were adding synchronous requests to request list, we were doing it after checking if request existed. However, we only removed the request from the request list if we have succeeded in sending the request. In case of failed request send, we left an invalid pointer in the request list. Fix this by only adding request to the list once we succeed in sending it. Fixes: `783b6e5497` ("eal: add synchronous multi-process communication") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-17 01:27:21 +02:00
Anatoly Burakov	2ae831fb42	ipc: stop async IPC loop on callback request EAL did not stop processing further asynchronous requests on encountering a request that should trigger the callback. This resulted in erasing valid requests but not triggering them. Fix this by stopping the loop once we have a request that can trigger the callback. Once triggered, we go back to scanning the request queue until there are no more callbacks to trigger. Fixes: `f05e26051c` ("eal: add IPC asynchronous request") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-17 01:27:20 +02:00
Anatoly Burakov	6e8a721044	vfio: export functions even when disabled Previously, VFIO functions were not compiled in and exported if VFIO compilation was disabled. Fix this by actually compiling all of the functions unconditionally, and provide missing prototypes on Linux. Fixes: `279b581c89` ("vfio: expose functions") Fixes: `73a6390859` ("vfio: allow to map other memory regions") Fixes: `964b2f3bfb` ("vfio: export some internal functions") Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-16 19:33:46 +02:00
Jeff Guo	0d0f478d04	eal/linux: add uevent parse and process In order to handle the uevent which has been detected from the kernel side, add uevent parse and process function to translate the uevent into device event, which user has subscribed to monitor. Signed-off-by: Jeff Guo <jia.guo@intel.com> Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-13 12:00:31 +02:00
Jeff Guo	a753e53d51	eal: add device event monitor framework This patch aims to add a general device event monitor framework at EAL device layer, for device hotplug awareness and actions adopted accordingly. It could also be expanded to all other types of device event monitoring, but that is out of scope at this stage. To get started, users first call the newly added APIs below to enable/disable the device event monitor mechanism: - rte_dev_event_monitor_start - rte_dev_event_monitor_stop Then users shall register or unregister callbacks through the newly added APIs. Callbacks can be device specific, or for all devices. - rte_dev_event_callback_register - rte_dev_event_callback_unregister Taking hotplug as an example: on device insertion or removal, we get notified by the kernel, then call the user's callbacks accordingly to handle it, such as detaching or attaching the device from the bus, which could further benefit fail-safe or live-migration. Signed-off-by: Jeff Guo <jia.guo@intel.com> Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-13 12:00:31 +02:00
Jeff Guo	493b8e173f	eal: add device event handle in interrupt thread Add new interrupt handle type of RTE_INTR_HANDLE_DEV_EVENT, for device event interrupt monitor. Signed-off-by: Jeff Guo <jia.guo@intel.com> Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com>	2018-04-13 10:49:26 +02:00
Anatoly Burakov	08a20b3d37	vfio: fix device hotplug when several devices per group We only need to perform DMA mapping for first device in first group. At the time of mapping, we haven't yet added the device into the group, so the count is expected to be zero. Fixes: `810bfa64c6` ("vfio: fix index for tracking devices in a group") Fixes: `a9c349e3a1` ("vfio: fix device unplug when several devices per group") Fixes: `94c0776b1b` ("vfio: support hotplug") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-13 01:17:55 +02:00
Hemant Agrawal	964b2f3bfb	vfio: export some internal functions This patch moves some of the internal vfio functions from eal_vfio.h to rte_vfio.h for common uses with "rte_" prefix. This patch also changes the FSLMC bus usages from the internal VFIO functions to external ones with "rte_" prefix. Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-13 01:06:57 +02:00
Hemant Agrawal	c94eb6db0a	doc: add VFIO API in doxygen Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com>	2018-04-13 01:06:12 +02:00
Neil Horman	34fbfa585c	mem: set fd to -1 for anonymous mmap https://dpdk.org/tracker/show_bug.cgi?id=18 Indicated that several mmap call sites in the [linux\|bsd]app eal code set fd that was not -1 in their calls while using MAP_ANONYMOUS. While probably not a huge deal, the man page does say the fd should be -1 for portability, as some implementations don't ignore fd as they should for MAP_ANONYMOUS. Suggested-by: Solal Pirelli <solal.pirelli@gmail.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2018-04-12 14:44:24 +02:00
Pavan Nikhilesh	7bdccb9307	eal: fix ARM build with clang Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8). The error was: include/generic/rte_atomic.h:215:9: error: implicit declaration of function '__atomic_exchange_2' is invalid in C99 include/generic/rte_atomic.h:494:9: error: implicit declaration of function '__atomic_exchange_4' is invalid in C99 include/generic/rte_atomic.h:772:9: error: implicit declaration of function '__atomic_exchange_8' is invalid in C99 Fixes: `ff2863570f` ("eal: introduce atomic exchange operation") Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>	2018-04-11 22:39:50 +02:00
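The generic builtin named in 7bdccb9307 can be used as sketched below. The wrapper name `atomic32_exchange` and the choice of `__ATOMIC_SEQ_CST` are assumptions for illustration; the commit message does not state which memory order is used.

```c
#include <stdint.h>

/* Hypothetical helper: atomically store val in *dst and return the old value. */
static inline uint32_t
atomic32_exchange(volatile uint32_t *dst, uint32_t val)
{
	/* The _n variant is type-generic, so no size-suffixed builtin is needed. */
	return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
}
```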
Anatoly Burakov	6f63858e55	mem: prevent preallocated pages from being freed It is common sense to expect for DPDK process to not deallocate any pages that were preallocated by "-m" or "--socket-mem" flags - yet, currently, DPDK memory subsystem will do exactly that once it finds that the pages are unused. Fix this by marking pages as unfreeable, and preventing malloc from ever trying to free them. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:56 +02:00
Anatoly Burakov	93723dd917	malloc: enable validation before new page allocation Before allocating a new page, give a chance to the user to allow or deny allocation via callbacks. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:56 +02:00
Anatoly Burakov	2e378ff297	mem: add validator callback This API will enable application to register for notifications on page allocations that are about to happen, giving the application a chance to allow or deny the allocation when total memory utilization as a result would be above specified limit on specified socket. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:56 +02:00
Anatoly Burakov	6b42f75632	eal: enable non-legacy memory mode Now that every other piece of the puzzle is in place, enable non-legacy init mode. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:56 +02:00
Anatoly Burakov	43e4631371	vfio: support memory event callbacks Enable callbacks on first device attach, disable callbacks on last device attach. PPC64 IOMMU does memseg walk, which will cause a deadlock on trying to do it inside a callback, so provide a local, thread-unsafe copy of memseg walk. PPC64 IOMMU also may remap the entire memory map for DMA while adding new elements to it, so change user map list lock to a recursive lock. That way, we can safely enter rte_vfio_dma_map(), lock the user map list, enter DMA mapping function and lock the list again (for reading previously existing maps). Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	76b15480d6	malloc: enable callbacks on alloc/free and mp sync Callbacks will be triggered just after allocation and just before deallocation, to ensure that memory address space referenced in the callback is always valid by the time callback is called. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	56efb4c117	malloc: support callbacks on memory events Each process will have its own callbacks. Callbacks will indicate whether an allocation or a deallocation has happened, and will also provide start VA address and length of allocated block. Since memory hotplug isn't supported on FreeBSD and in legacy mem mode, it will not be possible to register them in either. Callbacks are called whenever something happens to the memory map of current process, therefore at those times memory hotplug subsystem is write-locked, which leads to deadlocks on attempt to use these functions. Document the limitation. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	07dcbfe010	malloc: support multiprocess memory hotplug This enables multiprocess synchronization for memory hotplug requests at runtime (as opposed to initialization). Basic workflow is the following. Primary process always does initial mapping and unmapping, and secondary processes always follow primary page map. Only one allocation request can be active at any one time. When primary allocates memory, it ensures that all other processes have allocated the same set of hugepages successfully, otherwise any allocations made are being rolled back, and heap is freed back. Heap is locked throughout the process, and there is also a global memory hotplug lock, so no race conditions can happen. When primary frees memory, it frees the heap, deallocates affected pages, and notifies other processes of deallocations. Since heap is freed from that memory chunk, the area basically becomes invisible to other processes even if they happen to fail to unmap that specific set of pages, so it's completely safe to ignore results of sync requests. When secondary allocates memory, it does not do so by itself. Instead, it sends a request to primary process to try and allocate pages of specified size and on specified socket, such that a specified heap allocation request could complete. Primary process then sends all secondaries (including the requestor) a separate notification of allocated pages, and expects all secondary processes to report success before considering pages as "allocated". Only after primary process ensures that all memory has been successfully allocated in all secondary process, it will respond positively to the initial request, and let secondary proceed with the allocation. Since the heap now has memory that can satisfy allocation request, and it was locked all this time (so no other allocations could take place), secondary process will be able to allocate memory from the heap. When secondary frees memory, it hides pages to be deallocated from the heap. 
Then, it sends a deallocation request to primary process, so that it deallocates pages itself, and then sends a separate sync request to all other processes (including the requestor) to unmap the same pages. This way, even if secondary fails to notify other processes of this deallocation, that memory will become invisible to other processes, and will not be allocated from again. So, to summarize: address space will only become part of the heap if primary process can ensure that all other processes have allocated this memory successfully. If anything goes wrong, the worst thing that could happen is that a page will "leak" and will be available to neither DPDK nor the system, as some process will still hold onto it. It's not an actual leak, as we can account for the page - it's just that none of the processes will be able to use this page for anything useful, until it gets allocated from by the primary. Due to underlying DPDK IPC implementation being single-threaded, some asynchronous magic had to be done, as we need to complete several requests before we can definitively allow secondary process to use allocated memory (namely, it has to be present in all other secondary processes before it can be used). Additionally, only one allocation request is allowed to be submitted at once. Memory allocation requests are only allowed when there are no secondary processes currently initializing. To enforce that, a shared rwlock is used, that is set to read lock on init (so that several secondaries could initialize concurrently), and write lock on making allocation requests (so that either secondary init will have to wait, or allocation request will have to wait until all processes have initialized). Any other function that wishes to iterate over memory or prevent allocations should be using memory hotplug lock.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	1403f87d4f	malloc: enable memory hotplug support This set of changes enables rte_malloc to allocate and free memory as needed. Currently, it is disabled because legacy mem mode is enabled unconditionally. The way it works is, first malloc checks if there is enough memory already allocated to satisfy user's request. If there isn't, we try and allocate more memory. The reverse happens with free - we free an element, check its size (including free element merging due to adjacency) and see if it's bigger than hugepage size and that its start and end span a hugepage or more. Then we remove the area from malloc heap (adjusting element lengths where appropriate), and deallocate the page. For legacy mode, runtime alloc/free of pages is disabled. It is worth noting that memseg lists are being sorted by page size, and that we try our best to satisfy user's request. That is, if the user requests an element from a 2MB page memory, we will check if we can satisfy that request from existing memory, if not we try and allocate more 2MB pages. If that fails and user also specified a "size is hint" flag, we then check other page sizes and try to allocate from there. If that fails too, then, depending on flags, we may try allocating from other sockets. In other words, we try our best to give the user what they asked for, but going to other sockets is last resort - first we try to allocate more memory on the same socket. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
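The free-path check described above (does a freed element cover at least one whole hugepage?) comes down to aligning the start up and the end down to the page size. A minimal sketch of that arithmetic follows; the function name and signature are hypothetical and not taken from the DPDK sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the whole pages fully contained in [start, start + len).
 * If at least one page fits and aligned_start is non-NULL, the address of
 * the first such page is stored there. page_sz must be a power of two.
 */
size_t whole_pages_in_range(uint64_t start, uint64_t len, uint64_t page_sz,
		uint64_t *aligned_start)
{
	assert(page_sz != 0 && (page_sz & (page_sz - 1)) == 0);

	uint64_t end = start + len;
	uint64_t first = (start + page_sz - 1) & ~(page_sz - 1); /* round up */
	uint64_t last = end & ~(page_sz - 1);                    /* round down */

	if (first >= last)
		return 0; /* element is big enough only if it spans a full page */
	if (aligned_start != NULL)
		*aligned_start = first;
	return (size_t)((last - first) / page_sz);
}
```

An element longer than one page can still return 0 here if it straddles a page boundary without covering a full page on either side, which is why the commit checks both size and span.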
Anatoly Burakov	6167d81488	mem: add secondary process init with memory hotplug Secondary initialization will just sync memory map with primary process. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	cb97d93e9d	mem: share hugepage info primary and secondary Since we are going to need to map hugepages in both primary and secondary processes, we need to know where we should look for hugetlbfs mountpoints. So, share those with secondary processes, and map them on init. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	41519b9006	mem: make use of memory hotplug for init Add a new (non-legacy) memory init path for EAL. It uses the new memory hotplug facilities. If no -m or --socket-mem switches were specified, the new init will not allocate anything, whereas if those switches were passed, appropriate amounts of pages would be requested, just like for legacy init. Allocated pages will be physically discontiguous (or rather, they're not guaranteed to be physically contiguous - they may still be so by accident) unless RTE_IOVA_VA mode is used. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	b666f17858	mem: read hugepage counts from node-specific sysfs path For non-legacy memory init mode, instead of looking at generic sysfs path, look at sysfs paths pertaining to each NUMA node for hugepage counts. Note that per-NUMA node path does not provide information regarding reserved pages, so we might not get the best info from these paths, but this saves us from the whole mapping/remapping business before we're actually able to tell which page is on which socket, because we no longer require our memory to be physically contiguous. Legacy memory init will not use this. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	524e43c2ad	mem: prepare memseg lists for multiprocess sync In preparation for implementing multiprocess support, we are adding a version number to memseg lists. We will not need any locks, because memory hotplug will have a global lock (so any time memory map and thus version number might change, we will already be holding a lock). There are two ways of implementing multiprocess support for memory hotplug: either all information about mapped memory is shared between processes, and secondary processes simply attempt to map/unmap memory based on requests from the primary, or secondary processes store their own maps and only check if they are in sync with the primary process' maps. This implementation will opt for the latter option: primary process shared mappings will be authoritative, and each secondary process will use its own internal view of mapped memory, and will attempt to synchronize on these mappings using versioning. Under this model, only primary process will decide which pages get mapped, and secondary processes will only copy primary's page maps and get notified of the changes via IPC mechanism (coming in later commits). Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
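The versioning model above can be sketched as a secondary comparing its local version against the primary's and copying the page map only when they differ. The struct layout and names below are hypothetical; real synchronization would also map or unmap the affected pages instead of just copying flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_PAGES 8

/* Hypothetical per-list view: a version plus one "mapped" flag per page. */
struct sketch_msl_view {
	uint64_t version;
	uint8_t mapped[SKETCH_MAX_PAGES];
};

/*
 * Bring the secondary's local view in line with the primary's authoritative
 * one. Returns 1 if the local view changed, 0 if it was already in sync.
 * The caller is assumed to hold the global memory hotplug lock.
 */
int sketch_sync_with_primary(const struct sketch_msl_view *primary,
		struct sketch_msl_view *local)
{
	assert(primary != NULL && local != NULL);

	if (local->version == primary->version)
		return 0;
	memcpy(local->mapped, primary->mapped, sizeof(local->mapped));
	local->version = primary->version;
	return 1;
}
```

The version check keeps the common case cheap: a secondary whose view is current does no copying at all.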
Anatoly Burakov	c8f73de36e	mem: add function to check if memory is contiguous For now, memory is always contiguous because legacy mem mode is enabled unconditionally, but this function will be helpful down the line when we implement support for allocating physically non-contiguous memory. We can no longer guarantee physically contiguous memory unless we're in legacy or IOVA_AS_VA mode, but we can certainly try and see if we succeed. In addition, this would be useful for e.g. PMD's who may allocate chunks that are smaller than the pagesize, but they must not cross the page boundary, in which case we will be able to accommodate that request. This function will also support non-hugepage memory. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
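As an illustration of the two checks this commit mentions, the sketch below tests whether a run of pages is IOVA-contiguous and whether a sub-page chunk stays within one page. Both helpers are hypothetical and work on a plain array of per-page IOVA addresses, which is an assumption about how the data is laid out, not the actual EAL representation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if each page's IOVA directly follows the previous page's. */
int sketch_pages_iova_contig(const uint64_t *page_iova, size_t n_pages,
		uint64_t page_sz)
{
	assert(page_iova != NULL && page_sz != 0);

	for (size_t i = 1; i < n_pages; i++) {
		if (page_iova[i] != page_iova[i - 1] + page_sz)
			return 0;
	}
	return 1;
}

/* Return 1 if [addr, addr + len) does not cross a page boundary. */
int sketch_fits_in_one_page(uint64_t addr, uint64_t len, uint64_t page_sz)
{
	assert(len != 0 && page_sz != 0);

	return addr / page_sz == (addr + len - 1) / page_sz;
}
```

A chunk that fits in one page is trivially contiguous, which is the case the commit message calls out for PMDs allocating less than a page.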
Anatoly Burakov	2a04139f66	eal: add single file segments option Currently, DPDK stores all pages as separate files in hugetlbfs. This option will allow storing all pages in one file (one file per memseg list). We do this by using fallocate() calls on hugetlbfs, however this is only supported on fairly recent (4.3+) kernels, so ftruncate() fallback is provided to grow (but not shrink) hugepage files. Naming scheme is deterministic, so both primary and secondary processes will be able to easily map needed files and offsets. For multi-file segments, we can close fd's right away. For single-file segments, we can reuse the same fd and reduce the amount of fd's needed to map/use hugepages. However, we need to store the fd's somewhere, so we add a tailq. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 21:45:55 +02:00
Anatoly Burakov	a5ff05d60f	mem: support unmapping pages at runtime This isn't used anywhere yet, but the support is now there. Also, adding cleanup to allocation procedures, so that if we fail to allocate everything we asked for, we can free all of it back. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:57:20 +02:00
Anatoly Burakov	582bed1e1d	mem: support mapping hugepages at runtime Nothing uses this code yet. The bulk of it is copied from old memory allocation code (linuxapp eal_memory.c). We provide an EAL-internal API to allocate either one page or multiple pages, guaranteeing that we'll get contiguous VA for all of the pages that we requested. Not supported on FreeBSD. Locking is done via fcntl() because that way, when it comes to taking out write locks or unlocking on deallocation, we don't have to keep original fd's around. Plus, using fcntl() gives us ability to lock parts of a file, which is useful for single-file segments, which are coming down the line. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:56:37 +02:00
Anatoly Burakov	49df3db848	memzone: replace memzone array with fbarray It's there, so we might as well use it. Some operations will be sped up by that. Since we have to allocate an fbarray for memzones, we have to do it before we initialize memory subsystem, because that, in secondary processes, will (later) allocate more fbarrays than the primary process, which will result in inability to attach to memzone fbarray if we do it after the fact. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:56:30 +02:00
Anatoly Burakov	66cc45e293	mem: replace memseg with memseg lists Before, we were aggregating multiple pages into one memseg, so the number of memsegs was small. Now, each page gets its own memseg, so the list of memsegs is huge. To accommodate the new memseg list size and to keep the under-the-hood workings sane, the memseg list is now not just a single list, but multiple lists. To be precise, each hugepage size available on the system gets one or more memseg lists, per socket. In order to support dynamic memory allocation, we reserve all memory in advance (unless we're in 32-bit legacy mode, in which case we do not preallocate memory). As in, we do an anonymous mmap() of the entire maximum size of memory per hugepage size, per socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or RTE_MAX_MEM_MB_PER_TYPE megabytes worth of memory, whichever is the smaller one), split over multiple lists (which are limited to either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_MB_PER_LIST megabytes per list, whichever is the smaller one). There is also a global limit of CONFIG_RTE_MAX_MEM_MB megabytes, which is mainly used for 32-bit targets to limit amounts of preallocated memory, but can be used to place an upper limit on total amount of VA memory that can be allocated by DPDK application. So, for each hugepage size, we get (by default) up to 128G worth of memory, per socket, split into chunks of up to 32G in size. The address space is claimed at the start, in eal_common_memory.c. The actual page allocation code is in eal_memalloc.c (Linux-only), and largely consists of copied EAL memory init code. Pages in the list are also indexed by address. That is, in order to figure out where the page belongs, one can simply look at base address for a memseg list. Similarly, figuring out IOVA address of a memzone is a matter of finding the right memseg list, getting offset and dividing by page size to get the appropriate memseg. 
This commit also removes rte_eal_dump_physmem_layout() call, according to deprecation notice [1], and removes that deprecation notice as well. On 32-bit targets due to limited VA space, DPDK will no longer spread memory to different sockets like before. Instead, it will (by default) allocate all of the memory on socket where master lcore is. To override this behavior, --socket-mem must be used. The rest of the changes are really ripple effects from the memseg change - heap changes, compile fixes, and rewrites to support fbarray-backed memseg lists. Due to earlier switch to _walk() functions, most of the changes are simple fixes, however some of the _walk() calls were switched to memseg list walk, where it made sense to do so. Additionally, we are also switching locks from flock() to fcntl(). Down the line, we will be introducing single-file segments option, and we cannot use flock() locks to lock parts of the file. Therefore, we will use fcntl() locks for legacy mem as well, in case someone is unfortunate enough to accidentally start legacy mem primary process alongside an already working non-legacy mem-based primary process. [1] http://dpdk.org/dev/patchwork/patch/34002/ Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:39 +02:00
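The address-indexed lookup described in this commit (take the offset from the list's base address and divide by the page size) can be sketched as follows. The `sketch_msl` struct and its fields are hypothetical stand-ins, not the real memseg list definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memseg list: one VA-contiguous area of equally sized pages. */
struct sketch_msl {
	uint64_t base_va; /* start of the reserved address space */
	uint64_t page_sz; /* size of every page in this list */
	size_t n_segs;    /* number of page slots in the list */
};

/* Return the index of the page holding addr, or -1 if outside the list. */
long sketch_addr_to_seg_idx(const struct sketch_msl *msl, uint64_t addr)
{
	assert(msl != NULL && msl->page_sz != 0);

	if (addr < msl->base_va)
		return -1;
	uint64_t idx = (addr - msl->base_va) / msl->page_sz;
	return idx < msl->n_segs ? (long)idx : -1;
}
```

Because every slot has a fixed address, the lookup needs no search within a list; finding the right list is the only step that involves walking anything.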
Anatoly Burakov	c44d09811b	eal: add shared indexed file-backed array rte_fbarray is a simple indexed array stored in shared memory via mapping files into memory. Rationale for its existence is the following: since we are going to map memory page-by-page, there could be quite a lot of memory segments to keep track of (for smaller page sizes, page count can easily reach thousands). We can't really make page lists truly dynamic and infinitely expandable, because that involves reallocating memory (which is a big no-no in multiprocess). What we can do instead is have a maximum capacity as something really, really large, and decide at allocation time how big the array is going to be. We map the entire file into memory, which makes it possible to use fbarray as shared memory, provided the structure itself is allocated in shared memory. Per-fbarray locking is also used to avoid index data races (but not contents data races - that is up to user application to synchronize). In addition, in understanding that we will frequently need to scan this array for free space and iterating over array linearly can become slow, rte_fbarray provides facilities to index array's usage. The following use cases are covered: - find next free/used slot (useful either for adding new elements to fbarray, or walking the list) - find starting index for next N free/used slots (useful for when we want to allocate chunk of VA-contiguous memory composed of several pages) - find how many contiguous free/used slots there are, starting from specified index (useful for when we want to figure out how many pages we have until next hole in allocated memory, to speed up some bulk operations where we would otherwise have to walk the array and add pages one by one) This is accomplished by storing a usage mask in-memory, right after the data section of the array, and using some bit-level magic to figure out the info we need. 
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:21 +02:00
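The three lookups listed in the fbarray commit above can be illustrated on a plain bitmask where a set bit marks a used slot. This sketch walks the mask one bit at a time for clarity; the commit describes a faster bit-level approach, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if slot idx is marked used in the mask. */
static int sketch_slot_used(const uint64_t *mask, size_t idx)
{
	return (int)((mask[idx / 64] >> (idx % 64)) & 1);
}

/* Index of the first run of n free slots at or after start, or -1. */
long sketch_find_next_n_free(const uint64_t *mask, size_t len, size_t start,
		size_t n)
{
	size_t run = 0;

	assert(mask != NULL && n > 0);
	for (size_t i = start; i < len; i++) {
		if (sketch_slot_used(mask, i))
			run = 0; /* run broken, start counting again */
		else if (++run == n)
			return (long)(i + 1 - n);
	}
	return -1;
}

/* Index of the next free slot at or after start, or -1. */
long sketch_find_next_free(const uint64_t *mask, size_t len, size_t start)
{
	return sketch_find_next_n_free(mask, len, start, 1);
}

/* Number of contiguous free slots beginning exactly at start. */
size_t sketch_find_contig_free(const uint64_t *mask, size_t len, size_t start)
{
	size_t i = start;

	assert(mask != NULL);
	while (i < len && !sketch_slot_used(mask, i))
		i++;
	return i - start;
}
```

The "used" variants would be the same loops with the bit test inverted.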
Anatoly Burakov	182cf0c28d	eal: add legacy memory option This adds a "--legacy-mem" command-line switch. It will be used to go back to the old memory behavior, one where we can't dynamically allocate/free memory (the downside), but one where the user can get physically contiguous memory, like before (the upside). For now, nothing but the legacy behavior exists, non-legacy memory init sequence will be added later. For FreeBSD, non-legacy memory init will never be enabled, while for Linux, it is disabled in this patch to avoid breaking bisect, but will be enabled once non-legacy mode is fully operational. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:13 +02:00
Anatoly Burakov	73a6390859	vfio: allow to map other memory regions Currently it is not possible to use memory that is not owned by DPDK to perform DMA. This scenario might be used in vhost applications (like SPDK) where guest sends its own memory table. To fill this gap provide API to allow registering arbitrary address in VFIO container. Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:10 +02:00
Anatoly Burakov	aa6a098a8f	memzone: use walk instead of iteration for dumping Simplify memzone dump code to use memzone walk, to not maintain the same memzone iteration code twice. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:55:05 +02:00
Anatoly Burakov	f901e64d21	mem: add virt2memseg function This can be used as a virt2iova function that only looks up memory that is owned by DPDK (as opposed to doing pagemap walks). Using this will result in less dependency on internals of mem API. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:54:44 +02:00
Anatoly Burakov	eca28edd98	mem: add iova2virt function This is reverse lookup of PA to VA. Using this will make other code less dependent on internals of mem API. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:54:00 +02:00
Anatoly Burakov	552afc420a	mem: add contig walk function This function is meant to walk over first segment of each VA-contiguous group of memsegs. For future users of this function, this is done so that there is less dependency on internals of mem API and less noise in later change sets. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:53:38 +02:00
Anatoly Burakov	20681b17ba	vfio/spapr: use memseg walk instead of iteration Reduce dependency on internal details of EAL memory subsystem, and simplify code. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:53:36 +02:00
Anatoly Burakov	12167c0cc2	vfio/type1: use memseg walk instead of iteration Reduce dependency on internal details of EAL memory subsystem, and simplify code. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:53:08 +02:00
Anatoly Burakov	221b67bca0	eal: use memseg walk instead of iteration Reduce dependency on internal details of EAL memory subsystem, and simplify code. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com> Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>	2018-04-11 19:48:15 +02:00


1848 Commits