Commit Graph

69 Commits

Anatoly Burakov
6080796f65 mem: make base address hint OS specific
Not all OSes follow Linux's memory layout, which may lead to
problems with the suggested common address hint in the absence
of a base-virtaddr flag. Make this address hint OS-specific.
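
As a hedged illustration (not part of this commit), an application that needs a specific layout can bypass the hint entirely by supplying its own base address; the value below is an assumption and must suit the target OS:

```
#include <rte_eal.h>

int
main(void)
{
	/* the address value is an assumption; pick one that suits the OS */
	char *eal_args[] = { "app", "--base-virtaddr=0x200000000" };

	if (rte_eal_init(2, eal_args) < 0)
		return -1;
	return rte_eal_cleanup();
}
```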

Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2019-10-26 18:03:24 +02:00
Anatoly Burakov
028669bc9f eal: hide shared memory config
Now that everything that has ever accessed the shared memory
config is doing so through the public APIs, we can make it
internal. Since we're removing quite a few headers from
rte_eal_memconfig.h, we need to add them back in places
where this header is used.

This bumps the ABI, so also change all build files and
update the documentation.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Marchand <david.marchand@redhat.com>
2019-07-06 10:32:34 +02:00
Anatoly Burakov
76f80881ef mem: add API to lock/unlock memory hotplug
Currently, the memory hotplug subsystem is locked automatically
by all memory-related _walk() functions, but sometimes locking
the memory subsystem outside of them is needed. There is no
public API to do that, which creates a dependency on the shared
memory config remaining public. Fix this by introducing a new
API to lock/unlock the memory hotplug subsystem.

Create a new common file for all things mem config, add a
new API namespace rte_mcfg_*, and search-and-replace all
usages of the locks with the new API.
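
A minimal sketch of the new namespace in use, assuming the rte_mcfg_mem_* names introduced here; the lock pairs naturally with a thread-unsafe walk:

```
#include <rte_eal_memconfig.h>
#include <rte_memory.h>

static int
count_segs(const struct rte_memseg_list *msl,
	   const struct rte_memseg *ms, void *arg)
{
	(void)msl; (void)ms;
	(*(int *)arg)++;
	return 0; /* zero continues the walk */
}

int
count_all_segs(void)
{
	int n = 0;

	rte_mcfg_mem_read_lock();
	/* safe: we hold the hotplug lock ourselves */
	rte_memseg_walk_thread_unsafe(count_segs, &n);
	rte_mcfg_mem_read_unlock();
	return n;
}
```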

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Marchand <david.marchand@redhat.com>
2019-07-05 22:12:40 +02:00
David Marchand
cfe3aeb170 remove experimental tags from all symbol definitions
We had some inconsistencies between function prototypes and actual
definitions.
Let's avoid this by only adding the experimental tag to the prototypes.
Tests with gcc and clang show this is enough.

git grep -l __rte_experimental |grep \.c$ |while read file; do
	sed -i -e '/^__rte_experimental$/d' $file;
	sed -i -e 's/  *__rte_experimental//' $file;
	sed -i -e 's/__rte_experimental  *//' $file;
done

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2019-06-29 19:04:43 +02:00
David Marchand
146e002c68 mem: remove incorrect experimental tag on static symbol
This function is not visible from outside this code unit.

Fixes: 84e7477e10 ("mem: add thread unsafe version for DMA mask check")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2019-06-29 19:04:39 +02:00
Shahaf Shuler
237060c4ad mem: limit use of address hint
The commit below added an address hint as the starting address for
64-bit systems in case an explicit base virtual address was not set
by the user.

The justification for such a hint was to help devices that work in VA
mode and have an address range limitation to work smoothly with the
EAL memory subsystem.

While the base address value selected may work fine for EAL
initialization, it easily breaks when trying to register external
memory using the rte_extmem_register API.

Trying to register anonymous memory on a Red Hat x86_64 machine took
several minutes, during which the function eal_get_virtual_area
repeatedly scanned for a good VA candidate.

Attempting to guess which VA address will be free for mapping will
always result in non-portable, error-prone code:
* different applications may use different libraries alongside DPDK.
  One can never guess which library was called first and how much
  virtual memory it consumed.
* external memory can be registered at any point in the application's
  run time.

In order not to break the existing secondary process design, this
patch only limits the maximum number of tries that will be made with
the address hint.
When the number of tries exceeds the threshold, the code falls back
to the address suggested by the kernel.
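
A hypothetical sketch of the bounded-retry logic described above; the constant, names, and fallback are illustrative, not DPDK's actual code:

```
#include <stddef.h>
#include <sys/mman.h>

#define MAX_HINT_TRIES 10 /* assumed threshold */

static void *
map_with_bounded_hint(void *hint, size_t len, size_t page_sz)
{
	int tries;

	for (tries = 0; tries < MAX_HINT_TRIES; tries++) {
		void *va = mmap(hint, len, PROT_NONE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (va == hint)
			return va;       /* kernel respected the hint */
		if (va != MAP_FAILED)
			munmap(va, len); /* wrong spot, try a higher hint */
		hint = (char *)hint + page_sz;
	}
	/* give up on the hint; take whatever the kernel suggests */
	return mmap(NULL, len, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```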

Fixes: 1df2170287 ("mem: use address hint for mapping hugepages")
Cc: stable@dpdk.org

Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Tested-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Alejandro Lucero <alejandro.lucero@netronome.com>
2019-04-03 19:10:47 +02:00
Anatoly Burakov
525670756a mem: fix segment fd API error code for external segment
The segment fd API does not support getting segment fds from
externally allocated memory, so return a proper error code on
any attempt to do so. This changes API behavior, so document
the change as well.
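
A sketch of the resulting caller-side behavior; the specific error code (ENODEV) is an assumption here, so check the API docs:

```
#include <errno.h>
#include <rte_errno.h>
#include <rte_memory.h>

static int
try_get_fd(const struct rte_memseg *ms)
{
	int fd = rte_memseg_get_fd(ms);

	if (fd < 0 && rte_errno == ENODEV) {
		/* externally allocated segment: no fd to hand out */
		return -1;
	}
	return fd;
}
```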

Fixes: 5282bb1c36 ("mem: allow memseg lists to be marked as external")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Tiwei Bie <tiwei.bie@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-12-20 22:51:49 +01:00
Anatoly Burakov
bed7941886 mem: allow usage of non-heap external memory in multiprocess
Add multiprocess support for externally allocated memory areas that
are not added to DPDK heap (and add relevant doc sections).

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
2018-12-20 18:14:55 +01:00
Anatoly Burakov
950e8fb4e1 mem: allow registering external memory areas
The general use case of external memory is well covered by the
existing external memory APIs. However, certain use cases require
manual management of externally allocated memory areas, so this
memory should not be added to the heap. It should, however, be
added to DPDK's internal structures, so that APIs like
``rte_virt2memseg`` work on such external memory segments.

This commit adds such an API to DPDK: new functions to register
and unregister externally allocated memory areas, along with
documentation for them.
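
A minimal sketch of the new API in use, under the assumption that NULL IOVAs are acceptable when I/O addresses are unknown; the helper name and sizes are illustrative:

```
#include <sys/mman.h>
#include <rte_memory.h>

#define EXT_LEN (2 * 1024 * 1024)

int
register_external_area(void)
{
	size_t page_sz = 4096; /* assumed system page size */
	void *va = mmap(NULL, EXT_LEN, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (va == MAP_FAILED)
		return -1;

	/* iova_addrs == NULL: IOVAs unknown (assumption, see API docs) */
	if (rte_extmem_register(va, EXT_LEN, NULL, 0, page_sz) != 0) {
		munmap(va, EXT_LEN);
		return -1;
	}
	/* ... rte_mem_virt2memseg(va, NULL) now works on this area ... */
	rte_extmem_unregister(va, EXT_LEN);
	munmap(va, EXT_LEN);
	return 0;
}
```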

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
2018-12-20 18:14:55 +01:00
Alejandro Lucero
ee0e074f81 mem: fix DMA mask width sanity check
Current code has different max DMA mask width values for 32- and
64-bit systems. IOMMU hardware could report a higher supported width
than the current MAX_DMA_MASK_BITS when RTE_ARCH_64 is not defined.
This is actually the case with a 32-bit kernel running on a 64-bit
server with IOMMU hardware. It could also be a problem on embedded
systems using an IOMMU designed for 64 bits in a 32-bit system.

This patch leaves a single max DMA mask width, which ensures the
mask width is within the range of the 64-bit variables used for
DMA masks. It also avoids wrong values, because any width higher
than 64 bits is likely wrong.

Fixes: 223b7f1d5e ("mem: add function for checking memseg IOVA")

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-07 14:42:28 +01:00
Alejandro Lucero
84e7477e10 mem: add thread unsafe version for DMA mask check
During memory initialization, calling rte_mem_check_dma_mask
leads to a deadlock because memory_hotplug_lock is locked by a
writer (the code currently executing), and rte_memseg_walk
tries to take it as a reader.

This patch adds a thread-unsafe version which calls the final
function specifying that memory_hotplug_lock does not need to be
acquired. The patch also turns rte_mem_check_dma_mask into an
intermediate step which calls the final function as before,
implying memory_hotplug_lock will be acquired.

PMDs should always use the version that acquires the lock; the
thread-unsafe one is just for internal EAL memory code.
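
A sketch contrasting the two flavors; the 40-bit device limit and wrapper names are hypothetical:

```
#include <rte_memory.h>

#define MY_PMD_DMA_BITS 40 /* hypothetical device limit */

int
my_pmd_check_dma(void)
{
	/* PMD context: the hotplug lock is taken internally */
	if (rte_mem_check_dma_mask(MY_PMD_DMA_BITS) != 0)
		return -1; /* some memseg IOVA exceeds 40-bit addressing */
	return 0;
}

int
eal_internal_check(void)
{
	/* EAL init context: the hotplug lock is already held by a writer */
	return rte_mem_check_dma_mask_thread_unsafe(MY_PMD_DMA_BITS);
}
```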

Fixes: 223b7f1d5e ("mem: add function for checking memseg IOVA")

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-05 01:02:14 +01:00
Alejandro Lucero
9d15773606 mem: add function for setting DMA mask
This patch adds the possibility of setting a DMA mask to be used
once the memory initialization is done.

This is currently needed when IOVA mode is set by PCI-related
code and an x86 IOMMU hardware unit is present. Current code calls
rte_mem_check_dma_mask, but it is wrong to do so at that point
because the memory has not been initialized yet.
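
A short sketch of the intended usage; the 39-bit width is an illustrative x86 IOMMU value, not a fixed constant:

```
#include <rte_memory.h>

int
bus_set_iommu_limit(void)
{
	/* recorded now, checked by EAL once memory init completes */
	return rte_mem_set_dma_mask(39); /* e.g. an x86 IOMMU VA width */
}
```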

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-05 01:02:04 +01:00
Alejandro Lucero
0de9eb6138 mem: rename DMA mask check with proper prefix
The current name rte_eal_check_dma_mask does not follow the naming
used in the rest of the file.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-05 01:01:54 +01:00
Alejandro Lucero
1df2170287 mem: use address hint for mapping hugepages
The Linux kernel uses a very high address as the starting address
for serving mmap calls. If there are addressing limitations and
IOVA mode is VA, this starting address is likely too high for
those devices. However, it is possible to use a lower address in
the process virtual address space, as 64-bit systems have plenty
of available space.

This patch adds an address hint as the starting address for 64-bit
systems and increments the hint on subsequent invocations. If the
mmap call does not return the hint address, the mmap call is
repeated using the hint address incremented by the page size.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-10-28 22:06:05 +01:00
Alejandro Lucero
223b7f1d5e mem: add function for checking memseg IOVA
A device can suffer from addressing limitations. This function
checks that memsegs have IOVAs within the range supported by a
given DMA mask.

PMDs should use this function during initialization if the device
has addressing limitations, and return an error if memsegs are
reported out of range.

Another usage is for emulated IOMMU hardware with addressing
limitations.

It is necessary to save the most restrictive DMA mask, so that
memory allocated dynamically after initialization can be checked
as well.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-10-28 22:04:34 +01:00
Anatoly Burakov
3717943819 mem: improve musl compatibility
When built against musl, fcntl.h is not implicitly included.
Fix by including it explicitly.

Bugzilla ID: 31

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-10-22 11:29:37 +02:00
Anatoly Burakov
5282bb1c36 mem: allow memseg lists to be marked as external
When we allocate and use DPDK memory, we need to be able to
differentiate between DPDK hugepage segments and segments that
were made part of DPDK but are externally allocated. Add such
a property to memseg lists.

This breaks the ABI, so document the change in the release notes.
This also breaks a few internal assumptions about memory
contiguity, so adjust the malloc code in a few places.

All current calls to memseg walk functions were adjusted to
ignore external segments where it made sense.

Mempools are a special case, because we may be asked to allocate
a mempool on a specific socket, and we need to ignore all page
sizes on other heaps or other sockets. Previously, this
assumption of knowing all page sizes was not a problem, but it
will be now, so we have to match socket ID with page size when
calculating the minimum page size for a mempool.
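
A sketch of a walk honoring the new flag; the `external` field name follows this commit, and the helper is hypothetical:

```
#include <stdint.h>
#include <rte_memory.h>

static int
min_internal_pgsz(const struct rte_memseg_list *msl, void *arg)
{
	uint64_t *min = arg;

	if (msl->external)
		return 0; /* skip externally allocated lists */
	if (*min == 0 || msl->page_sz < *min)
		*min = msl->page_sz;
	return 0; /* zero keeps walking */
}

uint64_t
min_internal_page_size(void)
{
	uint64_t min = 0;

	rte_memseg_list_walk(min_internal_pgsz, &min);
	return min;
}
```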

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
2018-10-11 10:24:29 +02:00
Anatoly Burakov
4104b2a485 mem: add length to memseg list
Previously, to calculate the length of the memory area covered by a
memseg list, we would have needed to multiply the page size by the
length of the fbarray backing that memseg list. This is not obvious
and unnecessarily low level, so store the length in the memseg list
itself.

This breaks the ABI, so bump the EAL ABI version and document the
change. Also, while we're breaking the ABI, pack the members a
little better.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
2018-10-11 10:24:16 +02:00
Anatoly Burakov
3a44687139 mem: allow querying offset into segment fd
In a few cases, the user may need to query the offset into the fd
for a particular memory segment (for example, to selectively map
pages). This commit adds a new API to do that.
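
A sketch combining the fd and offset APIs inside a memseg walk callback; error handling is trimmed and the helper is hypothetical:

```
#include <sys/mman.h>
#include <unistd.h>
#include <rte_memory.h>

static int
remap_one(const struct rte_memseg_list *msl,
	  const struct rte_memseg *ms, void *arg)
{
	size_t offset;
	int fd = rte_memseg_get_fd(ms);

	(void)msl; (void)arg;
	if (fd < 0)
		return 0; /* no fd for this segment (e.g. external memory) */
	if (rte_memseg_get_fd_offset(ms, &offset) == 0) {
		void *copy = mmap(NULL, ms->len, PROT_READ, MAP_SHARED,
				  fd, (off_t)offset);
		if (copy != MAP_FAILED)
			munmap(copy, ms->len);
	}
	close(fd); /* the fd is a duplicate (see the commit below) */
	return 0;
}
/* usage: rte_memseg_walk(remap_one, NULL); */
```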

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 15:01:58 +02:00
Anatoly Burakov
41dbdb6872 mem: add external API to retrieve page fd
Now that we can retrieve page fds internally, we can expose this
as an external API. This adds two flavors of the API: thread-safe
and non-thread-safe. Fix up internal APIs to return the values we
need without modifying rte_errno internally when called from
within EAL.

We do not want calling code to accidentally close an internal fd, so
we make a duplicate of it before we return it to the user. The caller
is therefore responsible for closing this fd.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:48:04 +02:00
Anatoly Burakov
1009ba1704 mem: add internal API to get and set segment fd
Enable setting and retrieving segment fds internally.

For now, retrieving fds will not be used anywhere until we
get an external API, but it will be useful for things like
virtio, where we wish to share segment fds.

Setting segment fds will not be available as a public API
at this time, but internally it is needed for legacy mode,
because we're not allocating our hugepages in memalloc in
the legacy mode case, and we still need to store the fd.

Another user of the get segment fd API is the memseg info dump,
to show which pages use which fds.

Not supported on FreeBSD.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:46:34 +02:00
Anatoly Burakov
d5dd22c9f6 mem: fix alignment of requested virtual areas
The original code did not align any addresses that were requested as
page-aligned, but were different because addr_is_hint was set.

The fix by Dariusz below introduced an issue where all unaligned
addresses were left unaligned.

This patch is a partial revert of
commit 7fa7216ed4 ("mem: fix alignment of requested virtual areas")

and implements a proper fix for this issue, by asking for alignment in all
but the following two cases:

1) page size is equal to system page size, or
2) we got an aligned requested address, and will not accept a different one

This ensures that alignment is performed in all cases, except for
those where we can guarantee that the address will not need alignment.

Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Fixes: 7fa7216ed4 ("mem: fix alignment of requested virtual areas")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
Acked-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
2018-07-18 23:22:33 +02:00
Anatoly Burakov
e26415428f mem: provide thread-unsafe memseg list walk variant
Sometimes, user code needs to walk the memseg list while inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depend on DPDK internals, provide an
official way to do memseg_list_walk() inside callbacks.

Also, remove existing reimplementation from memalloc code and use
the new API instead.
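
A sketch of the intended usage pattern; the callback bodies are placeholders:

```
#include <rte_memory.h>

static int
dump_list(const struct rte_memseg_list *msl, void *arg)
{
	(void)msl; (void)arg;
	return 0; /* inspect the list here */
}

static void
mem_event_cb(enum rte_mem_event event, const void *addr,
	     size_t len, void *arg)
{
	(void)event; (void)addr; (void)len; (void)arg;
	/* the locking variant would deadlock here: use thread-unsafe */
	rte_memseg_list_walk_thread_unsafe(dump_list, NULL);
}
```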

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:21:25 +02:00
Anatoly Burakov
7c790af08f mem: provide thread-unsafe memseg walk variant
Sometimes, user code needs to walk the memseg list while inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depend on DPDK internals, provide an
official way to do memseg_walk() inside callbacks.

Also, remove existing reimplementation from sPAPR VFIO code and use
the new API instead.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:21:15 +02:00
Anatoly Burakov
b917147601 mem: provide thread-unsafe contig walk variant
Sometimes, user code needs to walk the memseg list while inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depend on DPDK internals, provide an
official way to do memseg_contig_walk() inside callbacks.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:20:06 +02:00
Anatoly Burakov
1d406458db mem: make segment preallocation OS-specific
In a perfect world, it wouldn't matter how much memory was
preallocated, because most of it was always going to be private
anonymous zero-page mappings for the duration of the program.
In practice, however, due to peculiarities of FreeBSD, we need
to additionally limit memory allocation there. This patch moves
the segment preallocation to EAL private functions that will be
implemented by an OS-specific EAL rather than being in the common
memory-related code.

Since there is no support for growing/shrinking memory use at
runtime on FreeBSD anyway, this does not inhibit any functionality
but makes core dumps faster even on default settings.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:59:18 +02:00
Dariusz Stojaczyk
6c0fb7547b mem: do not use --base-virtaddr in secondary processes
Since a secondary process's address space is largely dictated
by the primary process's mappings, it doesn't make much
sense to use base-virtaddr for secondary processes.

This patch is intended to fix PCI resource mapping
in secondary processes using the same base-virtaddr
as their primary processes. PCI uses the end of the hugepage
memory area to map all resources. [pci_find_max_end_va()]
It works for primary processes, but can't be mapped 1:1
by secondary ones, as the same addresses are currently always
occupied by shadow memseg lists, which were created with
eal_get_virtual_area(NULL, ...).

```
PRIMARY PROCESS
0x6e00e00000    388K rw-s- fbarray_memseg-2048k-1-3
0x6e01000000 16777216K r----   [ anon ]
0x7201000000     16K rw-s- resource0

SECONDARY PROCESS
0x6e00e00000    388K rw-s- fbarray_memseg-2048k-1-3
0x6e01000000 16777216K r----   [ anon ]
0x7201000000      4K rw-s- fbarray_memseg-1048576k-0-0_203213
```

Fixes: 524e43c2ad ("mem: prepare memseg lists for multiprocess sync")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:13 +02:00
Dariusz Stojaczyk
9dac150f98 mem: fix alignment requested with --base-virtaddr
Whenever a calculated base-virtaddr offset had to be
manually aligned to the requested page_sz, we did not take
that alignment into account when incrementing the base-virtaddr
offset further. The next requested virtual area could print
a warning "hint [...] not respected!" and let the system
pick an address instead. As a result, this breaks secondary
process support on many system configurations.

Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:10 +02:00
Dariusz Stojaczyk
7fa7216ed4 mem: fix alignment of requested virtual areas
Although the alignment mechanism works as intended, the
`no_align` bool flag was set incorrectly. We were aligning
buffers that didn't need extra alignment, and weren't
aligning ones that really needed it.

Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:09 +02:00
Dariusz Stojaczyk
09037cf36c mem: avoid crash on memseg query with invalid address
When called with an address that is not managed by DPDK,
the lookup would segfault due to a missing check. The doc
says this function returns either a pointer or NULL, so
let it do so.

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:08 +02:00
Anatoly Burakov
d2bd796d7b mem: fix potential underflow on mem size calculation
If total memory is already bigger than max memory, an underflow
will occur on subtraction. Fix it by simply stopping whenever we
already have an amount of memory bigger than the maximum.

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-05-14 03:15:31 +02:00
Anatoly Burakov
68c3603867 mem: unmap unneeded space
When we ask to reserve virtual areas, we usually include
alignment in the mapping size, and that memory ends up
being wasted. Wasting a gigabyte of VA space while trying to
reserve one gigabyte is pretty expensive on 32-bit, so after
we're done mapping, unmap unneeded space.
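
A hypothetical sketch of the reserve-then-trim technique described here (not DPDK's actual eal_get_virtual_area code); align is assumed to be a nonzero power of two:

```
#include <stdint.h>
#include <sys/mman.h>

static void *
reserve_aligned(size_t size, size_t align)
{
	uintptr_t base, start;
	void *va = mmap(NULL, size + align, PROT_NONE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (va == MAP_FAILED)
		return NULL;
	base = (uintptr_t)va;
	start = (base + align - 1) & ~((uintptr_t)align - 1);

	/* return the unneeded head and tail to the OS */
	if (start > base)
		munmap(va, start - base);
	if (base + align > start)
		munmap((void *)(start + size), base + align - start);
	return (void *)start;
}
```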

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-05-14 01:32:38 +02:00
Anatoly Burakov
91fe57ac00 mem: check if allocation size is too big
The mapping size is a 64-bit integer, but mmap() accepts a size_t
for the mapping size. A user could request a mapping with an
alignment such that (size + alignment) would overflow size_t, so
check whether (size + alignment) overflows size_t.
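
A minimal sketch of the overflow guard, assuming the incoming values are 64-bit:

```
#include <stdint.h>

static int
map_size_ok(uint64_t size, uint64_t align)
{
	/* both must fit in size_t, and their sum must not wrap */
	return size <= SIZE_MAX && align <= SIZE_MAX - (size_t)size;
}
```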

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-05-14 01:32:21 +02:00
Anatoly Burakov
0256386dc4 mem: add argument to memory event callback
It may be useful to pass arbitrary data to the callback (such
as device pointers), so add this to the mem event callback API.
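
A sketch of the extended registration, with a hypothetical device structure as the user argument:

```
#include <rte_memory.h>

struct my_dev { int id; };

static void
dev_mem_event(enum rte_mem_event event, const void *addr,
	      size_t len, void *arg)
{
	struct my_dev *dev = arg; /* arbitrary data from registration */

	(void)addr; (void)len; (void)dev;
	if (event == RTE_MEM_EVENT_ALLOC) {
		/* e.g. create DMA mappings for dev->id */
	} else { /* RTE_MEM_EVENT_FREE */
		/* e.g. tear down DMA mappings */
	}
}

int
attach_dev(struct my_dev *dev)
{
	return rte_mem_event_callback_register("my-dev", dev_mem_event, dev);
}
```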

Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-05-08 22:28:58 +02:00
Anatoly Burakov
046aa5c447 mem: add memalloc init stage
Currently, memseg lists for secondary processes are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
1be7644986 mem: improve autodetection of hugepage counts on 32-bit
For non-legacy mode, we are preallocating space for hugepages, so
we know in advance which pages we will be able to allocate, and
which we won't. However, the init procedure was using hugepage
counts gathered from sysfs and paid no attention to hugepage
sizes that were actually available for reservation, and failed
on attempts to reserve unavailable pages.

Fix this by limiting total page counts by number of pages
actually preallocated.

Also, the VA preallocation procedure only looks at mountpoints that
are available, and expects pages to exist if a mountpoint exists.
That might not necessarily be the case, so also check whether there
are hugepages available for a particular page size on a particular
NUMA node.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
e82ca1a75e mem: improve preallocation on 32-bit
Previously, if we couldn't preallocate VA space on 32-bit for
one page size, we simply bailed out, even though we could've
tried allocating VA space with other page sizes.

For example, if the user had both 1G and 2M pages enabled and
had asked DPDK to allocate memory on both sockets, DPDK
would have tried to allocate VA space for 1x1G page on both
sockets, failed and never tried again, even though it
could have allocated the same 1G of VA space for 512x2M pages.

Fix this by retrying with different page sizes if VA space
reservation failed.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
a99e8df63f mem: fix 32-bit memory upper limit for non-legacy mode
32-bit mode has an upper limit on amount of VA space it can preallocate,
but the original implementation used the wrong constant, resulting in
failure to initialize due to integer overflow. Fix it by using the
correct constant.

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
8f91f368a1 mem: log page address before unmapping
If the user has specified a flag to unmap the area right after
mapping it, we were passing an already-unmapped pointer to RTE_LOG.
This is not an issue, since RTE_LOG doesn't actually dereference the
pointer, but fix it anyway by moving the RTE_LOG call before the unmap.

Coverity issue: 272584
Fixes: b7cc54187e ("mem: move virtual area function in common directory")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:42:40 +02:00
Anatoly Burakov
2e378ff297 mem: add validator callback
This API enables applications to register for notifications
on page allocations that are about to happen, giving the application
a chance to allow or deny the allocation when the resulting total
memory utilization would exceed a specified limit on a specified
socket.
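
A sketch of a validator under the assumption that returning non-zero denies the allocation; the cap and names are hypothetical:

```
#include <rte_memory.h>

#define MEM_LIMIT (1ULL << 30) /* hypothetical 1G cap */

static int
deny_above_limit(int socket_id, size_t cur_limit, size_t new_len)
{
	(void)cur_limit;
	if (socket_id == 0 && new_len > MEM_LIMIT)
		return -1; /* deny: would exceed our cap */
	return 0; /* allow */
}

int
install_validator(void)
{
	return rte_mem_alloc_validator_register("cap-socket0",
						deny_above_limit,
						0, MEM_LIMIT);
}
```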

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:56 +02:00
Anatoly Burakov
56efb4c117 malloc: support callbacks on memory events
Each process will have its own callbacks. Callbacks indicate
whether it is an allocation or a deallocation that has happened,
and also provide the start VA address and length of the affected
block.

Since memory hotplug isn't supported on FreeBSD or in legacy mem
mode, it will not be possible to register callbacks in either case.

Callbacks are called whenever something happens to the memory map of
the current process, so at those times the memory hotplug subsystem
is write-locked, which leads to deadlocks on any attempt to use
memory functions that take the same lock from within a callback.
Document the limitation.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:55 +02:00
Anatoly Burakov
07dcbfe010 malloc: support multiprocess memory hotplug
This enables multiprocess synchronization for memory hotplug
requests at runtime (as opposed to initialization).

Basic workflow is the following. Primary process always does initial
mapping and unmapping, and secondary processes always follow primary
page map. Only one allocation request can be active at any one time.

When the primary allocates memory, it ensures that all other
processes have allocated the same set of hugepages successfully;
otherwise, any allocations made are rolled back and the heap is
freed. The heap is locked throughout the process, and there is
also a global memory hotplug lock, so no race conditions can
happen.

When the primary frees memory, it frees the heap, deallocates
affected pages, and notifies other processes of the deallocations.
Since the heap is freed from that memory chunk, the area basically
becomes invisible to other processes even if they happen to fail
to unmap that specific set of pages, so it's completely safe to
ignore the results of sync requests.

When a secondary allocates memory, it does not do so by itself.
Instead, it sends a request to the primary process to try to
allocate pages of the specified size and on the specified socket,
such that a specified heap allocation request could complete. The
primary process then sends all secondaries (including the
requestor) a separate notification of allocated pages, and expects
all secondary processes to report success before considering the
pages "allocated".

Only after the primary process ensures that all memory has been
successfully allocated in all secondary processes does it respond
positively to the initial request and let the secondary proceed
with the allocation. Since the heap now has memory that can satisfy
the allocation request, and it was locked all this time (so no
other allocations could take place), the secondary process will be
able to allocate memory from the heap.

When a secondary frees memory, it hides the pages to be deallocated
from the heap. Then, it sends a deallocation request to the primary
process, so that it deallocates the pages itself, and then sends a
separate sync request to all other processes (including the
requestor) to unmap the same pages. This way, even if the secondary
fails to notify other processes of this deallocation, that memory
will become invisible to other processes and will not be allocated
from again.

So, to summarize: address space will only become part of the heap
if the primary process can ensure that all other processes have
allocated this memory successfully. If anything goes wrong, the
worst that can happen is that a page will "leak" and be available
to neither DPDK nor the system, as some process will still hold
onto it. It's not an actual leak, as we can account for the page;
it's just that none of the processes will be able to use this page
for anything useful until the primary allocates from it again.

Because the underlying DPDK IPC implementation is single-threaded,
some asynchronous magic had to be done, as we need to complete
several requests before we can definitively allow a secondary
process to use allocated memory (namely, the memory has to be
present in all other secondary processes before it can be used).
Additionally, only one allocation request is allowed to be
submitted at once.

Memory allocation requests are only allowed when there are no
secondary processes currently initializing. To enforce that, a
shared rwlock is used: it is taken for reading on init (so that
several secondaries can initialize concurrently) and for writing
when making allocation requests (so that either secondary init has
to wait, or the allocation request has to wait until all processes
have initialized).

Any other function that wishes to iterate over memory or prevent
allocations should be using memory hotplug lock.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:55 +02:00
Anatoly Burakov
6167d81488 mem: add secondary process init with memory hotplug
Secondary initialization will just sync memory map with
primary process.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:55 +02:00
Anatoly Burakov
66cc45e293 mem: replace memseg with memseg lists
Before, we were aggregating multiple pages into one memseg, so the
number of memsegs was small. Now, each page gets its own memseg,
so the list of memsegs is huge. To accommodate the new memseg list
size and to keep the under-the-hood workings sane, the memseg list
is now not just a single list, but multiple lists. To be precise,
each hugepage size available on the system gets one or more memseg
lists, per socket.

In order to support dynamic memory allocation, we reserve all
memory in advance (unless we're in 32-bit legacy mode, in which
case we do not preallocate memory). As in, we do an anonymous
mmap() of the entire maximum size of memory per hugepage size, per
socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or
RTE_MAX_MEM_MB_PER_TYPE megabytes worth of memory, whichever is the
smaller one), split over multiple lists (which are limited to
either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_MB_PER_LIST
megabytes per list, whichever is the smaller one). There is also
a global limit of CONFIG_RTE_MAX_MEM_MB megabytes, which is mainly
used for 32-bit targets to limit amounts of preallocated memory,
but can be used to place an upper limit on total amount of VA
memory that can be allocated by DPDK application.

So, for each hugepage size, we get (by default) up to 128G worth
of memory, per socket, split into chunks of up to 32G in size.
The address space is claimed at the start, in eal_common_memory.c.
The actual page allocation code is in eal_memalloc.c (Linux-only),
and largely consists of copied EAL memory init code.

Pages in the list are also indexed by address. That is, in order
to figure out where the page belongs, one can simply look at base
address for a memseg list. Similarly, figuring out IOVA address
of a memzone is a matter of finding the right memseg list, getting
offset and dividing by page size to get the appropriate memseg.

This commit also removes rte_eal_dump_physmem_layout() call,
according to deprecation notice [1], and removes that deprecation
notice as well.

On 32-bit targets, due to limited VA space, DPDK will no longer
spread memory across different sockets like before. Instead, it
will (by default) allocate all of the memory on the socket where
the master lcore is. To override this behavior, --socket-mem must
be used.

The rest of the changes are really ripple effects from the memseg
change - heap changes, compile fixes, and rewrites to support
fbarray-backed memseg lists. Due to earlier switch to _walk()
functions, most of the changes are simple fixes, however some
of the _walk() calls were switched to memseg list walk, where
it made sense to do so.

Additionally, we are switching locks from flock() to fcntl().
Down the line, we will be introducing single-file segments option,
and we cannot use flock() locks to lock parts of the file. Therefore,
we will use fcntl() locks for legacy mem as well, in case someone is
unfortunate enough to accidentally start legacy mem primary process
alongside an already working non-legacy mem-based primary process.

[1] http://dpdk.org/dev/patchwork/patch/34002/

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:55:39 +02:00
Anatoly Burakov
f901e64d21 mem: add virt2memseg function
This can be used as a virt2iova function that only looks up
memory owned by DPDK (as opposed to doing pagemap walks). Using
this will result in less dependency on the internals of the mem API.
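
A sketch of a virt2iova helper built on the new lookup; the helper name is hypothetical, and a NULL list argument is assumed to mean "search all memseg lists":

```
#include <stdint.h>
#include <rte_memory.h>

rte_iova_t
my_virt2iova(const void *addr)
{
	const struct rte_memseg *ms =
		rte_mem_virt2memseg(addr, NULL);

	if (ms == NULL)
		return RTE_BAD_IOVA; /* not DPDK-owned memory */
	return ms->iova + ((uintptr_t)addr - (uintptr_t)ms->addr);
}
```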

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:54:44 +02:00
Anatoly Burakov
eca28edd98 mem: add iova2virt function
This is a reverse lookup of PA to VA. Using this will make
other code less dependent on the internals of the mem API.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:54:00 +02:00
Anatoly Burakov
552afc420a mem: add contig walk function
This function is meant to walk over the first segment of each
VA-contiguous group of memsegs.

For future users of this function, this is done so that there is
less dependency on the internals of the mem API and less noise in
later change sets.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:53:38 +02:00
Anatoly Burakov
221b67bca0 eal: use memseg walk instead of iteration
Reduce dependency on internal details of EAL memory subsystem, and
simplify code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:48:15 +02:00
Anatoly Burakov
2b9f98d8a5 mem: add function to walk all memsegs
For code that might need to iterate over the list of allocated
segments, using this API will make it more resilient to internal
API changes and will prevent copying the same iteration code over
and over again.

Additionally, down the line locking will be implemented, so users
of this API will not need to care about locking either.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:47:25 +02:00
Anatoly Burakov
b7cc54187e mem: move virtual area function in common directory
Move get_virtual_area out of linuxapp EAL memory and make it
common to EAL, so that other code can reserve virtual areas
as well.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:33:06 +02:00