numam-dpdk/lib/librte_eal/common/eal_common_memory.c

/* SPDX-License-Identifier: BSD-3-Clause
* Copyright(c) 2010-2014 Intel Corporation
*/
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/queue.h>
#include <rte_fbarray.h>
#include <rte_memory.h>
#include <rte_eal.h>
#include <rte_eal_memconfig.h>
#include <rte_errno.h>
#include <rte_log.h>
#include "eal_memalloc.h"
#include "eal_private.h"
#include "eal_internal_cfg.h"
#include "malloc_heap.h"
/*
 * Try to reserve *size bytes of virtual address space with an anonymous
 * mapping. If it is successful, return the pointer to the mmap'd area
 * and keep *size unmodified. Otherwise, if shrinking is allowed, retry
 * with a smaller zone: decrease *size by page_sz until it reaches 0, in
 * which case return NULL. Note: this function returns an address which
 * is a multiple of the requested page size.
 */
#define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
static void *next_baseaddr;
static uint64_t system_page_sz;
#ifdef RTE_ARCH_64
/*
 * The Linux kernel uses a really high address as the starting address for
 * serving mmap calls. If there are addressing limitations and IOVA mode is
 * VA, this starting address is likely too high for those devices. However,
 * it is possible to use a lower address in the process virtual address
 * space, as with 64 bits there is a lot of available space.
 *
 * Currently known limitations are 39 or 40 bits. Setting the starting
 * address at 4GB implies there are 508GB or 1020GB for mapping the
 * available hugepages. This is likely enough for most systems, although
 * a device with addressing limitations should call rte_mem_check_dma_mask
 * to ensure all memory is within the supported range.
 */
static uint64_t baseaddr = 0x100000000;
#endif
#define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
void *
eal_get_virtual_area(void *requested_addr, size_t *size,
		size_t page_sz, int flags, int mmap_flags)
{
	bool addr_is_hint, allow_shrink, unmap, no_align;
	uint64_t map_sz;
	void *mapped_addr, *aligned_addr;
	uint8_t try = 0;

	if (system_page_sz == 0)
		system_page_sz = sysconf(_SC_PAGESIZE);

	mmap_flags |= MAP_PRIVATE | MAP_ANONYMOUS;

	RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size);

	addr_is_hint = (flags & EAL_VIRTUAL_AREA_ADDR_IS_HINT) > 0;
	allow_shrink = (flags & EAL_VIRTUAL_AREA_ALLOW_SHRINK) > 0;
	unmap = (flags & EAL_VIRTUAL_AREA_UNMAP) > 0;

	if (next_baseaddr == NULL && internal_config.base_virtaddr != 0 &&
			rte_eal_process_type() == RTE_PROC_PRIMARY)
		next_baseaddr = (void *) internal_config.base_virtaddr;

#ifdef RTE_ARCH_64
	if (next_baseaddr == NULL && internal_config.base_virtaddr == 0 &&
			rte_eal_process_type() == RTE_PROC_PRIMARY)
		next_baseaddr = (void *) baseaddr;
#endif
	if (requested_addr == NULL && next_baseaddr != NULL) {
		requested_addr = next_baseaddr;
		requested_addr = RTE_PTR_ALIGN(requested_addr, page_sz);
		addr_is_hint = true;
	}
	/* we don't need alignment of resulting pointer in the following cases:
	 *
	 * 1. page size is equal to system page size
	 * 2. we have a requested address, and it is page-aligned, and we will
	 *    be discarding the address if we get a different one.
	 *
	 * for all other cases, alignment is potentially necessary.
	 */
	no_align = (requested_addr != NULL &&
		requested_addr == RTE_PTR_ALIGN(requested_addr, page_sz) &&
		!addr_is_hint) ||
		page_sz == system_page_sz;

	do {
		map_sz = no_align ? *size : *size + page_sz;
		if (map_sz > SIZE_MAX) {
			RTE_LOG(ERR, EAL, "Map size too big\n");
			rte_errno = E2BIG;
			return NULL;
		}

		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
				mmap_flags, -1, 0);
		if (mapped_addr == MAP_FAILED && allow_shrink)
			*size -= page_sz;

		if (mapped_addr != MAP_FAILED && addr_is_hint &&
				mapped_addr != requested_addr) {
			try++;
			next_baseaddr = RTE_PTR_ADD(next_baseaddr, page_sz);
			if (try <= MAX_MMAP_WITH_DEFINED_ADDR_TRIES) {
				/* hint was not used. Try with another offset */
				munmap(mapped_addr, map_sz);
				mapped_addr = MAP_FAILED;
				requested_addr = next_baseaddr;
			}
		}
	} while ((allow_shrink || addr_is_hint) &&
		mapped_addr == MAP_FAILED && *size > 0);

	/* align resulting address - if map failed, we will ignore the value
	 * anyway, so no need to add additional checks.
	 */
	aligned_addr = no_align ? mapped_addr :
			RTE_PTR_ALIGN(mapped_addr, page_sz);

	if (*size == 0) {
		RTE_LOG(ERR, EAL, "Cannot get a virtual area of any size: %s\n",
			strerror(errno));
		rte_errno = errno;
		return NULL;
	} else if (mapped_addr == MAP_FAILED) {
		RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n",
			strerror(errno));
		/* pass errno up the call chain */
		rte_errno = errno;
		return NULL;
	} else if (requested_addr != NULL && !addr_is_hint &&
			aligned_addr != requested_addr) {
		RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p (got %p)\n",
			requested_addr, aligned_addr);
		munmap(mapped_addr, map_sz);
		rte_errno = EADDRNOTAVAIL;
		return NULL;
	} else if (requested_addr != NULL && addr_is_hint &&
			aligned_addr != requested_addr) {
		RTE_LOG(WARNING, EAL, "WARNING! Base virtual address hint (%p != %p) not respected!\n",
			requested_addr, aligned_addr);
		RTE_LOG(WARNING, EAL, " This may cause issues with mapping memory into secondary processes\n");
	} else if (next_baseaddr != NULL) {
		next_baseaddr = RTE_PTR_ADD(aligned_addr, *size);
	}

	RTE_LOG(DEBUG, EAL, "Virtual area found at %p (size = 0x%zx)\n",
		aligned_addr, *size);

	if (unmap) {
		munmap(mapped_addr, map_sz);
	} else if (!no_align) {
		void *map_end, *aligned_end;
		size_t before_len, after_len;

		/* when we reserve space with alignment, we add alignment to
		 * mapping size. On 32-bit, if 1GB alignment was requested, this
		 * would waste 1GB of address space, which is a luxury we cannot
		 * afford. so, if alignment was performed, check if any unneeded
		 * address space can be unmapped back.
		 */
		map_end = RTE_PTR_ADD(mapped_addr, (size_t)map_sz);
		aligned_end = RTE_PTR_ADD(aligned_addr, *size);

		/* unmap space before aligned mmap address */
		before_len = RTE_PTR_DIFF(aligned_addr, mapped_addr);
		if (before_len > 0)
			munmap(mapped_addr, before_len);

		/* unmap space after aligned end mmap address */
		after_len = RTE_PTR_DIFF(map_end, aligned_end);
		if (after_len > 0)
			munmap(aligned_end, after_len);
	}

	return aligned_addr;
}
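
/*
 * Illustrative sketch of a typical caller: reserving page-aligned VA space
 * for later use, letting EAL pick the address and dropping the mapping
 * right away. The flag combination and size values here are hypothetical,
 * not taken from an actual call site in this file.
 *
 *	size_t mem_sz = RTE_PGSIZE_1G;   (hypothetical reservation size)
 *	void *va = eal_get_virtual_area(NULL, &mem_sz, RTE_PGSIZE_2M,
 *			EAL_VIRTUAL_AREA_ALLOW_SHRINK |
 *			EAL_VIRTUAL_AREA_UNMAP, 0);
 *	if (va == NULL)
 *		rte_panic("cannot reserve VA space\n");
 *
 * With EAL_VIRTUAL_AREA_ALLOW_SHRINK set, mem_sz may come back smaller
 * than requested; with EAL_VIRTUAL_AREA_UNMAP set, only the address is
 * returned and the reservation itself is released.
 */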
static struct rte_memseg *
virt2memseg(const void *addr, const struct rte_memseg_list *msl)
{
	const struct rte_fbarray *arr;
	void *start, *end;
	int ms_idx;

	if (msl == NULL)
		return NULL;
	/* a memseg list was specified, check if it's the right one */
	start = msl->base_va;
	end = RTE_PTR_ADD(start, msl->len);
	if (addr < start || addr >= end)
		return NULL;

	/* now, calculate index */
	arr = &msl->memseg_arr;
	ms_idx = RTE_PTR_DIFF(addr, msl->base_va) / msl->page_sz;
	return rte_fbarray_get(arr, ms_idx);
}
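
/*
 * Worked example for the index math above (numbers are illustrative only):
 * with msl->base_va == 0x100000000 and msl->page_sz == 2 MiB, an address of
 * 0x100600000 lies 6 MiB past the base, so ms_idx = 6 MiB / 2 MiB = 3, i.e.
 * the fourth memseg in this list's fbarray.
 */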
static struct rte_memseg_list *
virt2memseg_list(const void *addr)
{
	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
	struct rte_memseg_list *msl;
	int msl_idx;

	for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) {
		void *start, *end;
		msl = &mcfg->memsegs[msl_idx];

		start = msl->base_va;
		end = RTE_PTR_ADD(start, msl->len);
		if (addr >= start && addr < end)
			break;
	}
	/* if we didn't find our memseg list */
	if (msl_idx == RTE_MAX_MEMSEG_LISTS)
		return NULL;
	return msl;
}
__rte_experimental struct rte_memseg_list *
rte_mem_virt2memseg_list(const void *addr)
{
	return virt2memseg_list(addr);
}
struct virtiova {
	rte_iova_t iova;
	void *virt;
};
static int
find_virt(const struct rte_memseg_list *msl __rte_unused,
		const struct rte_memseg *ms, void *arg)
{
	struct virtiova *vi = arg;

	if (vi->iova >= ms->iova && vi->iova < (ms->iova + ms->len)) {
		size_t offset = vi->iova - ms->iova;
		vi->virt = RTE_PTR_ADD(ms->addr, offset);
		/* stop the walk */
		return 1;
	}
	return 0;
}
static int
find_virt_legacy(const struct rte_memseg_list *msl __rte_unused,
		const struct rte_memseg *ms, size_t len, void *arg)
{
	struct virtiova *vi = arg;

	if (vi->iova >= ms->iova && vi->iova < (ms->iova + len)) {
		size_t offset = vi->iova - ms->iova;
		vi->virt = RTE_PTR_ADD(ms->addr, offset);
		/* stop the walk */
		return 1;
	}
	return 0;
}
__rte_experimental void *
rte_mem_iova2virt(rte_iova_t iova)
{
	struct virtiova vi;

	memset(&vi, 0, sizeof(vi));
	vi.iova = iova;
	/* for legacy mem, we can get away with scanning VA-contiguous segments,
	 * as we know they are PA-contiguous as well
	 */
	if (internal_config.legacy_mem)
		rte_memseg_contig_walk(find_virt_legacy, &vi);
	else
		rte_memseg_walk(find_virt, &vi);

	return vi.virt;
}
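
/*
 * Illustrative sketch (assumed usage): a driver that only has a bus address
 * can recover the corresponding VA via the walk above, where "iova" stands
 * for a hypothetical rte_iova_t value obtained from hardware:
 *
 *	void *va = rte_mem_iova2virt(iova);
 *	if (va == NULL)
 *		return;   (the IOVA does not belong to any known memseg)
 */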
__rte_experimental struct rte_memseg *
rte_mem_virt2memseg(const void *addr, const struct rte_memseg_list *msl)
{
	return virt2memseg(addr, msl != NULL ? msl :
			rte_mem_virt2memseg_list(addr));
}
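
/*
 * Illustrative sketch (assumed usage): translating a VA back to its memseg
 * and from there to its IOVA, without knowing the owning list up front.
 * "addr" is a hypothetical pointer into DPDK-managed memory.
 *
 *	const struct rte_memseg *ms = rte_mem_virt2memseg(addr, NULL);
 *	if (ms != NULL) {
 *		size_t off = RTE_PTR_DIFF(addr, ms->addr);
 *		rte_iova_t iova = ms->iova + off;
 *	}
 *
 * Passing NULL as the list makes the function look it up by address first,
 * as shown in the implementation above.
 */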
static int
physmem_size(const struct rte_memseg_list *msl, void *arg)
{
	uint64_t *total_len = arg;

	if (msl->external)
		return 0;
*total_len += msl->memseg_arr.count * msl->page_sz;
return 0;
}
/* get the total size of memory */
uint64_t
rte_eal_get_physmem_size(void)
{
uint64_t total_len = 0;
rte_memseg_list_walk(physmem_size, &total_len);
return total_len;
}
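/*
 * Illustrative usage sketch (not part of EAL): an application can query the
 * total amount of memory currently reserved by DPDK, e.g. for logging:
 *
 *	uint64_t sz = rte_eal_get_physmem_size();
 *	printf("EAL reserved %" PRIu64 " bytes of memory\n", sz);
 */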
static int
dump_memseg(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
void *arg)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int msl_idx, ms_idx, fd;
FILE *f = arg;
msl_idx = msl - mcfg->memsegs;
if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS)
return -1;
ms_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
if (ms_idx < 0)
return -1;
fd = eal_memalloc_get_seg_fd(msl_idx, ms_idx);
fprintf(f, "Segment %i-%i: IOVA:0x%"PRIx64", len:%zu, "
"virt:%p, socket_id:%"PRId32", "
"hugepage_sz:%"PRIu64", nchannel:%"PRIx32", "
"nrank:%"PRIx32" fd:%i\n",
msl_idx, ms_idx,
ms->iova,
ms->len,
ms->addr,
ms->socket_id,
ms->hugepage_sz,
ms->nchannel,
ms->nrank,
fd);
return 0;
}
/*
* Defining here because declared in rte_memory.h, but the actual implementation
* is in eal_common_memalloc.c, like all other memalloc internals.
*/
int __rte_experimental
rte_mem_event_callback_register(const char *name, rte_mem_event_callback_t clb,
void *arg)
{
/* FreeBSD boots with legacy mem enabled by default */
if (internal_config.legacy_mem) {
RTE_LOG(DEBUG, EAL, "Registering mem event callbacks not supported\n");
rte_errno = ENOTSUP;
return -1;
}
return eal_memalloc_mem_event_callback_register(name, clb, arg);
}
int __rte_experimental
rte_mem_event_callback_unregister(const char *name, void *arg)
{
/* FreeBSD boots with legacy mem enabled by default */
if (internal_config.legacy_mem) {
RTE_LOG(DEBUG, EAL, "Registering mem event callbacks not supported\n");
rte_errno = ENOTSUP;
return -1;
}
return eal_memalloc_mem_event_callback_unregister(name, arg);
}
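/*
 * Illustrative usage sketch (not part of EAL), assuming the
 * rte_mem_event_callback_t prototype from rte_memory.h
 * (event type, start address, length, user argument):
 *
 *	static void
 *	mem_event_cb(enum rte_mem_event event, const void *addr, size_t len,
 *			void *arg __rte_unused)
 *	{
 *		if (event == RTE_MEM_EVENT_ALLOC)
 *			printf("new memory at %p, len %zu\n", addr, len);
 *	}
 *
 *	rte_mem_event_callback_register("my_app", mem_event_cb, NULL);
 *
 * In legacy mem mode this registration fails with rte_errno == ENOTSUP,
 * as implemented above.
 */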
int __rte_experimental
rte_mem_alloc_validator_register(const char *name,
rte_mem_alloc_validator_t clb, int socket_id, size_t limit)
{
/* FreeBSD boots with legacy mem enabled by default */
if (internal_config.legacy_mem) {
RTE_LOG(DEBUG, EAL, "Registering mem alloc validators not supported\n");
rte_errno = ENOTSUP;
return -1;
}
return eal_memalloc_mem_alloc_validator_register(name, clb, socket_id,
limit);
}
int __rte_experimental
rte_mem_alloc_validator_unregister(const char *name, int socket_id)
{
/* FreeBSD boots with legacy mem enabled by default */
if (internal_config.legacy_mem) {
RTE_LOG(DEBUG, EAL, "Registering mem alloc validators not supported\n");
rte_errno = ENOTSUP;
return -1;
}
return eal_memalloc_mem_alloc_validator_unregister(name, socket_id);
}
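/*
 * Illustrative usage sketch (not part of EAL), assuming the
 * rte_mem_alloc_validator_t prototype from rte_memory.h
 * (socket id, current limit, prospective new total length), where a
 * negative return value rejects the allocation:
 *
 *	static int
 *	reject_over_limit(int socket_id __rte_unused, size_t cur_limit,
 *			size_t new_len)
 *	{
 *		return new_len > cur_limit ? -1 : 0;
 *	}
 *
 *	rte_mem_alloc_validator_register("my_app", reject_over_limit, 0,
 *			1ULL << 30);
 */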
/* Dump the physical memory layout on console */
void
rte_dump_physmem_layout(FILE *f)
{
rte_memseg_walk(dump_memseg, f);
}
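/*
 * Illustrative usage sketch (not part of EAL): dump the current segment
 * layout to stderr, e.g. from a debug or signal handler path:
 *
 *	rte_dump_physmem_layout(stderr);
 */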
static int
check_iova(const struct rte_memseg_list *msl __rte_unused,
const struct rte_memseg *ms, void *arg)
{
uint64_t *mask = arg;
rte_iova_t iova;
/* higher address within segment */
iova = (ms->iova + ms->len) - 1;
if (!(iova & *mask))
return 0;
RTE_LOG(DEBUG, EAL, "memseg iova %"PRIx64", len %zx, out of range\n",
ms->iova, ms->len);
RTE_LOG(DEBUG, EAL, "\tusing dma mask %"PRIx64"\n", *mask);
return 1;
}
#define MAX_DMA_MASK_BITS 63
/* check memseg iovas are within the required range based on dma mask */
static int __rte_experimental
check_dma_mask(uint8_t maskbits, bool thread_unsafe)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
uint64_t mask;
int ret;
/* Sanity check. We only check that the width can be managed with
 * 64-bit variables. Any higher value is almost certainly wrong. */
if (maskbits > MAX_DMA_MASK_BITS) {
RTE_LOG(ERR, EAL, "wrong dma mask size %u (Max: %u)\n",
maskbits, MAX_DMA_MASK_BITS);
return -1;
}
/* create dma mask */
mask = ~((1ULL << maskbits) - 1);
if (thread_unsafe)
ret = rte_memseg_walk_thread_unsafe(check_iova, &mask);
else
ret = rte_memseg_walk(check_iova, &mask);
if (ret)
/*
* Dma mask precludes hugepage usage.
* This device can not be used and we do not need to keep
* the dma mask.
*/
return 1;
/*
* we need to keep the more restricted maskbit for checking
* potential dynamic memory allocation in the future.
*/
mcfg->dma_maskbits = mcfg->dma_maskbits == 0 ? maskbits :
RTE_MIN(mcfg->dma_maskbits, maskbits);
return 0;
}
int __rte_experimental
rte_mem_check_dma_mask(uint8_t maskbits)
{
return check_dma_mask(maskbits, false);
}
int __rte_experimental
rte_mem_check_dma_mask_thread_unsafe(uint8_t maskbits)
{
return check_dma_mask(maskbits, true);
}
/*
* Set dma mask to use when memory initialization is done.
*
* This function should ONLY be used by code executed before the memory
* initialization. PMDs should use rte_mem_check_dma_mask if addressing
* limitations by the device.
*/
void __rte_experimental
rte_mem_set_dma_mask(uint8_t maskbits)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
mcfg->dma_maskbits = mcfg->dma_maskbits == 0 ? maskbits :
RTE_MIN(mcfg->dma_maskbits, maskbits);
}
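/*
 * Illustrative usage sketch (not part of EAL): a driver whose device can
 * only address, say, 40 bits of IOVA space can verify at probe time that
 * all hugepage IOVAs fall within that range (the 40-bit width is only an
 * example value):
 *
 *	if (rte_mem_check_dma_mask(40) != 0) {
 *		// some hugepage IOVAs are not reachable by this device
 *		return -1;
 *	}
 */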
/* return the number of memory channels */
unsigned rte_memory_get_nchannel(void)
{
return rte_eal_get_configuration()->mem_config->nchannel;
}
/* return the number of memory rank */
unsigned rte_memory_get_nrank(void)
{
return rte_eal_get_configuration()->mem_config->nrank;
}
static int
rte_eal_memdevice_init(void)
{
struct rte_config *config;
if (rte_eal_process_type() == RTE_PROC_SECONDARY)
return 0;
config = rte_eal_get_configuration();
config->mem_config->nchannel = internal_config.force_nchannel;
config->mem_config->nrank = internal_config.force_nrank;
return 0;
}
/* Lock page in physical memory and prevent it from being swapped. */
int
rte_mem_lock_page(const void *virt)
{
unsigned long virtual = (unsigned long)virt;
int page_size = getpagesize();
unsigned long aligned = (virtual & ~(page_size - 1));
return mlock((void *)aligned, page_size);
}
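/*
 * Illustrative usage sketch (not part of EAL): pin the page backing an
 * arbitrary object so it cannot be swapped out:
 *
 *	struct my_state obj;	// hypothetical application object
 *	if (rte_mem_lock_page(&obj) != 0)
 *		;	// handle mlock() failure, e.g. RLIMIT_MEMLOCK exceeded
 */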
int __rte_experimental
rte_memseg_contig_walk_thread_unsafe(rte_memseg_contig_walk_t func, void *arg)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int i, ms_idx, ret = 0;
for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
struct rte_memseg_list *msl = &mcfg->memsegs[i];
const struct rte_memseg *ms;
struct rte_fbarray *arr;
if (msl->memseg_arr.count == 0)
continue;
arr = &msl->memseg_arr;
ms_idx = rte_fbarray_find_next_used(arr, 0);
while (ms_idx >= 0) {
int n_segs;
size_t len;
ms = rte_fbarray_get(arr, ms_idx);
/* find how many more segments there are, starting with
* this one.
*/
n_segs = rte_fbarray_find_contig_used(arr, ms_idx);
len = n_segs * msl->page_sz;
ret = func(msl, ms, len, arg);
if (ret)
return ret;
ms_idx = rte_fbarray_find_next_used(arr,
ms_idx + n_segs);
}
}
return 0;
}
int __rte_experimental
rte_memseg_contig_walk(rte_memseg_contig_walk_t func, void *arg)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int ret = 0;
/* do not allow allocations/frees/init while we iterate */
rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
ret = rte_memseg_contig_walk_thread_unsafe(func, arg);
rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
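/*
 * Illustrative usage sketch (not part of EAL): count contiguous VA chunks,
 * using the callback prototype implied by the func(msl, ms, len, arg) call
 * above (memseg list, first memseg, contiguous length, user argument):
 *
 *	static int
 *	count_chunk(const struct rte_memseg_list *msl __rte_unused,
 *			const struct rte_memseg *ms __rte_unused,
 *			size_t len __rte_unused, void *arg)
 *	{
 *		(*(unsigned int *)arg)++;
 *		return 0;
 *	}
 *
 *	unsigned int n = 0;
 *	rte_memseg_contig_walk(count_chunk, &n);
 */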
int __rte_experimental
rte_memseg_walk_thread_unsafe(rte_memseg_walk_t func, void *arg)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int i, ms_idx, ret = 0;
for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
struct rte_memseg_list *msl = &mcfg->memsegs[i];
const struct rte_memseg *ms;
struct rte_fbarray *arr;
if (msl->memseg_arr.count == 0)
continue;
arr = &msl->memseg_arr;
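		/* visit only the fbarray slots that are marked as used */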
ms_idx = rte_fbarray_find_next_used(arr, 0);
while (ms_idx >= 0) {
ms = rte_fbarray_get(arr, ms_idx);
ret = func(msl, ms, arg);
if (ret)
return ret;
ms_idx = rte_fbarray_find_next_used(arr, ms_idx + 1);
}
}
return 0;
}
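/*
 * Lock-taking wrapper around the walk above: the memory hotplug read lock
 * is held for the duration of the iteration, so segments cannot be
 * allocated or freed from under the callback.
 *
 * Illustrative usage sketch (names are hypothetical, not part of this file);
 * the callback signature matches the func(msl, ms, arg) call used by the walk:
 *
 *	static int
 *	count_cb(const struct rte_memseg_list *msl __rte_unused,
 *			const struct rte_memseg *ms __rte_unused, void *arg)
 *	{
 *		(*(unsigned int *)arg)++;
 *		return 0;
 *	}
 *
 *	unsigned int n_segs = 0;
 *	rte_memseg_walk(count_cb, &n_segs);
 */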
int __rte_experimental
rte_memseg_walk(rte_memseg_walk_t func, void *arg)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int ret = 0;
/* do not allow allocations/frees/init while we iterate */
rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
ret = rte_memseg_walk_thread_unsafe(func, arg);
rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
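/*
 * Walk all valid memseg lists (those with a reserved VA base) without
 * taking the memory hotplug lock; the caller is responsible for holding it.
 */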
int __rte_experimental
rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int i, ret = 0;
for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
struct rte_memseg_list *msl = &mcfg->memsegs[i];
if (msl->base_va == NULL)
continue;
ret = func(msl, arg);
if (ret)
return ret;
}
return 0;
}
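/* same memseg list walk as above, taken under the memory hotplug read lock */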
int __rte_experimental
rte_memseg_list_walk(rte_memseg_list_walk_t func, void *arg)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int ret = 0;
/* do not allow allocations/frees/init while we iterate */
rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
ret = rte_memseg_list_walk_thread_unsafe(func, arg);
rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
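/*
 * Look up the file descriptor backing a memseg. The segment is resolved
 * back to its memseg list and fbarray index; external segments are rejected
 * with ENOTSUP, and negative return codes from eal_memalloc are converted
 * into rte_errno with -1 returned.
 */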
int __rte_experimental
rte_memseg_get_fd_thread_unsafe(const struct rte_memseg *ms)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
struct rte_memseg_list *msl;
struct rte_fbarray *arr;
int msl_idx, seg_idx, ret;
if (ms == NULL) {
rte_errno = EINVAL;
return -1;
}
msl = rte_mem_virt2memseg_list(ms->addr);
if (msl == NULL) {
rte_errno = EINVAL;
return -1;
}
arr = &msl->memseg_arr;
msl_idx = msl - mcfg->memsegs;
seg_idx = rte_fbarray_find_idx(arr, ms);
if (!rte_fbarray_is_used(arr, seg_idx)) {
rte_errno = ENOENT;
return -1;
}
/* segment fd API is not supported for external segments */
if (msl->external) {
rte_errno = ENOTSUP;
return -1;
}
ret = eal_memalloc_get_seg_fd(msl_idx, seg_idx);
if (ret < 0) {
rte_errno = -ret;
ret = -1;
}
return ret;
}
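/* lock-taking wrapper around rte_memseg_get_fd_thread_unsafe() */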
int __rte_experimental
rte_memseg_get_fd(const struct rte_memseg *ms)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int ret;
rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
ret = rte_memseg_get_fd_thread_unsafe(ms);
rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
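/*
 * Same lookup as rte_memseg_get_fd_thread_unsafe(), but reports the
 * segment's offset within its backing file via *offset.
 */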
int __rte_experimental
rte_memseg_get_fd_offset_thread_unsafe(const struct rte_memseg *ms,
size_t *offset)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
struct rte_memseg_list *msl;
struct rte_fbarray *arr;
int msl_idx, seg_idx, ret;
if (ms == NULL || offset == NULL) {
rte_errno = EINVAL;
return -1;
}
msl = rte_mem_virt2memseg_list(ms->addr);
if (msl == NULL) {
rte_errno = EINVAL;
return -1;
}
arr = &msl->memseg_arr;
msl_idx = msl - mcfg->memsegs;
seg_idx = rte_fbarray_find_idx(arr, ms);
if (!rte_fbarray_is_used(arr, seg_idx)) {
rte_errno = ENOENT;
return -1;
}
/* segment fd API is not supported for external segments */
if (msl->external) {
rte_errno = ENOTSUP;
return -1;
}
ret = eal_memalloc_get_seg_fd_offset(msl_idx, seg_idx, offset);
if (ret < 0) {
rte_errno = -ret;
ret = -1;
}
return ret;
}
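/* lock-taking wrapper around rte_memseg_get_fd_offset_thread_unsafe() */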
int __rte_experimental
rte_memseg_get_fd_offset(const struct rte_memseg *ms, size_t *offset)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int ret;
rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
ret = rte_memseg_get_fd_offset_thread_unsafe(ms, offset);
rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
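/*
 * Register an externally allocated memory area with DPDK. The area must be
 * page-aligned and a whole multiple of page_sz, and if an IOVA table is
 * supplied it must have one entry per page. A dedicated external memseg
 * list is created for it on a freshly assigned socket ID.
 *
 * Illustrative usage sketch (buffer and sizes are hypothetical):
 *
 *	size_t page_sz = 1U << 21;
 *	void *buf = ...;	// 64 pages, page_sz-aligned, e.g. from mmap()
 *	if (rte_extmem_register(buf, 64 * page_sz, NULL, 0, page_sz) != 0)
 *		// handle rte_errno
 */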
int __rte_experimental
rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[],
unsigned int n_pages, size_t page_sz)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
unsigned int socket_id, n;
int ret = 0;
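	/*
	 * reject NULL or unaligned va_addr, zero or non-power-of-two page
	 * sizes, lengths that are zero or not a whole number of pages, and
	 * IOVA tables whose entry count does not match the page count
	 */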
if (va_addr == NULL || page_sz == 0 || len == 0 ||
!rte_is_power_of_2(page_sz) ||
RTE_ALIGN(len, page_sz) != len ||
((len / page_sz) != n_pages && iova_addrs != NULL) ||
!rte_is_aligned(va_addr, page_sz)) {
rte_errno = EINVAL;
return -1;
}
rte_rwlock_write_lock(&mcfg->memory_hotplug_lock);
/* make sure the segment doesn't already exist */
if (malloc_heap_find_external_seg(va_addr, len) != NULL) {
rte_errno = EEXIST;
ret = -1;
goto unlock;
}
/* get next available socket ID */
socket_id = mcfg->next_socket_id;
if (socket_id > INT32_MAX) {
		RTE_LOG(ERR, EAL, "Cannot assign new socket IDs\n");
rte_errno = ENOSPC;
ret = -1;
goto unlock;
}
/* we can create a new memseg */
n = len / page_sz;
if (malloc_heap_create_external_seg(va_addr, iova_addrs, n,
page_sz, "extmem", socket_id) == NULL) {
ret = -1;
goto unlock;
}
/* memseg list successfully created - increment next socket ID */
mcfg->next_socket_id++;
unlock:
rte_rwlock_write_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
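/*
 * Remove a previously registered external memory area. The area is looked
 * up by address and length, and its external memseg list is torn down.
 */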
int __rte_experimental
rte_extmem_unregister(void *va_addr, size_t len)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
struct rte_memseg_list *msl;
int ret = 0;
if (va_addr == NULL || len == 0) {
rte_errno = EINVAL;
return -1;
}
rte_rwlock_write_lock(&mcfg->memory_hotplug_lock);
/* find our segment */
msl = malloc_heap_find_external_seg(va_addr, len);
if (msl == NULL) {
rte_errno = ENOENT;
ret = -1;
goto unlock;
}
ret = malloc_heap_destroy_external_seg(msl);
unlock:
rte_rwlock_write_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
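/*
 * Common body for rte_extmem_attach()/rte_extmem_detach(): find the external
 * segment describing the area and attach to or detach from its memseg
 * fbarray under the memory hotplug write lock.
 */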
static int
sync_memory(void *va_addr, size_t len, bool attach)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
struct rte_memseg_list *msl;
int ret = 0;
if (va_addr == NULL || len == 0) {
rte_errno = EINVAL;
return -1;
}
rte_rwlock_write_lock(&mcfg->memory_hotplug_lock);
/* find our segment */
msl = malloc_heap_find_external_seg(va_addr, len);
if (msl == NULL) {
rte_errno = ENOENT;
ret = -1;
goto unlock;
}
if (attach)
ret = rte_fbarray_attach(&msl->memseg_arr);
else
ret = rte_fbarray_detach(&msl->memseg_arr);
unlock:
rte_rwlock_write_unlock(&mcfg->memory_hotplug_lock);
return ret;
}
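/* attach to an external memory area registered by another process */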
int __rte_experimental
rte_extmem_attach(void *va_addr, size_t len)
{
return sync_memory(va_addr, len, true);
}
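/* detach from an external memory area attached with rte_extmem_attach() */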
int __rte_experimental
rte_extmem_detach(void *va_addr, size_t len)
{
return sync_memory(va_addr, len, false);
}
/* init memory subsystem */
int
rte_eal_memory_init(void)
{
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
int retval;
RTE_LOG(DEBUG, EAL, "Setting up physically contiguous memory...\n");

	if (!mcfg)
		return -1;

	/* lock mem hotplug here, to prevent races while we init */
	rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
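
	/* create the memseg lists and reserve VA space for them up front;
	 * actual pages are mapped into this space later on
	 */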
	if (rte_eal_memseg_init() < 0)
		goto fail;
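
	/* initialize the page allocation backend (eal_memalloc) */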
	if (eal_memalloc_init() < 0)
		goto fail;
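
	/* primary process maps hugepage memory; secondary processes attach to
	 * the mappings already created by the primary
	 */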
	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
			rte_eal_hugepage_init() :
			rte_eal_hugepage_attach();
	if (retval < 0)
		goto fail;
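
	/* record memory channel/rank counts in the shared config; skipped
	 * when shared configuration is disabled (--no-shconf)
	 */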
	if (internal_config.no_shconf == 0 && rte_eal_memdevice_init() < 0)
		goto fail;
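
	/* note: on success the hotplug read lock is deliberately left held;
	 * it is released later in the EAL init sequence
	 */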
	return 0;
fail:
	rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock);
	return -1;
}