ARM is supporting maximum 4 hugepage sizes (64K, 2M, 32M
and 1G) when granule is 4KB since very long and DPDK
support maximum 3 hugepage sizes.
With all 4 hugepage sizes enabled, applications and some
stacks like VPP which are working over DPDK and using
"in-memory" eal option, or using separate mount points
on ARM based platform, fails at huge page initialization,
reporting error messages from eal:
EAL: FATAL: Cannot get hugepage information.
EAL: Cannot get hugepage information.
EAL: Error - exiting with code: 1
This issue is originated from Linux 5.0
(a21b0b78eaf7 "arm64: hugetlb: Register hugepages during arch init")
where kernel is by default creating directories for each supported
hugepage size in /sys/kernel/mm/hugepages/
On earlier Stable Kernel LTR's, the directories visible in
/sys/kernel/mm/hugepages/ were dependent upon what hugepage
sizes are configured at boot time.
This change increases the maximum supported hugepage sizes
to 4 for ARM based platforms.
Cc: stable@dpdk.org
Signed-off-by: Gagandeep Singh <g.singh@nxp.com>
Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
Spawning the ctrl threads on anything that is not part of the eal
coremask is not that polite to the rest of the system, especially
when you took good care to pin your processes on cpu resources with
tools like taskset (linux) / cpuset (freebsd).
Rather than introduce yet another eal options to control on which cpu
those ctrl threads are created, let's take the startup cpu affinity
as a reference and remove the eal coremask from it.
If no cpu is left, then we default to the master core.
The cpuset is computed once at init before the original cpu affinity
is lost.
Introduced a RTE_CPU_AND macro to abstract the differences between linux
and freebsd respective macros.
Examples in a 4 cores FreeBSD vm:
$ ./build/app/testpmd -l 2,3 --no-huge --no-pci -m 512 \
-- -i --total-num-mbufs=2048
$ procstat -S 1057
PID TID COMM TDNAME CPU CSID CPU MASK
1057 100131 testpmd - 2 1 2
1057 100140 testpmd eal-intr-thread 1 1 0-1
1057 100141 testpmd rte_mp_handle 1 1 0-1
1057 100142 testpmd lcore-slave-3 3 1 3
$ cpuset -l 1,2,3 ./build/app/testpmd -l 2,3 --no-huge --no-pci -m 512 \
-- -i --total-num-mbufs=2048
$ procstat -S 1061
PID TID COMM TDNAME CPU CSID CPU MASK
1061 100131 testpmd - 2 2 2
1061 100144 testpmd eal-intr-thread 1 2 1
1061 100145 testpmd rte_mp_handle 1 2 1
1061 100147 testpmd lcore-slave-3 3 2 3
$ cpuset -l 2,3 ./build/app/testpmd -l 2,3 --no-huge --no-pci -m 512 \
-- -i --total-num-mbufs=2048
$ procstat -S 1065
PID TID COMM TDNAME CPU CSID CPU MASK
1065 100131 testpmd - 2 2 2
1065 100148 testpmd eal-intr-thread 2 2 2
1065 100149 testpmd rte_mp_handle 2 2 2
1065 100150 testpmd lcore-slave-3 3 2 3
Fixes: d651ee4919 ("eal: set affinity for control threads")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
Currently, we use strdup in a few places to store command-line
parameter values for certain internal config values. There are
several issues with that.
First of all, they're never freed, so memory ends up leaking
either after EAL exit, or when these command-line options are
supplied multiple times.
Second of all, they're defined as `const char *`, so they
*cannot* be freed even if we wanted to.
Finally, strdup may return NULL, which will be stored in the
config. For most fields, NULL is a valid value, but for the
default prefix, the value is always expected to be valid.
To fix all of this, three things are done. First, we change
the definitions of these values to `char *` as opposed to
`const char *`. This does not break the ABI, and previous
code assumes constness (which is more restrictive), so it's
safe to do so.
Then, fix all usages of strdup to check return value, and add
a cleanup function that will free the memory occupied by
these strings, as well as freeing them before assigning a new
value to prevent leaks when parameter is specified multiple
times.
And finally, add an internal API to query hugefile prefix, so
that, absent of a valid value, a default value will be
returned, and also fix up all usages of hugefile prefix to
use this API instead of accessing hugefile prefix directly.
Bugzilla ID: 108
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
SPDK uses the rte_mem_event_callback_register API to
create RDMA memory regions (MRs) for newly allocated regions
of memory. This is used in both the SPDK NVMe-oF target
and the NVMe-oF host driver.
DPDK creates internal malloc_elem structures for these
allocated regions. As users malloc and free memory, DPDK
will sometimes merge malloc_elems that originated from
different allocations that were notified through the
registered mem_event callback routine. This results
in subsequent allocations that can span across multiple
RDMA MRs. This requires SPDK to check each DPDK buffer to
see if it crosses an MR boundary, and if so, would have to
add considerable logic and complexity to describe that
buffer before it can be accessed by the RNIC. It is somewhat
analagous to rte_malloc returning a buffer that is not
IOVA-contiguous.
As a malloc_elem gets split and some of these elements
get freed, it can also result in DPDK sending an
RTE_MEM_EVENT_FREE notification for a subset of the
original RTE_MEM_EVENT_ALLOC notification. This is also
problematic for RDMA memory regions, since unregistering
the memory region is all-or-nothing. It is not possible
to unregister part of a memory region.
To support these types of applications, this patch adds
a new --match-allocations EAL init flag. When this
flag is specified, malloc elements from different
hugepage allocations will never be merged. Memory will
also only be freed back to the system (with the requisite
memory event callback) exactly as it was originally
allocated.
Since part of this patch is extending the size of struct
malloc_elem, we also fix up the malloc autotests so they
do not assume its size exactly fits in one cacheline.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
In the case of user don't want to use bus iova scheme and want
to override.
For that, adding EAL option --iova-mode=<string> where valid input
string is 'pa' or 'va'.
Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Signed-off-by: Eric Zhang <eric.zhang@windriver.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
This command-line option will cause DPDK to operate entirely in
memory and not create any shared files at runtime, including any
shared configuration or hugetlbfs files. This is useful for debug
purposes, as well as for certain use cases like containers or
automatic memory cleanup.
Currently, this option acts as a strict superset of --no-shconf and
--huge-unlink commands.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Previously, it was possible to limit maximum amount of memory
allowed for allocation by creating validator callbacks. Although a
powerful tool, it's a bit of a hassle and requires modifying the
application for it to work with DPDK example applications.
Fix this by adding a new parameter "--socket-limit", with syntax
similar to "--socket-mem", which would set per-socket memory
allocation limits, and set up a default validator callback to deny
all allocations above the limit.
This option is incompatible with legacy mode, as validator callbacks
are not supported there.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Since we are going to need to map hugepages in both primary and
secondary processes, we need to know where we should look for
hugetlbfs mountpoints. So, share those with secondary processes,
and map them on init.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Currently, DPDK stores all pages as separate files in hugetlbfs.
This option will allow storing all pages in one file (one file
per memseg list).
We do this by using fallocate() calls on FreeBSD, however this is
only supported on fairly recent (4.3+) kernels, so ftruncate()
fallback is provided to grow (but not shrink) hugepage files.
Naming scheme is deterministic, so both primary and secondary
processes will be able to easily map needed files and offsets.
For multi-file segments, we can close fd's right away. For
single-file segments, we can reuse the same fd and reduce the
amount of fd's needed to map/use hugepages. However, we need to
store the fd's somewhere, so we add a tailq.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Before, we were aggregating multiple pages into one memseg, so the
number of memsegs was small. Now, each page gets its own memseg,
so the list of memsegs is huge. To accommodate the new memseg list
size and to keep the under-the-hood workings sane, the memseg list
is now not just a single list, but multiple lists. To be precise,
each hugepage size available on the system gets one or more memseg
lists, per socket.
In order to support dynamic memory allocation, we reserve all
memory in advance (unless we're in 32-bit legacy mode, in which
case we do not preallocate memory). As in, we do an anonymous
mmap() of the entire maximum size of memory per hugepage size, per
socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or
RTE_MAX_MEM_MB_PER_TYPE megabytes worth of memory, whichever is the
smaller one), split over multiple lists (which are limited to
either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_MB_PER_LIST
megabytes per list, whichever is the smaller one). There is also
a global limit of CONFIG_RTE_MAX_MEM_MB megabytes, which is mainly
used for 32-bit targets to limit amounts of preallocated memory,
but can be used to place an upper limit on total amount of VA
memory that can be allocated by DPDK application.
So, for each hugepage size, we get (by default) up to 128G worth
of memory, per socket, split into chunks of up to 32G in size.
The address space is claimed at the start, in eal_common_memory.c.
The actual page allocation code is in eal_memalloc.c (Linux-only),
and largely consists of copied EAL memory init code.
Pages in the list are also indexed by address. That is, in order
to figure out where the page belongs, one can simply look at base
address for a memseg list. Similarly, figuring out IOVA address
of a memzone is a matter of finding the right memseg list, getting
offset and dividing by page size to get the appropriate memseg.
This commit also removes rte_eal_dump_physmem_layout() call,
according to deprecation notice [1], and removes that deprecation
notice as well.
On 32-bit targets due to limited VA space, DPDK will no longer
spread memory to different sockets like before. Instead, it will
(by default) allocate all of the memory on socket where master
lcore is. To override this behavior, --socket-mem must be used.
The rest of the changes are really ripple effects from the memseg
change - heap changes, compile fixes, and rewrites to support
fbarray-backed memseg lists. Due to earlier switch to _walk()
functions, most of the changes are simple fixes, however some
of the _walk() calls were switched to memseg list walk, where
it made sense to do so.
Additionally, we are also switching locks from flock() to fcntl().
Down the line, we will be introducing single-file segments option,
and we cannot use flock() locks to lock parts of the file. Therefore,
we will use fcntl() locks for legacy mem as well, in case someone is
unfortunate enough to accidentally start legacy mem primary process
alongside an already working non-legacy mem-based primary process.
[1] http://dpdk.org/dev/patchwork/patch/34002/
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This adds a "--legacy-mem" command-line switch. It will be used to
go back to the old memory behavior, one where we can't dynamically
allocate/free memory (the downside), but one where the user can
get physically contiguous memory, like before (the upside).
For now, nothing but the legacy behavior exists, non-legacy
memory init sequence will be added later. For FreeBSD, non-legacy
memory init will never be enabled, while for Linux, it is
disabled in this patch to avoid breaking bisect, but will be
enabled once non-legacy mode will be fully operational.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Currently, primary process initialization is finalized by setting
the RTE_MAGIC value in the shared config. However, it is not
possible to check whether secondary process initialization has
completed. Add such a value to internal config.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
This patch prefix the mbuf pool ops name with "user" to indicate
that it is user defined.
Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Replace the BSD license header with the SPDX tag for files
with only an Intel copyright on them.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
We remove xen-specific code in EAL, including the option --xen-dom0,
memory initialization code, compiling dependency, etc.
Related documents are removed or updated, and bump the eal library
version.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
DPDK has support for both sw and hw mempool and
currently user is limited to use ring_mp_mc pool.
In case user want to use other pool handle,
need to update config RTE_MEMPOOL_OPS_DEFAULT, then
build and run with desired pool handle.
Introducing eal option to override default pool handle.
Now user can override the RTE_MEMPOOL_OPS_DEFAULT by passing
pool handle to eal `--mbuf-pool-ops-name=""`.
Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
This field is only used in the initialization phase. Remove it since the
global log level can also be retrieved using a public API:
rte_log_get_global_level().
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
Some comments have a wrong space between /** and <.
Seen with
git grep '\*\* <'
Reported-by: David Marchand <david.marchand@6wind.com>
Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>
When an application using huge-pages crash or exists, the hugetlbfs
backing files are not cleaned up. This is a patch to clean those files.
There are multi-process DPDK applications that may be benefited by those
backing files. Therefore, I have made that configurable so that the
application that does not need those backing files can remove them, thus
not changing the current default behavior. The application itself can
clean it up, however the rationale behind DPDK cleaning it up is, DPDK
created it and therefore, it is better it unlinks it.
Signed-off-by: Shesha Sreenivasamurthy <shesha@cisco.com>
Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
lib/librte_eal/linuxapp/eal/eal_memory.c:324:4: error: comparison
is always false due to limited range of data type [-Werror=type-limits]
|| (hugepage_sz == RTE_PGSIZE_16G)) {
^
This was introuduced by commit b77b5639:
mem: add huge page sizes for IBM Power
The root cause is that size_t is 32-bit in i686 platform,
but RTE_PGSIZE_16M and RTE_PGSIZE_16G are always 64-bit.
Force hugepage_sz to always 64-bit to avoid this issue.
Signed-off-by: Michael Qiu <michael.qiu@intel.com>
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Thomas Monjalon <thomas.monjalon@6wind.com>
Now that internal config structure is common to Linux and BSD,
we can have a common function to initialize it.
Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Some guards are missing or have a wrong name.
Others have LINUXAPP in their name but are now common.
Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
No need to have different headers for Linux and BSD.
These files are identicals with exception of internal config which has
uio and vfio fields only useful for Linux.
Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>