Add an API to allow creating new malloc heaps. They will be created
with socket IDs going above RTE_MAX_NUMA_NODES, to avoid clashing
with internal heaps.
This breaks the ABI, so document the change.
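The socket ID scheme described above can be sketched roughly as follows. This is only an illustration: the `sketch_*` names, the table sizes, and the choice to start numbering exactly at the node limit are assumptions, not the actual EAL implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the real limits come from the DPDK build config. */
#define SKETCH_MAX_NUMA_NODES 8
#define SKETCH_MAX_HEAPS 32
#define SKETCH_HEAP_NAME_LEN 32

struct sketch_heap {
	char name[SKETCH_HEAP_NAME_LEN]; /* an empty name marks a free slot */
	int socket_id;
};

static struct sketch_heap sketch_heaps[SKETCH_MAX_HEAPS];
/* Valid NUMA socket IDs are below the limit, so external IDs start at it. */
static int sketch_next_socket_id = SKETCH_MAX_NUMA_NODES;

/* Create a named external heap; return its socket ID, or -1 on failure. */
int
sketch_heap_create(const char *name)
{
	size_t len, i;

	if (name == NULL)
		return -1;
	len = strlen(name);
	if (len == 0 || len >= SKETCH_HEAP_NAME_LEN)
		return -1; /* empty or too long */
	for (i = 0; i < SKETCH_MAX_HEAPS; i++)
		if (strcmp(sketch_heaps[i].name, name) == 0)
			return -1; /* name already in use */
	for (i = 0; i < SKETCH_MAX_HEAPS; i++) {
		if (sketch_heaps[i].name[0] != '\0')
			continue; /* slot is occupied */
		memcpy(sketch_heaps[i].name, name, len + 1);
		sketch_heaps[i].socket_id = sketch_next_socket_id++;
		return sketch_heaps[i].socket_id;
	}
	return -1; /* no free slot */
}
```

Because the counter only grows, an ID handed out to an external heap can never collide with a real NUMA node ID in this sketch.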
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
An API is needed to check whether a particular socket ID belongs
to an internal or external heap. The prime user of this would be
the mempool allocator, because normal assumptions of IOVA
contiguousness in IOVA as VA mode do not hold in case of
externally allocated memory.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
When we create external heaps, they will have their own
"fake" socket ID, so add a function that will map the heap name
to its socket ID.
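A minimal sketch of such a name-to-socket-ID lookup, assuming heaps are kept in a flat table. The struct and function names here are hypothetical and chosen for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view of a heap: its unique name and its socket ID. */
struct sketch_heap_entry {
	const char *name;
	int socket_id;
};

/* Return the socket ID of the heap called `name`, or -1 if there is none. */
int
sketch_heap_get_socket(const struct sketch_heap_entry *heaps, size_t n_heaps,
		const char *name)
{
	size_t i;

	if (heaps == NULL || name == NULL)
		return -1;
	for (i = 0; i < n_heaps; i++) {
		if (heaps[i].name != NULL && strcmp(heaps[i].name, name) == 0)
			return heaps[i].socket_id;
	}
	return -1;
}
```

The returned ID could then be passed to any allocation call that takes a socket ID argument.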
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
We will need to refer to external heaps in some way. While we use
heap IDs internally, for external API use it has to be something
more user-friendly. So, we will be using a string to uniquely
identify a heap.
This breaks the ABI, so document the change.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
We will be assigning "invalid" socket IDs to external heaps, and
malloc will now be able to verify if a supplied socket ID is in
fact a valid one, rendering parameter checks for sockets
obsolete.
This changes the semantics of what we understand by "socket ID",
so document the change in the release notes.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Switch over all parts of EAL to use heap ID instead of NUMA node
ID to identify heaps. The heap ID for DPDK-internal heaps is the NUMA
node's index within the detected NUMA node list. The heap ID for
external heaps will be the order of their creation.
This breaks the ABI, so document the changes.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
When we allocate and use DPDK memory, we need to be able to
differentiate between DPDK hugepage segments and segments that
were made part of DPDK but are externally allocated. Add such
a property to memseg lists.
This breaks the ABI, so document the change in release notes.
This also breaks a few internal assumptions about memory
contiguousness, so adjust malloc code in a few places.
All current calls for memseg walk functions were adjusted to
ignore external segments where it made sense.
Mempools are a special case, because we may be asked to allocate
a mempool on a specific socket, and we need to ignore all page
sizes on other heaps or other sockets. Previously, this
assumption of knowing all page sizes was not a problem, but it
will be now, so we have to match socket ID with page size when
calculating minimum page size for a mempool.
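The socket-aware minimum page size calculation can be sketched as below. The `sketch_memseg_list` struct is a simplified, hypothetical stand-in for a memseg list, and the "any socket" constant is an assumed placeholder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wildcard meaning "do not filter by socket". */
#define SKETCH_SOCKET_ID_ANY (-1)

/* Simplified stand-in for a memseg list: only the fields used here. */
struct sketch_memseg_list {
	uint64_t page_sz;
	int socket_id;
};

/*
 * Return the smallest page size among the lists that belong to `socket_id`,
 * or 0 if no list matches. Lists on other sockets or heaps are ignored.
 */
uint64_t
sketch_min_page_size(const struct sketch_memseg_list *lists, size_t n_lists,
		int socket_id)
{
	uint64_t min = 0;
	size_t i;

	for (i = 0; i < n_lists; i++) {
		if (socket_id != SKETCH_SOCKET_ID_ANY &&
				lists[i].socket_id != socket_id)
			continue; /* page size belongs to another socket */
		if (min == 0 || lists[i].page_sz < min)
			min = lists[i].page_sz;
	}
	return min;
}
```

Without the socket filter, a small page size that only exists on an external heap would wrongly shrink the result for a mempool on a regular NUMA socket.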
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Previously, to calculate the length of the memory area covered by a memseg
list, we had to multiply the page size by the length of the fbarray
backing that memseg list. This is not obvious and unnecessarily
low level, so store length in the memseg list itself.
This breaks ABI, so bump the EAL ABI version and document the
change. Also, while we're breaking ABI, pack the members a little
better.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Currently, DPDK will skip mapping some areas (or even an entire BAR)
if MSI-X table happens to be in them but is smaller than page size.
Kernels 4.16+ will allow mapping MSI-X BARs [1], and will report this
as a capability flag. Capability flags themselves are also only
supported since kernel 4.6 [2].
This commit will introduce support for checking VFIO capabilities,
and will use it to check if we are allowed to map BARs with MSI-X
tables in them, along with backwards compatibility for older
kernels, including a workaround for a variable rename in VFIO
region info structure [3].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c84982adb23bcf3b99b79ca33527cd2625fbe279
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ff63eb638d63b95e489f976428f1df01391e15e4
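The capability check described above amounts to walking a chain of capability headers inside the region info buffer. The sketch below uses a simplified stand-in struct modelled on the kernel's capability header rather than including the kernel header itself; the `sketch_*` names are hypothetical, and the real flag and capability ID values would come from `<linux/vfio.h>`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified stand-in for a capability header: an ID, a version, and the
 * offset of the next header from the start of the buffer (0 ends the chain).
 */
struct sketch_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;
};

/*
 * Return 1 if a capability with `id` is found in the chain that starts at
 * `first_off` inside `buf` (of `len` bytes), 0 otherwise. An offset of 0
 * models an older kernel that reports no capabilities at all.
 */
int
sketch_has_cap(const uint8_t *buf, size_t len, uint32_t first_off, uint16_t id)
{
	uint32_t off = first_off;
	size_t hops = 0;

	while (off != 0) {
		struct sketch_cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return 0; /* header would run past the buffer */
		if (++hops > len)
			return 0; /* guard against a looping chain */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return 1;
		off = hdr.next;
	}
	return 0;
}
```

A caller would treat a missing capability (or no capability support at all) as "do not map the MSI-X area", which preserves the old behaviour on older kernels.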
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
When NUMA-aware hugepages config option is set, we rely on
libnuma to tell the kernel to allocate hugepages on a specific
NUMA node. However, we allocate node mask before we check if
NUMA is available in the first place, which, according to
the manpage [1], causes undefined behaviour.
Fix by only using nodemask when we have NUMA available.
[1] https://linux.die.net/man/3/numa_alloc_onnode
Bugzilla ID: 20
Fixes: 1b72605d2416 ("mem: balanced allocation of hugepages")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Currently, command-line switches for legacy mem mode or single-file
segments mode are only stored in internal config. This leads to a
situation where these flags have to always match between primary
and secondary, which is bad for usability.
Fix this by storing these flags in the shared config as well, so
that secondary process can know if the primary was launched in
single-file segments or legacy mem mode.
This bumps the EAL ABI, however there's an EAL deprecation notice
already in place[1] for a different feature, so that's OK.
[1] http://patches.dpdk.org/patch/43502/
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
When compiling on FreeBSD, lots of warnings/errors are thrown for
unused parameter. Fix these by marking the parameters as unused
in the code.
Fixes: 1009ba1704f9 ("mem: add internal API to get and set segment fd")
Fixes: 3a44687139eb ("mem: allow querying offset into segment fd")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Enable using memfd-created segments if supported by the system.
This will allow having real fd's for pages but without hugetlbfs
mounts, which will enable in-memory mode to be used with virtio.
The implementation is mostly piggy-backing on existing real-fd
code, except that we no longer need to unlink any files or track
per-page locks in single-file segments mode, because in-memory
mode does not support secondary processes anyway.
We move some checks from EAL command-line parsing code to memalloc
because it is now possible to use single-file segments mode with
in-memory mode, but only if memfd is supported.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
In a few cases, user may need to query offset into fd for a
particular memory segment (for example, to selectively map
pages). This commit adds a new API to do that.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Now that we can retrieve page fd's internally, we can expose it
as an external API. This will add two flavors of API - thread-safe
and non-thread-safe. Fix up internal API's to return values we need
without modifying rte_errno internally if called from within EAL.
We do not want calling code to accidentally close an internal fd, so
we make a duplicate of it before we return it to the user. Caller is
therefore responsible for closing this fd.
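The duplicate-before-return idea can be illustrated with a short sketch. The `sketch_segment` struct and the function name are hypothetical; only `dup()` is a real POSIX call.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical internal record of a segment's fd; -1 means "no fd". */
struct sketch_segment {
	int fd;
};

/*
 * Return a duplicate of the segment's internal fd, or -1 if the segment has
 * no fd or dup() fails. The caller owns the returned fd and must close() it;
 * closing it does not affect the internal descriptor.
 */
int
sketch_segment_get_fd(const struct sketch_segment *seg)
{
	if (seg == NULL || seg->fd < 0)
		return -1;
	return dup(seg->fd);
}
```

Since the caller only ever sees the duplicate, an accidental `close()` in user code cannot invalidate the descriptor that is tracked internally.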
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Enable setting and retrieving segment fd's internally.
For now, retrieving fd's will not be used anywhere until we
get an external API, but it will be useful for things like
virtio, where we wish to share segment fd's.
Setting segment fd's will not be available as a public API
at this time, but internally it is needed for legacy mode,
because we're not allocating our hugepages in memalloc in
legacy mode case, and we still need to store the fd.
Another user of get segment fd API is memseg info dump, to
show which pages use which fd's.
Not supported on FreeBSD.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Previously, we were only tracking lock file fd's in single-file
segments mode, but did not track fd's in non-single file mode
because we didn't need to (mmap() call still kept the lock). Now
that we are going to expose these fd's to the world, we need to
have access to them, so track them even in non-single file
segments mode.
We don't need to close fd's after mmap() because we're still
tracking them in an fd list. Also, for anonymous hugepages mode,
fd will always be -1 so exit early on error.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Previously, we were only using lock lists to store per-page lock fd's
because we cannot use modern fcntl() file description locks to lock
parts of the page in single file segments mode.
Now, we will be using this list to store either lock fd's (along with
memseg list fd) in single file segments mode, or per-page fd's (and set
memseg list fd to -1), so rename the list accordingly.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Previously, when we allocated hugepages, we closed the fd's corresponding
to them after we've done our mappings. Since we did mmap(), we didn't
actually lose the reference, but file descriptors used for mmap() do not
count against the fd limit. Since we are going to store all of our fd's,
we will hit the fd limit much more often when using smaller page sizes.
Fix this to raise the fd limit to maximum unconditionally.
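Raising the limit to its maximum can be done with the standard `getrlimit()`/`setrlimit()` calls. The sketch below assumes Linux and uses a hypothetical function name; it is not the exact code added by this commit.

```c
#include <sys/resource.h>

/*
 * Raise the soft limit on open file descriptors to the hard limit.
 * Returns 0 on success, -1 if the limit could not be read or changed.
 */
int
sketch_raise_fd_limit(void)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
		return -1;
	/* An unprivileged process may raise its soft limit up to the hard one. */
	lim.rlim_cur = lim.rlim_max;
	if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
		return -1;
	return 0;
}
```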
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
In-memory mode was never meant to support legacy mode, because we
cannot sort anonymous pages anyway.
Fixes: 72b49ff623c4 ("mem: support --in-memory mode")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
In noshconf mode, no shared files are created, but we're still trying
to unlink them, resulting in detach/destroy failure even though it
should have succeeded. Fix it by exiting early in noshconf mode.
Fixes: 3ee2cde248a7 ("fbarray: support --no-shconf mode")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
The strncpy function has long been deemed unsafe for use,
in favor of strlcpy or snprintf.
While snprintf is standard and strlcpy is still largely available,
they both have issues regarding error checking and performance.
Both will force reading the source buffer past the requested size
if the input is not a proper C string, and will return the expected
number of bytes copied, meaning that error checking needs to verify
that the number of bytes copied does not exceed the destination
size.
This contributes to awkward code flow, unclear error checking and
potential issues with malformed input.
The function strscpy has been discussed for some time already and
has been made available in the linux kernel[1].
Propose this new function as a safe alternative.
[1]: http://git.kernel.org/linus/30c44659f4a3
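To make the intended semantics concrete, here is a minimal sketch of a strscpy-style copy. The name `sketch_strscpy` and the exact return convention (negative `E2BIG` on truncation) are assumptions modelled on the kernel function referenced above, not necessarily the signature proposed in this patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h> /* ssize_t */

/*
 * Copy `src` into `dst`, which holds `dsize` bytes.
 * Never reads more than `dsize` bytes from `src`, and always NUL-terminates
 * `dst` when `dsize` is non-zero.
 * Returns the number of characters copied (excluding the NUL), or -E2BIG if
 * the source did not fit and was truncated.
 */
ssize_t
sketch_strscpy(char *dst, const char *src, size_t dsize)
{
	size_t i;

	if (dsize == 0)
		return -E2BIG;
	for (i = 0; i < dsize; i++) {
		dst[i] = src[i];
		if (src[i] == '\0')
			return (ssize_t)i;
	}
	/* Ran out of room before reaching the end of src: truncate. */
	dst[dsize - 1] = '\0';
	return -E2BIG;
}
```

With this convention the caller needs a single check, `if (ret < 0)`, instead of comparing a returned length against the destination size.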
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Juhamatti Kuusisaari <juhamatti.kuusisaari@coriant.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
This has only been build-tested for now, on a native ppc64el POWER8E
machine running Debian sid.
Signed-off-by: Luca Boccassi <bluca@debian.org>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
They are built by the legacy makefiles but not by Meson.
Fixes: 8f40ee0734c8 ("eal/x86: get hypervisor name")
Cc: stable@dpdk.org
Signed-off-by: Luca Boccassi <bluca@debian.org>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
We need to do the NULL pointer check first after malloc().
Fixes: 07dcbfe0101f ("malloc: support multiprocess memory hotplug")
Cc: stable@dpdk.org
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Start version numbering for a new release cycle,
and introduce a template file for release notes.
The release notes comments have a new block to suggest
the order of items, inspired by Ferruh's proposal.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: John McNamara <john.mcnamara@intel.com>
Hotplug functions should be used directly to add and remove devices.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
n_bits comes as the first argument, so align the doxygen comment.
n_bits need not be a multiple of 512, as n_bits is rounded
to RTE_BITMAP_CL_BIT_SIZE.
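As a worked example of the rounding, the sketch below assumes a 64-byte cache line (so 512 bits per cache line) and assumes the rounding is upwards. The macro and function names are hypothetical placeholders.

```c
#include <stdint.h>

/* Assumed: 64-byte cache line, i.e. 512 bits per cache line. */
#define SKETCH_CL_BIT_SIZE 512u

/*
 * Round n_bits up to the next multiple of the cache line bit size.
 * Values close to UINT32_MAX would overflow and are not handled here.
 */
uint32_t
sketch_round_n_bits(uint32_t n_bits)
{
	return ((n_bits + SKETCH_CL_BIT_SIZE - 1) / SKETCH_CL_BIT_SIZE) *
		SKETCH_CL_BIT_SIZE;
}
```

Under these assumptions a request for 1000 bits would be backed by storage for 1024 bits, so the caller does not have to pass a multiple of 512.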
Fixes: 14456f59e9f7 ("doc: fix doxygen warnings in QoS API")
Fixes: de3cfa2c9823 ("sched: initial import")
Cc: stable@dpdk.org
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
Currently, nic_uio driver does not support interrupts, so any
attempts to install an interrupt handler will fail with a
not supported error, which will cause an error message that is
confusing to the user.
Silence this error by moving it to debug log level, and reword
the message to avoid containing the word "Error", to avoid
triggering DTS test failures [1].
[1] https://git.dpdk.org/tools/dts/tree/tests/TestSuite_scatter.py?#n110
Fixes: 23150bd8d8a8 ("eal/bsd: add interrupt thread")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
This commit improves the error checking performed on the
core masks (or lists) of the service cores, in particular
with respect to the data-plane (RTE) cores of DPDK.
With this commit, invalid configurations are detected at
runtime, and warning messages are printed to inform the user.
For example specifying the coremask as 0xf, and the service
coremask as 0xff00 is invalid as not all service-cores are
contained within the coremask. A warning is now printed to
inform the user.
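The containment check in the example above reduces to a single bitwise operation. The function name below is hypothetical and the sketch is limited to 64 cores for simplicity.

```c
#include <stdint.h>

/*
 * Return the service core bits that are NOT part of the coremask.
 * A result of 0 means every service core is contained in the coremask.
 */
uint64_t
sketch_service_cores_outside(uint64_t coremask, uint64_t service_mask)
{
	return service_mask & ~coremask;
}
```

For the coremask 0xf and service coremask 0xff00 from the example, the result is non-zero, which is the condition that triggers the warning.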
Reported-by: Vipin Varghese <vipin.varghese@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Vipin Varghese <vipin.varghese@intel.com>
A few regressions with virtio/vhost have been discovered, due to the
strong dependency of virtio/vhost on the underlying memory layout.
Specifically, virtio/vhost share all memory pages starting from the
beginning of the segment, while the patch below made it so that the
memory is always allocated from the top of VA space, not from the
bottom.
Fixes: 179f916e88e4 ("mem: allocate in reverse to reduce fragmentation")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
When building with meson on e.g. cygwin, the error message about an
unsupported platform referenced an unknown variable since
"host_machine" was missing an "_".
Fixes: 844514c73569 ("eal: build with meson")
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Clearing vfio_group_fd does not need to involve any IPC.
Also, the current IPC implementation for SOCKET_CLR_GROUP is not
correct: rte_vfio_clear_group on a secondary process will always fail,
which prevents the device from being detached correctly on a secondary
process. This patch simply removes all IPC-related code from
rte_vfio_clear_group.
Fixes: 83a73c5fef66 ("vfio: use generic multi-process channel")
Cc: stable@dpdk.org
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Hotplug-adding an already plugged PCI device corrupts
rte_pci_device->device.name due to an unexpected
rte_devargs_remove. Likewise, hotplug-removing an already
unplugged device causes a segmentation fault due to an unexpected
bus->unplug on an rte_device whose driver is NULL.
This patch fixes these issues.
Fixes: 7e8b26650146 ("eal: fix hotplug add / remove")
Cc: stable@dpdk.org
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Technically, single file segments codepath will never get
triggered when using in-memory mode, because EAL prohibits
mixing these two options at initialization time. However,
code analyzers do not know that, and some will complain
about either using uninitialized variables, or trying to
do operations on an already closed descriptor.
Fix this by assuring the compiler or code analyzer that
in-memory mode code never gets triggered when using
single-file segments mode.
Coverity issue: 302847
Fixes: 72b49ff623c4 ("mem: support --in-memory mode")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Previously, we were skipping erasing pad because we were
expecting it to be freed when we were merging adjacent
segments. However, if there were no adjacent segments to
merge, we would've skipped erasing the pad, leaving non-zero
memory in our free space.
Fix this by including pad in the erasing unconditionally.
Fixes: e43a9f52b7ff ("malloc: fix pad erasing")
Cc: stable@dpdk.org
Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
Space for string terminating NUL character should be provided to
snprintf() to avoid the last symbol truncation.
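The off-by-one is easy to reproduce: the size passed to `snprintf()` counts the terminating NUL, so copying `len` characters needs a size of `len + 1`. The helper below is a hypothetical illustration, not the devargs code itself.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Copy the first `len` characters of the C string `src` into `dst`.
 * snprintf() writes at most (size - 1) characters plus a NUL, so the size
 * argument must be len + 1; passing len would drop the last character.
 * Returns 0 on success, -1 if `dst` cannot hold len characters plus the NUL.
 */
int
sketch_copy_prefix(char *dst, size_t dst_size, const char *src, size_t len)
{
	if (dst == NULL || src == NULL || dst_size < len + 1)
		return -1;
	snprintf(dst, len + 1, "%s", src);
	return 0;
}
```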
Fixes: a23bc2c4e01b ("devargs: add non-variadic parsing function")
Reported-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Currently, we need runtime dir to put all of our runtime info in,
including the DPDK shared config. However, we use the shared
config to determine our proc type, and this happens earlier than
we actually create the config dir and thus can know where to
place the config file.
Fix this by moving runtime dir creation right after the EAL
arguments parsing, but before proc type autodetection. Also,
previously we were creating the config file unconditionally,
even if we specified no_shconf - fix it by only creating
the config file if no_shconf is not set.
Fixes: adf1d867361c ("eal: move runtime config file to new location")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
The original code did not align any addresses that were requested as
page-aligned, but were different because addr_is_hint was set.
The fix below by Dariusz introduced an issue where all unaligned addresses
were left unaligned.
This patch is a partial revert of
commit 7fa7216ed48d ("mem: fix alignment of requested virtual areas")
and implements a proper fix for this issue, by asking for alignment in all
but the following two cases:
1) page size is equal to system page size, or
2) we got an aligned requested address, and will not accept a different one
This ensures that alignment is performed in all cases, except for those where
we can guarantee that the address will not need alignment.
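The two exceptions above can be expressed as a small predicate. This is a sketch with hypothetical names that assumes `page_sz` is a power of two; it is not the exact code in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decide whether a virtual area reservation needs explicit alignment.
 * Alignment is skipped only when:
 *  1) the page size equals the system page size, or
 *  2) the requested address is already page-aligned and is not a mere hint,
 *     i.e. no other address will be accepted.
 */
bool
sketch_needs_alignment(const void *requested_addr, size_t page_sz,
		size_t system_page_sz, bool addr_is_hint)
{
	bool addr_aligned = requested_addr != NULL &&
		((uintptr_t)requested_addr & (page_sz - 1)) == 0;

	if (page_sz == system_page_sz)
		return false; /* case 1 */
	if (addr_aligned && !addr_is_hint)
		return false; /* case 2 */
	return true;
}
```

Note that an aligned address that is only a hint still requires alignment, because the kernel may hand back a different, unaligned address.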
Fixes: b7cc54187ea4 ("mem: move virtual area function in common directory")
Fixes: 7fa7216ed48d ("mem: fix alignment of requested virtual areas")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
Acked-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Fixed possible out-of-bounds issue:
lib/librte_eal/common/eal_common_devargs.c:
In function ‘rte_devargs_layers_parse’:
lib/librte_eal/common/eal_common_devargs.c:121:7:
error: array subscript is above array bounds
Bugzilla ID: 71
Fixes: 338327d731e6 ("devargs: add function to parse device layers")
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Use the iteration hooks in the abstraction layers to perform the
requested filtering on the internal device lists.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Parse a device description.
Split this description into its relevant parts, one for each layer.
No dynamic allocation is performed.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>