Now that we can retrieve page fd's internally, we can expose it
as an external API. This will add two flavors of API - thread-safe
and non-thread-safe. Fix up internal API's to return values we need
without modifying rte_errno internally if called from within EAL.
We do not want calling code to accidentally close an internal fd, so
we make a duplicate of it before we return it to the user. Caller is
therefore responsible for closing this fd.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Enable setting and retrieving segment fd's internally.
For now, retrieving fd's will not be used anywhere until we
get an external API, but it will be useful for things like
virtio, where we wish to share segment fd's.
Setting segment fd's will not be available as a public API
at this time, but internally it is needed for legacy mode,
because we're not allocating our hugepages in memalloc in
legacy mode case, and we still need to store the fd.
Another user of get segment fd API is memseg info dump, to
show which pages use which fd's.
Not supported on FreeBSD.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Previously, we were only tracking lock file fd's in single-file
segments mode, but did not track fd's in non-single file mode
because we didn't need to (mmap() call still kept the lock). Now
that we are going to expose these fd's to the world, we need to
have access to them, so track them even in non-single file
segments mode.
We don't need to close fd's after mmap() because we're still
tracking them in an fd list. Also, for anonymous hugepages mode,
fd will always be -1 so exit early on error.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Previously, we were only using lock lists to store per-page lock fd's
because we cannot use modern fcntl() file description locks to lock
parts of the page in single file segments mode.
Now, we will be using this list to store either lock fd's (along with
memseg list fd) in single file segments mode, or per-page fd's (and set
memseg list fd to -1), so rename the list accordingly.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Previously, when we allocated hugepages, we closed the fd's corresponding
to them after we've done our mappings. Since we did mmap(), we didn't
actually lose the reference, but file descriptors used for mmap() do not
count against the fd limit. Since we are going to store all of our fd's,
we will hit the fd limit much more often when using smaller page sizes.
Fix this to raise the fd limit to maximum unconditionally.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
A few regressions with virtio/vhost have been discovered, due to the
strong dependency of virtio/vhost on the underlying memory layout.
Specifically, virtio/vhost share all memory pages starting from the
beginning of the segment, while the patch below made it so that the
memory is always allocated from the top of VA space, not from the
bottom.
Fixes: 179f916e88 ("mem: allocate in reverse to reduce fragmentation")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Clear vfio_group_fd is not necessary to involve any IPC.
Also, current IPC implementation for SOCKET_CLR_GROUP is not
correct. rte_vfio_clear_group on secondary will always fail,
that prevent device be detached correctly on a secondary process.
The patch simply removes all IPC related stuff in
rte_vfio_clear_group.
Fixes: 83a73c5fef ("vfio: use generic multi-process channel")
Cc: stable@dpdk.org
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Technically, single file segments codepath will never get
triggered when using in-memory mode, because EAL prohibits
mixing these two options at initialization time. However,
code analyzers do not know that, and some will complain
about either using uninitialized variables, or trying to
do operations on an already closed descriptor.
Fix this by assuring the compiler or code analyzer that
in-memory mode code never gets triggered when using
single-file segments mode.
Coverity issue: 302847
Fixes: 72b49ff623 ("mem: support --in-memory mode")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, we need runtime dir to put all of our runtime info in,
including the DPDK shared config. However, we use the shared
config to determine our proc type, and this happens earlier than
we actually create the config dir and thus can know where to
place the config file.
Fix this by moving runtime dir creation right after the EAL
arguments parsing, but before proc type autodetection. Also,
previously we were creating the config file unconditionally,
even if we specified no_shconf - fix it by only creating
the config file if no_shconf is not set.
Fixes: adf1d86736 ("eal: move runtime config file to new location")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
This function is private to the EAL.
It is used to parse each layers in a device description string,
and store the result in an rte_devargs structure.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
This abstraction exists since the infancy of DPDK.
It needs to be fleshed out however, to allow a generic
description of devices properties and capabilities.
A device class is the northbound interface of the device, intended
for applications to know what it can be used for.
It is conceptually just above buses.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Since uuid functions may not be available everywhere, implement
uuid functions in DPDK. These are based off the BSD licensed
libuuid in util-link.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Implement the final piece of the in-memory mode puzzle - enable running
DPDK entirely in memory, without creating any files.
To do it, use mmap with MAP_HUGETLB and size flags to enable DPDK to work
without hugetlbfs mountpoints. In order to enable this, a few things needed
to be changed.
First of all, we need to allow empty hugetlbfs mountpoints in
hugepage_info, and handle them correctly (by not trying to create any
files and lock any directories).
Next, we need to reorder the mapping sequence, because the page is not
really allocated until the page fault, and we cannot get its IOVA
address before we trigger the page fault.
Finally, decide at compile time whether we are going to be supporting
anonymous hugepages or not, because we cannot check for it at runtime.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Unlink hugepages after creating them, to honor the hugepage-unlink mode.
We cannot resize non-existing files, so make single file segments
explicitly unsupported.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Now that the rest of the EAL is adjusted to not create any shared
files, prevent runtime directory from ever being created.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Do not create any shared hugepage size info files if we were
asked to not create any shared files.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Next commit will make asynchronous IPC requests rely on alarm API,
which in turn relies on interrupts to work. Therefore, move the EAL
interrupt initialization before IPC initialization to avoid breaking
IPC in the next commit.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
IPC uses interrupts API internally, and memory subsystem uses IPC.
Therefore, IPC should not use rte_malloc to avoid circular dependency.
Switch to using regular glibc malloc in interrupts API.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Alarm API is going to be used by IPC internally. However, because
memory subsystem depends on IPC, alarm API cannot use rte_malloc as
it creates a circular dependency.
To avoid such chicken and egg problem, we change to use glibc malloc
in the alarm API.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Some static analyzers complain about it, even though
value is never used if not initialized. To avoid additional
false positives about a potential null-pointer dereferences,
also add a null-check.
Bugzilla ID: 58
Fixes: ea2dc10668 ("vfio: add multi container support")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
The value is not used, but some static analyzers may give out a
warning. Fix it by assigning default value of zero.
Bugzilla ID: 58
Fixes: cdc242f260 ("eal/linux: support running as unprivileged user")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Parentheses were missing. It worked because macro is enclosed in
parentheses, so syntax was valid after macro expansion.
Bugzilla ID: 58
Fixes: 0a45657a67 ("pci: rework interrupt handling")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Previously, it was possible to limit maximum amount of memory
allowed for allocation by creating validator callbacks. Although a
powerful tool, it's a bit of a hassle and requires modifying the
application for it to work with DPDK example applications.
Fix this by adding a new parameter "--socket-limit", with syntax
similar to "--socket-mem", which would set per-socket memory
allocation limits, and set up a default validator callback to deny
all allocations above the limit.
This option is incompatible with legacy mode, as validator callbacks
are not supported there.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_list_walk() inside callbacks.
Also, remove existing reimplementation from memalloc code and use
the new API instead.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_walk() inside callbacks.
Also, remove existing reimplementation from sPAPR VFIO code and use
the new API instead.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
When rte_eal_cleanup() is called, it is expected that DPDK will be able to
release all of its memory back to the system. However, if pages are marked
as unfreeable, the pages will not be released back. Fix this to mark all
pages as freeable on calling rte_eal_cleanup(), but only do it for primary
process, as secondaries can come and go.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, all hugepages are allocated from lower VA address to
higher VA address, while malloc heap allocates from higher VA
address to lower VA address. This results in heap fragmentation
over time due to multiple reserves leaving small space below the
allocated elements.
Fix this by allocating VA memory from the top, thereby reducing
fragmentation and lowering overall memory usage.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
In the perfect world, it wouldn't matter how much memory was
preallocated because most of it was always going to be private
anonymous zero-page mappings for the duration of the program.
However, in practice, due to peculiarities of FreeBSD, we need
to additionally limit memory allocation there. This patch moves
the segment preallocation to EAL private functions that will be
implemented by an OS-specific EAL rather than being in the common
memory-related code.
Since there is no support for growing/shrinking memory use at
runtime on FreeBSD anyway, this does not inhibit any functionality
but makes core dumps faster even on default settings.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
The doc says this function returns negative errno
on error, but it currently returns either -1 or
positive errno.
It was incorrectly assumed that pthread_setname_np()
returns negative error numbers. It always returns
positive ones, so this patch negates its return value
before returning.
Fixes: 3901ed99c2 ("eal: fix thread naming on FreeBSD")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
This isn't documented in the manuals, but a failed
mmap(..., MAP_FIXED) may still unmap overlapping
regions. In such case, we need to remap these regions
back into our address space to ensure mem contiguity.
We do it unconditionally now on mmap failure just to
be safe.
Verified on Linux 4.9.0-4-amd64. I was getting
ENOMEM when trying to map hugetlbfs with no space
left, and the previous anonymous mapping was still
being removed.
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
EAL reserves a huge area in virtual address space
to provide virtual address contiguity for e.g.
future memory extensions (memory hotplug). During
memory hotplug, if the hugepage mmap succeeds but
doesn't suffice EAL's requiriments, the EAL would
unmap this mapping straight away, leaving a hole in
its virtual memory area and making it available
to everyone. As EAL still thinks it owns the entire
region, it may try to mmap it later with MAP_FIXED,
possibly overriding a user's mapping that was made
in the meantime.
This patch ensures each hole is mapped back by EAL,
so that it won't be available to anyone else.
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Rather than copy the log message, we can use a precision in the format
string given to syslog.
Signed-off-by: David Marchand <david.marchand@6wind.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
Executable bit must be set on directories for normal users to enter them.
This patch addresses the inability to start DPDK applications as non-root
due to errors such as:
EAL: failed to bind /tmp/dpdk/rte/mp_socket: Permission denied
Fixes: 56236363b4 ("eal: add directory for runtime data")
Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
The intention of the original code was to create runtime data
directory as early as possible, however it was moved too early,
before the arguments were parsed, resulting in --file-prefix
option essentially not working.
Fix this by moving eal_create_runtime_dir() to after command
line arguments parsing.
Fixes: 56236363b4 ("eal: add directory for runtime data")
Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
Fix all calls to functions in eal_filesystem to produce paths
residing inside dedicated DPDK runtime directory. Leaving DPDK
runtime config in place as 3rd-party applications within the
DPDK ecosystem might rely on this path to determine whether
DPDK is running, so moving that will be postponed to the next
release cycle.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, during runtime, DPDK will store a bunch of files here
and there (in /var/run, /tmp or in $HOME). Fix it by creating a
DPDK-specific runtime directory, under which all runtime data
will be placed. The template for creating this runtime directory
is the following:
<base path>/dpdk/<DPDK prefix>/
Where <base path> is set to either "/var/run" if run as root, or
$XDG_RUNTIME_DIR if run as non-root, with a fallback to /tmp if
$XDG_RUNTIME_DIR is not defined. So, for example, if run as root,
by default all runtime data will be stored at /var/run/dpdk/rte/.
There is no equivalent of "mkdir -p", so we will be creating the
path step by step.
Nothing uses this new path yet, changes for that will come in
next commit.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
The original name for this path was not too descriptive and
confusing. Rename it to a more appropriate and descriptive name:
it stores data about hugepages, so name it eal_hugepage_data_path().
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
Currently, page deallocation might fail if allocator cannot get page
fd, which will leave VA space still mapped, and will also not mark
page as free.
Fix page deallocation function to always unmap space before trying
to get rid of the page itself, and always mark page as free even if
page deallocation failed.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Return value should be zero for success, but if unlock and unlink
have succeeded, return value was 1, which triggered failure message
in calling code.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Segment index was calculated incorrectly, causing free_seg to
attempt to free segments that do not exist.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Yong Liu <yong.liu@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
The code aimed to pick and remember the value of
mempool ops name from EAL command line arguments does not
copy the string and remembers the pointer provided
by getopt_long() directly. The latter could be clobbered
later and result in reading wrong mbuf pool ops name
by rte_mempool library.
Typically, this flaw could be avoided by using strdup()
to remember the string value of the option.
Fixes: a103a97e71 ("eal: allow user to override default mempool driver")
Cc: stable@dpdk.org
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
It may be useful to pass arbitrary data to the callback (such
as device pointers), so add this to the mem event callback API.
Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
At hugepage info initialization, EAL takes out a write lock on
hugetlbfs directories, and drops it after the memory init is
finished. However, in non-legacy mode, if "-m" or "--socket-mem"
switches are passed, this leads to a deadlock because EAL tries
to allocate pages (and thus take out a write lock on hugedir)
while still holding a separate hugedir write lock in EAL.
Fix it by checking if write lock in hugepage info is active, and
not trying to lock the directory if the hugedir fd is valid.
Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Shahaf Shuler <shahafs@mellanox.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't as useful because we still need to grow and shrink it.
However, according to fcntl()'s Ubuntu manpage [1], semantics of
fcntl() locks have a giant oversight:
This interface follows the completely stupid semantics of System
V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
locks associated with a file for a given process are removed
when any file descriptor for that file is closed by that process.
This semantic means that applications must be aware of any files
that a subroutine library may access.
Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fd's) will drop the lock completely.
So, in this commit, we will be reverting back to using flock() locks
everywhere. However, that still leaves the problem of locking parts
of a memseg list file in single file segments mode, and we will be
solving it with creating separate lock files per each page, and
tracking those with flock().
We will also be removing all of this tailq business and replacing it
with a simple array - saving a few bytes is not worth the extra
hassle of dealing with pointers and potential memory allocation
failures. Also, remove the tailq lock since it is not needed - these
fd lists are per-process, and within a given process, it is always
only one thread handling access to hugetlbfs.
So, first one to allocate a segment will create a lockfile, and put
a shared lock on it. When we're shrinking the page file, we will be
trying to take out a write lock on that lockfile, which would fail if
any other process is holding onto the lockfile as well. This way, we
can know if we can shrink the segment file. Also, if no other locks
are found in the lock list for a given memseg list, the memseg list
fd is automatically closed.
One other thing to note is, according to flock() Ubuntu manpage [2],
upgrading the lock from shared to exclusive is implemented by dropping
and reacquiring the lock, which is not atomic and thus would have
created race conditions. So, on attempting to perform operations in
hugetlbfs, we will take out a writelock on hugetlbfs directory, so
that only one process could perform hugetlbfs operations concurrently.
[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66 ("eal: add single file segments option")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Currently, memseg lists for secondary process are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>