869 Commits

Author SHA1 Message Date
Anatoly Burakov
41dbdb6872 mem: add external API to retrieve page fd
Now that we can retrieve page fd's internally, we can expose it
as an external API. This will add two flavors of API - thread-safe
and non-thread-safe. Fix up internal API's to return values we need
without modifying rte_errno internally if called from within EAL.

We do not want calling code to accidentally close an internal fd, so
we make a duplicate of it before we return it to the user. Caller is
therefore responsible for closing this fd.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:48:04 +02:00
Anatoly Burakov
1009ba1704 mem: add internal API to get and set segment fd
Enable setting and retrieving segment fd's internally.

For now, retrieving fd's will not be used anywhere until we
get an external API, but it will be useful for things like
virtio, where we wish to share segment fd's.

Setting segment fd's will not be available as a public API
at this time, but internally it is needed for legacy mode,
because we're not allocating our hugepages in memalloc in
legacy mode case, and we still need to store the fd.

Another user of get segment fd API is memseg info dump, to
show which pages use which fd's.

Not supported on FreeBSD.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:46:34 +02:00
Anatoly Burakov
16cab6e5c8 mem: track page fd in non-single file mode
Previously, we were only tracking lock file fd's in single-file
segments mode, but did not track fd's in non-single file mode
because we didn't need to (mmap() call still kept the lock). Now
that we are going to expose these fd's to the world, we need to
have access to them, so track them even in non-single file
segments mode.

We don't need to close fd's after mmap() because we're still
tracking them in an fd list. Also, for anonymous hugepages mode,
fd will always be -1 so exit early on error.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:44:11 +02:00
Anatoly Burakov
a033a4158b mem: rename lock list to fd list
Previously, we were only using lock lists to store per-page lock fd's
because we cannot use modern fcntl() file description locks to lock
parts of the page in single file segments mode.

Now, we will be using this list to store either lock fd's (along with
memseg list fd) in single file segments mode, or per-page fd's (and set
memseg list fd to -1), so rename the list accordingly.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:43:14 +02:00
Anatoly Burakov
18329a4366 mem: raise maximum fd limit unconditionally
Previously, when we allocated hugepages, we closed the fd's corresponding
to them after we've done our mappings. Since we did mmap(), we didn't
actually lose the reference, but file descriptors used for mmap() do not
count against the fd limit. Since we are going to store all of our fd's,
we will hit the fd limit much more often when using smaller page sizes.

Fix this to raise the fd limit to maximum unconditionally.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:41:38 +02:00
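
Raising the fd limit to its maximum is normally done with getrlimit()/setrlimit(). The sketch below assumes that approach; `raise_fd_limit_to_max` is a hypothetical name, and the actual patch may differ in detail.

```c
#include <sys/resource.h>

/*
 * Raise the soft limit on open file descriptors to the hard limit.
 * Returns 0 on success, -1 on failure (errno set by the failing call).
 */
int raise_fd_limit_to_max(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max; /* soft limit := hard limit */
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

An unprivileged process may raise its soft limit up to the hard limit, so this needs no extra capabilities.
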
Olivier Matz
6b867cc113 eal: remove experimental tag for user mbuf pool ops
Remove experimental tag from rte_eal_mbuf_user_pool_ops().

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
2018-08-09 01:03:14 +02:00
Olivier Matz
83a8a143bb eal: remove deprecated function for mbuf pool ops
rte_eal_mbuf_default_mempool_ops() is replaced by
rte_mbuf_best_mempool_ops().

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
2018-08-09 01:03:14 +02:00
Hemant Agrawal
787ae736a3 vfio: remove experimental tag
Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-26 23:46:18 +02:00
Anatoly Burakov
7350d1be05 mem: revert reversed allocation
A few regressions with virtio/vhost have been discovered, due to the
strong dependency of virtio/vhost on the underlying memory layout.
Specifically, virtio/vhost share all memory pages starting from the
beginning of the segment, while the patch below made it so that the
memory is always allocated from the top of VA space, not from the
bottom.

Fixes: 179f916e88 ("mem: allocate in reverse to reduce fragmentation")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-26 11:15:52 +02:00
Qi Zhang
6a015363b3 vfio: remove unnecessary IPC for group fd clear
Clearing vfio_group_fd does not need to involve any IPC.
Also, the current IPC implementation for SOCKET_CLR_GROUP is not
correct: rte_vfio_clear_group on a secondary process will always fail,
which prevents a device from being detached correctly on a secondary process.
The patch simply removes all IPC-related code in
rte_vfio_clear_group.

Fixes: 83a73c5fef ("vfio: use generic multi-process channel")
Cc: stable@dpdk.org

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-20 14:26:16 +02:00
Anatoly Burakov
dd536a8bc5 mem: add logic check for static analyzer
Technically, single file segments codepath will never get
triggered when using in-memory mode, because EAL prohibits
mixing these two options at initialization time. However,
code analyzers do not know that, and some will complain
about either using uninitialized variables, or trying to
do operations on an already closed descriptor.

Fix this by assuring the compiler or code analyzer that
in-memory mode code never gets triggered when using
single-file segments mode.

Coverity issue: 302847
Fixes: 72b49ff623 ("mem: support --in-memory mode")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-20 11:32:03 +02:00
Anatoly Burakov
e4ea1bbd6e eal: fix dependency in multi-process detection
Currently, we need runtime dir to put all of our runtime info in,
including the DPDK shared config. However, we use the shared
config to determine our proc type, and this happens earlier than
we actually create the config dir and thus can know where to
place the config file.

Fix this by moving runtime dir creation right after the EAL
arguments parsing, but before proc type autodetection. Also,
previously we were creating the config file unconditionally,
even if we specified no_shconf - fix it by only creating
the config file if no_shconf is not set.

Fixes: adf1d86736 ("eal: move runtime config file to new location")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
2018-07-19 12:05:14 +02:00
Gaetan Rivet
338327d731 devargs: add function to parse device layers
This function is private to the EAL.
It is used to parse each layer in a device description string,
and store the result in an rte_devargs structure.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
2018-07-15 23:43:34 +02:00
Gaetan Rivet
d70f8448d0 eal: introduce device class abstraction
This abstraction has existed since the infancy of DPDK.
It needs to be fleshed out however, to allow a generic
description of device properties and capabilities.

A device class is the northbound interface of the device, intended
for applications to know what it can be used for.

It is conceptually just above buses.

Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
2018-07-15 23:42:53 +02:00
Stephen Hemminger
6bc67c497a eal: add uuid API
Since uuid functions may not be available everywhere, implement
uuid functions in DPDK. These are based off the BSD licensed
libuuid in util-linux.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
2018-07-13 23:42:08 +02:00
Anatoly Burakov
72b49ff623 mem: support --in-memory mode
Implement the final piece of the in-memory mode puzzle - enable running
DPDK entirely in memory, without creating any files.

To do it, use mmap with MAP_HUGETLB and size flags to enable DPDK to work
without hugetlbfs mountpoints. In order to enable this, a few things needed
to be changed.

First of all, we need to allow empty hugetlbfs mountpoints in
hugepage_info, and handle them correctly (by not trying to create any
files or lock any directories).

Next, we need to reorder the mapping sequence, because the page is not
really allocated until the page fault, and we cannot get its IOVA
address before we trigger the page fault.

Finally, decide at compile time whether we are going to be supporting
anonymous hugepages or not, because we cannot check for it at runtime.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 15:35:43 +02:00
Anatoly Burakov
d435aad37d mem: support --huge-unlink mode
Unlink hugepages after creating them, to honor the hugepage-unlink mode.
We cannot resize non-existing files, so make single file segments
explicitly unsupported.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 15:34:17 +02:00
Anatoly Burakov
5cb42707bc eal: do not create runtime dir in --no-shconf mode
Now that the rest of the EAL is adjusted to not create any shared
files, prevent runtime directory from ever being created.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 15:33:51 +02:00
Anatoly Burakov
cb14962a00 eal: support --no-shconf in hugepage data file
Do not create a shared hugepage data file if we were asked to
not create any shared files.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 15:33:27 +02:00
Anatoly Burakov
7296447acb eal: support --no-shconf for hugepage info
Do not create any shared hugepage size info files if we were
asked to not create any shared files.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 15:33:07 +02:00
Jianfeng Tan
d74b7748d6 eal: bring forward init of interrupt handling
Next commit will make asynchronous IPC requests rely on alarm API,
which in turn relies on interrupts to work. Therefore, move the EAL
interrupt initialization before IPC initialization to avoid breaking
IPC in the next commit.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 12:41:15 +02:00
Jianfeng Tan
4bb69970af eal/linux: use libc malloc in interrupt handling
IPC uses interrupts API internally, and memory subsystem uses IPC.
Therefore, IPC should not use rte_malloc to avoid circular dependency.
Switch to using regular glibc malloc in interrupts API.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 12:40:25 +02:00
Jianfeng Tan
204df26c1b eal/linux: use libc malloc in alarm
Alarm API is going to be used by IPC internally. However, because
memory subsystem depends on IPC, alarm API cannot use rte_malloc as
it creates a circular dependency.

To avoid such a chicken-and-egg problem, we switch to using glibc malloc
in the alarm API.

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 12:39:51 +02:00
Anatoly Burakov
c63a42535a vfio: fix uninitialized variable
Some static analyzers complain about it, even though the
value is never used if not initialized. To avoid additional
false positives about potential null-pointer dereferences,
also add a null-check.

Bugzilla ID: 58
Fixes: ea2dc10668 ("vfio: add multi container support")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:44:56 +02:00
Anatoly Burakov
96712b33af eal/linux: fix uninitialized value
The value is not used, but some static analyzers may give out a
warning. Fix it by assigning default value of zero.

Bugzilla ID: 58
Fixes: cdc242f260 ("eal/linux: support running as unprivileged user")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:44:43 +02:00
Anatoly Burakov
462dd3722e eal/linux: fix invalid syntax in interrupts
Parentheses were missing. It worked because macro is enclosed in
parentheses, so syntax was valid after macro expansion.

Bugzilla ID: 58
Fixes: 0a45657a67 ("pci: rework interrupt handling")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:44:17 +02:00
Anatoly Burakov
e4348122a4 eal: add option to limit memory allocation on sockets
Previously, it was possible to limit maximum amount of memory
allowed for allocation by creating validator callbacks. Although a
powerful tool, it's a bit of a hassle and requires modifying the
application for it to work with DPDK example applications.

Fix this by adding a new parameter "--socket-limit", with syntax
similar to "--socket-mem", which would set per-socket memory
allocation limits, and set up a default validator callback to deny
all allocations above the limit.

This option is incompatible with legacy mode, as validator callbacks
are not supported there.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:44:15 +02:00
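
A rough model of what a per-socket limit option involves: parse a comma-separated list of megabyte values (the syntax `--socket-mem` uses) and deny any allocation that would exceed the configured limit. Everything named below (`MAX_SOCKETS`, `parse_socket_limits`, `deny_above_limit`) is hypothetical, and treating 0 as "no limit" is an assumption of this sketch rather than something the commit message states.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_SOCKETS 8 /* hypothetical cap for this sketch */

/*
 * Parse "1024,0,2048" (megabytes per socket) into limits[] in bytes.
 * Returns the number of sockets parsed, or -1 on malformed input.
 */
int parse_socket_limits(const char *arg, size_t limits[MAX_SOCKETS])
{
    const char *p = arg;
    int n = 0;

    while (n < MAX_SOCKETS) {
        char *end;
        unsigned long mb;

        if (*p < '0' || *p > '9')
            return -1;           /* reject empty fields, signs, spaces */
        errno = 0;
        mb = strtoul(p, &end, 10);
        if (errno != 0)
            return -1;
        limits[n++] = (size_t)mb << 20; /* MB -> bytes */
        if (*end == '\0')
            return n;
        if (*end != ',')
            return -1;
        p = end + 1;
    }
    return -1; /* more values than sockets */
}

/* Validator: 0 allows growing the socket to new_len bytes, -1 denies it. */
int deny_above_limit(const size_t limits[MAX_SOCKETS], int socket_id,
                     size_t new_len)
{
    if (socket_id < 0 || socket_id >= MAX_SOCKETS)
        return -1;
    if (limits[socket_id] == 0)
        return 0; /* assumption: 0 means "no limit" */
    return new_len > limits[socket_id] ? -1 : 0;
}
```
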
Anatoly Burakov
e26415428f mem: provide thread-unsafe memseg list walk variant
Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_list_walk() inside callbacks.

Also, remove existing reimplementation from memalloc code and use
the new API instead.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:21:25 +02:00
Anatoly Burakov
7c790af08f mem: provide thread-unsafe memseg walk variant
Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_walk() inside callbacks.

Also, remove existing reimplementation from sPAPR VFIO code and use
the new API instead.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:21:15 +02:00
Anatoly Burakov
76480e3885 mem: mark pages as freeable on exit
When rte_eal_cleanup() is called, it is expected that DPDK will be able to
release all of its memory back to the system. However, if pages are marked
as unfreeable, the pages will not be released back. Fix this to mark all
pages as freeable on calling rte_eal_cleanup(), but only do it for primary
process, as secondaries can come and go.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:06:14 +02:00
Anatoly Burakov
179f916e88 mem: allocate in reverse to reduce fragmentation
Currently, all hugepages are allocated from lower VA address to
higher VA address, while malloc heap allocates from higher VA
address to lower VA address. This results in heap fragmentation
over time due to multiple reserves leaving small space below the
allocated elements.

Fix this by allocating VA memory from the top, thereby reducing
fragmentation and lowering overall memory usage.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:04:53 +02:00
Anatoly Burakov
1d406458db mem: make segment preallocation OS-specific
In the perfect world, it wouldn't matter how much memory was
preallocated because most of it was always going to be private
anonymous zero-page mappings for the duration of the program.
However, in practice, due to peculiarities of FreeBSD, we need
to additionally limit memory allocation there. This patch moves
the segment preallocation to EAL private functions that will be
implemented by an OS-specific EAL rather than being in the common
memory-related code.

Since there is no support for growing/shrinking memory use at
runtime on FreeBSD anyway, this does not inhibit any functionality
but makes core dumps faster even on default settings.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:59:18 +02:00
Dariusz Stojaczyk
82dcc8b4bc eal: fix return codes on thread naming failure
The doc says this function returns negative errno
on error, but it currently returns either -1 or
positive errno.

It was incorrectly assumed that pthread_setname_np()
returns negative error numbers. It always returns
positive ones, so this patch negates its return value
before returning.

Fixes: 3901ed99c2 ("eal: fix thread naming on FreeBSD")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
2018-07-13 00:26:22 +02:00
Dariusz Stojaczyk
0762c438b8 mem: do not unmap overlapping region on mmap failure
This isn't documented in the manuals, but a failed
mmap(..., MAP_FIXED) may still unmap overlapping
regions. In such case, we need to remap these regions
back into our address space to ensure mem contiguity.
We do it unconditionally now on mmap failure just to
be safe.

Verified on Linux 4.9.0-4-amd64. I was getting
ENOMEM when trying to map hugetlbfs with no space
left, and the previous anonymous mapping was still
being removed.

Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:07 +02:00
Dariusz Stojaczyk
637175ab95 mem: do not leave unmapped holes in EAL memory area
EAL reserves a huge area in virtual address space
to provide virtual address contiguity for e.g.
future memory extensions (memory hotplug). During
memory hotplug, if the hugepage mmap succeeds but
doesn't satisfy EAL's requirements, the EAL would
unmap this mapping straight away, leaving a hole in
its virtual memory area and making it available
to everyone. As EAL still thinks it owns the entire
region, it may try to mmap it later with MAP_FIXED,
possibly overriding a user's mapping that was made
in the meantime.

This patch ensures each hole is mapped back by EAL,
so that it won't be available to anyone else.

Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:05 +02:00
David Marchand
0c41aab8e2 log: remove useless intermediate buffer
Rather than copy the log message, we can use a precision in the format
string given to syslog.

Signed-off-by: David Marchand <david.marchand@6wind.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
2018-06-27 18:17:56 +02:00
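
The precision trick mentioned in this commit is the standard `%.*s` conversion, which prints at most a given number of bytes from a buffer. A minimal sketch, with hypothetical helper names; the snprintf() variant uses the same format so the effect can be observed without a syslog daemon.

```c
#include <stdio.h>
#include <syslog.h>

/* Log at most len bytes of buf without copying it into a second buffer. */
void log_slice(int level, const char *buf, size_t len)
{
    syslog(level, "%.*s", (int)len, buf);
}

/* Same format string with snprintf(), to make the truncation visible. */
int format_slice(char *out, size_t outsz, const char *buf, size_t len)
{
    return snprintf(out, outsz, "%.*s", (int)len, buf);
}
```

Because the precision is passed as an `int` argument, the source buffer does not need to be NUL-terminated at `len`.
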
Adrien Mazarguil
97c228a0aa eal: fix runtime directory permissions
Executable bit must be set on directories for normal users to enter them.

This patch addresses the inability to start DPDK applications as non-root
due to errors such as:

 EAL: failed to bind /tmp/dpdk/rte/mp_socket: Permission denied

Fixes: 56236363b4 ("eal: add directory for runtime data")

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
2018-05-21 01:08:26 +02:00
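
For a directory, the execute bit is the search permission, so a mode like 0600 lets the owner list the directory but not enter it or bind a socket inside it. A sketch of creating such a directory with the bit set follows; the 0700 mode and the helper name are assumptions of this example, not quoted from the patch.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create a private runtime directory the owner can actually enter.
 * An already existing directory is not treated as an error.
 */
int create_runtime_dir(const char *path)
{
    if (mkdir(path, 0700) != 0 && errno != EEXIST)
        return -1;
    return 0;
}
```
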
Anatoly Burakov
3f697d2ee5 eal: move runtime directory creation after args parsing
The intention of the original code was to create runtime data
directory as early as possible, however it was moved too early,
before the arguments were parsed, resulting in --file-prefix
option essentially not working.

Fix this by moving eal_create_runtime_dir() to after command
line arguments parsing.

Fixes: 56236363b4 ("eal: add directory for runtime data")

Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
2018-05-15 15:22:40 +02:00
Anatoly Burakov
5b18d86dec eal: move runtime data into dedicated directory
Fix all calls to functions in eal_filesystem to produce paths
residing inside dedicated DPDK runtime directory. Leaving DPDK
runtime config in place as 3rd-party applications within the
DPDK ecosystem might rely on this path to determine whether
DPDK is running, so moving that will be postponed to the next
release cycle.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-05-15 00:35:12 +02:00
Anatoly Burakov
56236363b4 eal: add directory for runtime data
Currently, during runtime, DPDK will store a bunch of files here
and there (in /var/run, /tmp or in $HOME). Fix it by creating a
DPDK-specific runtime directory, under which all runtime data
will be placed. The template for creating this runtime directory
is the following:

  <base path>/dpdk/<DPDK prefix>/

Where <base path> is set to either "/var/run" if run as root, or
$XDG_RUNTIME_DIR if run as non-root, with a fallback to /tmp if
$XDG_RUNTIME_DIR is not defined. So, for example, if run as root,
by default all runtime data will be stored at /var/run/dpdk/rte/.

There is no equivalent of "mkdir -p", so we will be creating the
path step by step.

Nothing uses this new path yet, changes for that will come in
next commit.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
2018-05-15 00:35:08 +02:00
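
The base-path selection rule in this commit can be written as a small pure function. The sketch below takes the uid and the value of $XDG_RUNTIME_DIR as parameters so the rule is easy to test; `build_runtime_path` is a hypothetical name and is not taken from the EAL sources.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Build "<base path>/dpdk/<prefix>" into buf.
 * Base path: "/var/run" for root, otherwise $XDG_RUNTIME_DIR if set,
 * otherwise "/tmp". Returns 0 on success, -1 if buf is too small.
 */
int build_runtime_path(char *buf, size_t size, uid_t uid,
                       const char *xdg_runtime_dir, const char *prefix)
{
    const char *base;
    int n;

    if (uid == 0)
        base = "/var/run";
    else if (xdg_runtime_dir != NULL)
        base = xdg_runtime_dir;
    else
        base = "/tmp";

    n = snprintf(buf, size, "%s/dpdk/%s", base, prefix);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A caller would typically pass `getuid()` and `getenv("XDG_RUNTIME_DIR")`, then create `<base path>/dpdk` and the prefix directory beneath it one mkdir() call at a time.
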
Anatoly Burakov
a2a2e499e5 mem: rename function returning hugepage data path
The original name for this path was not very descriptive and was
confusing. Rename it to a more appropriate and descriptive name:
it stores data about hugepages, so name it eal_hugepage_data_path().

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
2018-05-15 00:35:02 +02:00
Anatoly Burakov
cd3da7cb0e mem: fix unmapping and marking segments as free
Currently, page deallocation might fail if allocator cannot get page
fd, which will leave VA space still mapped, and will also not mark
page as free.

Fix page deallocation function to always unmap space before trying
to get rid of the page itself, and always mark page as free even if
page deallocation failed.

Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
2018-05-14 03:17:48 +02:00
Anatoly Burakov
3d2d9861a6 mem: fix return code of freeing segment on failure
Return value should be zero for success, but if unlock and unlink
have succeeded, return value was 1, which triggered failure message
in calling code.

Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
2018-05-14 03:15:36 +02:00
Anatoly Burakov
b3b1b83bad mem: fix index for unmapping segments on failure
Segment index was calculated incorrectly, causing free_seg to
attempt to free segments that do not exist.

Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Yong Liu <yong.liu@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
2018-05-14 03:15:33 +02:00
Thomas Monjalon
c9d034d873 mem: fix typo in local function name
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-05-14 01:32:14 +02:00
Ivan Malov
33b3181791 eal: fix mempool ops name parsing
The code aimed to pick and remember the value of
mempool ops name from EAL command line arguments does not
copy the string and remembers the pointer provided
by getopt_long() directly. The latter could be clobbered
later and result in reading wrong mbuf pool ops name
by rte_mempool library.

Typically, this flaw could be avoided by using strdup()
to remember the string value of the option.

Fixes: a103a97e71 ("eal: allow user to override default mempool driver")
Cc: stable@dpdk.org

Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
2018-05-14 01:32:07 +02:00
Anatoly Burakov
0256386dc4 mem: add argument to memory event callback
It may be useful to pass arbitrary data to the callback (such
as device pointers), so add this to the mem event callback API.

Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-05-08 22:28:58 +02:00
Anatoly Burakov
eb8d29f825 mem/linux: fix hugedir write deadlock
At hugepage info initialization, EAL takes out a write lock on
hugetlbfs directories, and drops it after the memory init is
finished. However, in non-legacy mode, if "-m" or "--socket-mem"
switches are passed, this leads to a deadlock because EAL tries
to allocate pages (and thus take out a write lock on hugedir)
while still holding a separate hugedir write lock in EAL.

Fix it by checking if write lock in hugepage info is active, and
not trying to lock the directory if the hugedir fd is valid.

Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Shahaf Shuler <shahafs@mellanox.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
2018-04-30 15:23:17 +02:00
Anatoly Burakov
1a7dc2252f mem: revert to using flock and add per-segment lockfiles
The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't as useful because we still need to grow and shrink it.

However, according to fcntl()'s Ubuntu manpage [1], semantics of
fcntl() locks have a giant oversight:

  This interface follows the completely stupid semantics of System
  V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
  locks associated with a file for a given process are removed
  when any file descriptor for that file is closed by that process.
  This semantic means that applications must be aware of any files
  that a subroutine library may access.

Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fd's) will drop the lock completely.

So, in this commit, we will be reverting back to using flock() locks
everywhere. However, that still leaves the problem of locking parts
of a memseg list file in single file segments mode, and we will be
solving it by creating a separate lock file for each page, and
tracking those with flock().

We will also be removing all of this tailq business and replacing it
with a simple array - saving a few bytes is not worth the extra
hassle of dealing with pointers and potential memory allocation
failures. Also, remove the tailq lock since it is not needed - these
fd lists are per-process, and within a given process, it is always
only one thread handling access to hugetlbfs.

So, first one to allocate a segment will create a lockfile, and put
a shared lock on it. When we're shrinking the page file, we will be
trying to take out a write lock on that lockfile, which would fail if
any other process is holding onto the lockfile as well. This way, we
can know if we can shrink the segment file. Also, if no other locks
are found in the lock list for a given memseg list, the memseg list
fd is automatically closed.

One other thing to note is, according to flock() Ubuntu manpage [2],
upgrading the lock from shared to exclusive is implemented by dropping
and reacquiring the lock, which is not atomic and thus would have
created race conditions. So, on attempting to perform operations in
hugetlbfs, we will take out a write lock on the hugetlbfs directory, so
that only one process can perform hugetlbfs operations at a time.

[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66 ("eal: add single file segments option")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:52:51 +02:00
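
The shared/exclusive protocol on a per-page lock file can be illustrated with two small flock() wrappers. These are hypothetical helpers written for this example: a user of the page holds a shared lock, and whoever wants to shrink the segment file probes with a non-blocking exclusive lock.

```c
#include <errno.h>
#include <sys/file.h>

/* Take a shared lock on an open lock file. Returns 0 on success. */
int lock_shared(int fd)
{
    return flock(fd, LOCK_SH | LOCK_NB);
}

/*
 * Try to take an exclusive lock without blocking.
 * Returns 1 if acquired (no other holder), 0 if the file is locked
 * through another open file description, -1 on any other error.
 */
int try_lock_exclusive(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 1;
    return errno == EWOULDBLOCK ? 0 : -1;
}
```

flock() locks belong to the open file description, so closing an unrelated descriptor for the same file does not drop them, unlike the fcntl() behaviour quoted above.
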
Anatoly Burakov
046aa5c447 mem: add memalloc init stage
Currently, memseg lists for secondary process are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:52:51 +02:00