Some static analyzers complain about it, even though
value is never used if not initialized. To avoid additional
false positives about a potential null-pointer dereferences,
also add a null-check.
Bugzilla ID: 58
Fixes: ea2dc10668 ("vfio: add multi container support")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
The value is not used, but some static analyzers may give out a
warning. Fix it by assigning default value of zero.
Bugzilla ID: 58
Fixes: cdc242f260 ("eal/linux: support running as unprivileged user")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Parentheses were missing. It worked because macro is enclosed in
parentheses, so syntax was valid after macro expansion.
Bugzilla ID: 58
Fixes: 0a45657a67 ("pci: rework interrupt handling")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Previously, it was possible to limit maximum amount of memory
allowed for allocation by creating validator callbacks. Although a
powerful tool, it's a bit of a hassle and requires modifying the
application for it to work with DPDK example applications.
Fix this by adding a new parameter "--socket-limit", with syntax
similar to "--socket-mem", which would set per-socket memory
allocation limits, and set up a default validator callback to deny
all allocations above the limit.
This option is incompatible with legacy mode, as validator callbacks
are not supported there.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, reserving zero-length memzones is done by looking at
malloc statistics, and reserving biggest sized element found in those
statistics. This has two issues.
First, there is a race condition. The heap is unlocked between the
time we check stats, and the time we reserve malloc element for memzone.
This may lead to inability to reserve the memzone we wanted to reserve,
because another allocation might have taken place and biggest sized
element may no longer be available.
Second, the size returned by malloc statistics does not include any
alignment information, which is worked around by being conservative and
subtracting alignment length from the final result. This leads to
fragmentation and reserving memzones that could have been bigger but
aren't.
Fix all of this by using earlier-introduced operation to reserve
biggest possible malloc element. This, however, comes with a trade-off,
because we can only lock one heap at a time. So, if we check the first
available heap and find *any* element at all, that element will be
considered "the biggest", even though other heaps might have bigger
elements. We cannot know what other heaps have before we try and
allocate it, and it is not a good idea to lock all of the heaps at
the same time, so, we will just document this limitation and
encourage users to reserve memzones with socket id properly set.
Also, fixup unit tests to account for the new behavior.
Fixes: fafcc11985 ("mem: rework memzone to be allocated by malloc")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Add an internal-only function to allocate biggest element from
the heap. Nominally, it supports SOCKET_ID_ANY as its socket
argument, but it's essentially useless because other sockets
will only be allocated from if the entire heap on current or
specified socket is busy.
Still, asking to reserve a biggest element will allow fixing
race condition in memzone reserve that has been there for a
long time.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Remy Horton <remy.horton@intel.com>
Adding internal-only function to find biggest free IOVA-contiguous
malloc element. This is not exposed to external API.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Remy Horton <remy.horton@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Previously, when joining adjacent free elements, we were erasing
trailer and header, but did not erase the padding. Fix this by
accounting for padding on erase, and do not erase padding twice
by adjusting data pointer and data len to not include padding.
Fixes: bb372060da ("malloc: make heap a doubly-linked list")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_list_walk() inside callbacks.
Also, remove existing reimplementation from memalloc code and use
the new API instead.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_walk() inside callbacks.
Also, remove existing reimplementation from sPAPR VFIO code and use
the new API instead.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Sometimes, user code needs to walk memseg list while being inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depending on DPDK internals, provide an
official way to do memseg_contig_walk() inside callbacks.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
When rte_eal_cleanup() is called, it is expected that DPDK will be able to
release all of its memory back to the system. However, if pages are marked
as unfreeable, the pages will not be released back. Fix this to mark all
pages as freeable on calling rte_eal_cleanup(), but only do it for primary
process, as secondaries can come and go.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, all hugepages are allocated from lower VA address to
higher VA address, while malloc heap allocates from higher VA
address to lower VA address. This results in heap fragmentation
over time due to multiple reserves leaving small space below the
allocated elements.
Fix this by allocating VA memory from the top, thereby reducing
fragmentation and lowering overall memory usage.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Add a function to return starting point of current contiguous
block, going backwards. All semantics are kept the same as the
existing function, with the only difference being that given the
same input, results will be returned in reverse order.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Add a function to look for N used/free slots, but going backwards
instead of forwards. All semantics are kept similar to the existing
function, with the difference being that given the same input, the
same results will be returned in reverse order.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Add function to look up used/free indexes starting from specified
index, but going backwards instead of forward. Semantics are kept
similar to the existing function, except for the fact that, given
the same input, the results returned will be in reverse order.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Mostly a code move, to have all code related to find_contig in
one place. This slightly changes the API in that previously,
calling find_contig_free() on a full fbarray would've been
an error, but equivalent call to find_contig_used() on an empty
array does not return an error, leading to an inconsistency in
the API.
The decision was made to not treat this condition as an error,
because it is equivalent to calling find_contig() on an index
that just happens to be used/free, which is not an error and
will return 0.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Errno values are supposed to be positive, yet they were negative.
This changes API, so not backporting.
Fixes: c44d09811b ("eal: add shared indexed file-backed array")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
In the perfect world, it wouldn't matter how much memory was
preallocated because most of it was always going to be private
anonymous zero-page mappings for the duration of the program.
However, in practice, due to peculiarities of FreeBSD, we need
to additionally limit memory allocation there. This patch moves
the segment preallocation to EAL private functions that will be
implemented by an OS-specific EAL rather than being in the common
memory-related code.
Since there is no support for growing/shrinking memory use at
runtime on FreeBSD anyway, this does not inhibit any functionality
but makes core dumps faster even on default settings.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Previously, memory allocator always left holes between mapped
contigmem segments, even if they were IOVA-contiguous. Fix this
by remembering last IOVA address and memseg index, and checking
against those when mapping new contigmem segments.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Segment index was set to 0 at start but was never incremented.
This has no consequences other than displayed number of segments
allocated at initialization. Fix this by incrementing it after
displaying.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
This function returned positive error numbers instead
of negative ones as desbribed in the doc. What's worse,
multiple of its callers only check for (rc < 0) to detect
failure.
It was incorrectly assumed that pthread_create
and pthread_setaffinity_np return negative errnos. They
always returns positive ones, so this patch negates their
return values before returning.
Fixes: 9e5afc72c9 ("eal: add function to create control threads")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
The doc says this function returns negative errno
on error, but it currently returns either -1 or
positive errno.
It was incorrectly assumed that pthread_setname_np()
returns negative error numbers. It always returns
positive ones, so this patch negates its return value
before returning.
Fixes: 3901ed99c2 ("eal: fix thread naming on FreeBSD")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
The error is not fatal and we can physically continue
creating the thread. It simply won't have a name.
If rte_thread_setname() fails, we will just print
a debug log now. EAL does the same for lcore threads.
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
Since secondary process' address space is highly dictated
by the primary process' mappings, it doesn't make much
sense to use base-virtaddr for secondary processes.
This patch is intended to fix PCI resource mapping
in secondary processes using the same base-virtaddr
as their primary processes. PCI uses the end of the hugepage
memory area to map all resources. [pci_find_max_end_va()]
It works for primary processes, but can't be mapped 1:1
by secondary ones, as the same addresses are currently always
occupied by shadow memseg lists, which were created with
eal_get_virtual_area(NULL, ...).
```
PRIMARY PROCESS
0x6e00e00000 388K rw-s- fbarray_memseg-2048k-1-3
0x6e01000000 16777216K r---- [ anon ]
0x7201000000 16K rw-s- resource0
SECONDARY PROCESS
0x6e00e00000 388K rw-s- fbarray_memseg-2048k-1-3
0x6e01000000 16777216K r---- [ anon ]
0x7201000000 4K rw-s- fbarray_memseg-1048576k-0-0_203213
```
Fixes: 524e43c2ad ("mem: prepare memseg lists for multiprocess sync")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Whenever a calculated base-virtaddr offset had to be
manually aligned to requested page_sz, we did not take
account of that alignment in incrementing the base-virtaddr
offset further. The next requested virtual area could print
a warning "hint [...] not respected!" and let the system
pick an address instead. As a result, this breaks secondary
process support on many system configurations.
Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Although the alignment mechanism works as intended, the
`no_align` bool flag was set incorrectly. We were aligning
buffers that didn't need extra alignment, and weren't
aligning ones that really needed it.
Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
When trying to use it with an address that's not
managed by DPDK it would segfault due to a missing
check. The doc says this function returns either
a pointer or NULL, so let it do so.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
This isn't documented in the manuals, but a failed
mmap(..., MAP_FIXED) may still unmap overlapping
regions. In such case, we need to remap these regions
back into our address space to ensure mem contiguity.
We do it unconditionally now on mmap failure just to
be safe.
Verified on Linux 4.9.0-4-amd64. I was getting
ENOMEM when trying to map hugetlbfs with no space
left, and the previous anonymous mapping was still
being removed.
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
EAL reserves a huge area in virtual address space
to provide virtual address contiguity for e.g.
future memory extensions (memory hotplug). During
memory hotplug, if the hugepage mmap succeeds but
doesn't suffice EAL's requiriments, the EAL would
unmap this mapping straight away, leaving a hole in
its virtual memory area and making it available
to everyone. As EAL still thinks it owns the entire
region, it may try to mmap it later with MAP_FIXED,
possibly overriding a user's mapping that was made
in the meantime.
This patch ensures each hole is mapped back by EAL,
so that it won't be available to anyone else.
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Validate RTE_HASH_BUCKET_ENTRIES during compilation instead of
run time.
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Existing service functions allow us to stop a service, but doing so doesn't
guarantee that the service has finished running on a service core. This
commit introduces rte_service_may_be_active(), which returns whether the
service may be executing on one or more lcores currently, or definitely is
not.
The service core layer supports this function by setting a flag when
a service core is going to execute a service, and unsetting the flag when
the core is no longer able to run the service (its runstate becomes stopped
or the lcore is no longer mapped).
With this new function, applications can set a service's runstate to
stopped, then poll rte_service_may_be_active() until it returns false. At
that point, the service is quiesced.
Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
A constructor is usually declared with RTE_INIT* macros.
As it is a static function, no need to declare before its definition.
The macro is used directly in the function definition.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Add APIs that allow an application to query and reset the attributes of
a service lcore. Add one such new attribute, "loops", which is a
counter that tracks the number of times the service core has looped in
the service runner function. This is useful to applications that desire
a "liveness" check to make sure a service core is not stuck.
Signed-off-by: Erik Gabriel Carrillo <erik.g.carrillo@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
For cryptodev dynamic logging, conditional compilation of
debug logs is not actually required.
Signed-off-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
Reviewed-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
static logging macro RTE_PMD_DEBUG_TRACE is enabled with a few DEBUG
config options, including RTE_LIBRTE_ETHDEV_DEBUG
RTE_LIBRTE_ETHDEV_DEBUG is still used for data path logging, but all
ethdev logging switched to dynamic logging, so no need to enable static
logging macro for ethdev.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Previously, we were putting an exclusive lock to prevent secondary
processes spinning up while we are sending our messages. However,
using exclusive locks had an effect of disallowing multiple
simultaenous unrelated messages/requests being sent, which was
not the intention behind locking.
Fix it to put a shared lock on the directory. That way, we still
prevent secondary process initializations while sending data over
IPC, but allow multiple unrelated transmissions to proceed.
Fixes: 89f1fe7e6d ("eal: lock IPC directory on init and send")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Qi Zhang <qi.z.zhang@intel.com>
Rather than copy the log message, we can use a precision in the format
string given to syslog.
Signed-off-by: David Marchand <david.marchand@6wind.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
The functions
- vfio_get_container_fd
- vfio_get_group_fd
- vfio_get_group_no
have been renamed to
- rte_vfio_get_container_fd
- rte_vfio_get_group_fd
- rte_vfio_get_group_num
The old names are removed from the map file.
Fixes: 964b2f3bfb ("vfio: export some internal functions")
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Deprecate rte_eal_mbuf_default_mempool_ops(), it shall be replaced by
rte_mbuf_best_mempool_ops().
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Currently, memzone allocation with length set to 0 that are also
IOVA-contiguous is not supported. Document this limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Mem event and validator callbacks may not be supported under all
circumstances (such as when running in legacy memory mode, or on
FreeBSD), and this case needs to be handled by any code that will
use these callbacks. Spell this out more clearly, because it's not
immediately obvious that this is an expected use case.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
GCC 8.1 warned:
In function 'rte_rwlock_read_lock':
rte_rwlock.h:74:12: warning: conversion to 'uint32_t'
{aka 'unsigned int'} from 'int32_t' {aka 'int'} may
change the sign of the result [-Wsign-conversion]
x, x + 1);
^
rte_rwlock.h:74:17: warning: conversion to 'uint32_t'
{aka 'unsigned int'} from 'int' may change the sign
of the result [-Wsign-conversion]
x, x + 1);
~~^~~
In function 'rte_rwlock_write_lock':
rte_rwlock.h:110:15: warning: unsigned conversion
from 'int' to 'uint32_t' {aka 'unsigned int'}
changes value from '-1' to '4294967295' [-Wsign-conversion]
0, -1);
^~
Again in this case we are making explicit the exact cast
that was always happening implicitly. The patch does not
change the generated code.
The int32_t temp "x" is required to be signed to detect
a < 0 error condition from the lock status. Afterwards,
it has always been implicitly cast to uint32_t when it
is used in the arguments to rte_atomic32_cmpset()...
gcc8.1 objects to the implicit cast now and requires us
to cast it explicitly.
Fixes: af75078fec ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
GCC 8.1 warned:
rte_memcpy.h:793:2: note: in expansion of macro 'MOVEUNALIGNED_LEFT47'
MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
^~~~~~~~~~~~~~~~~~~~
rte_memcpy.h:649:51: warning: conversion from 'size_t'
{aka 'long unsigned int'} to 'int' may change value [-Wconversion]
case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;
^
rte_memcpy.h:616:15: note: in definition of macro 'MOVEUNALIGNED_LEFT47_IMM'
tmp = len;
^~~
rte_memcpy.h:793:2: note: in expansion of macro 'MOVEUNALIGNED_LEFT47'
MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
^~~~~~~~~~~~~~~~~~~~
rte_memcpy.h:618:13: warning: conversion to 'size_t'
{aka 'long unsigned int'} from 'int'
may change the sign of the result [-Wsign-conversion]
tmp -= len;
^~
rte_memcpy.h:649:16: note: in expansion of macro 'MOVEUNALIGNED_LEFT47_IMM'
case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;
^~~~~~~~~~~~~~~~~~~~~~~~
rte_memcpy.h:793:2: note: in expansion of macro 'MOVEUNALIGNED_LEFT47'
MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
^~~~~~~~~~~~~~~~~~~~
rte_memcpy.h:618:13: warning: conversion to 'size_t'
{aka 'long unsigned int'} from 'int'
may change the sign of the result [-Wsign-conversion]
tmp -= len;
^~
We can eliminate the problems by setting the type of tmp to
size_t in the first place.
Fixes: d35cc1fe6a ("eal/x86: revert select optimized memcpy at run-time")
Cc: stable@dpdk.org
Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Andy Green <andy@warmcat.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
loglevel set wrong when ":" is used as separator, like
--log-type="user:debug"
This is because fnmatch returns zero on success. Fixed fnmatch return
value check.
Fixes: 7f0bb634a1 ("log: add ability to match log type with globbing")
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Executable bit must be set on directories for normal users to enter them.
This patch addresses the inability to start DPDK applications as non-root
due to errors such as:
EAL: failed to bind /tmp/dpdk/rte/mp_socket: Permission denied
Fixes: 56236363b4 ("eal: add directory for runtime data")
Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
GCC 8.1 warns:
rte_byteorder.h: In function 'rte_constant_bswap16':
rte_byteorder.h:54:45: warning: conversion from
'int' to 'uint16_t' {aka 'short unsigned int'}
may change value [-Wconversion]
((((uint16_t)(v) & UINT16_C(0x00ff)) << 8) | \
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
(((uint16_t)(v) & UINT16_C(0xff00)) >> 8))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
rte_byteorder.h:126:9: note: in expansion of macro
'RTE_STATIC_BSWAP16'
return RTE_STATIC_BSWAP16(x);
^~~~~~~~~~~~~~~~~~
The other two sizes are going to be afflicted the
same, so get the same fix.
Fixes: b75667ef9f ("eal: add static endianness conversion macros")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
GCC 8.1 warns:
In function 'rte_srand':
rte_random.h:34:10:
warning: conversion to 'long int' from 'long unsigned int'
may change the sign of the result [-Wsign-conversion]
srand48((long unsigned int)seedval);
rte_random.h:51:8:
warning: conversion to 'uint64_t' {aka 'long unsigned int'}
from 'long int' may change the sign of the result
[-Wsign-conversion]
val = lrand48();
^~~~~~~
rte_random.h:53:6:
warning: conversion to 'long unsigned int' from 'long int'
may change the sign of the result [-Wsign-conversion]
val += lrand48();
Fixes: af75078fec ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
GCC 8.1 warns:
rte_string_fns.h: In function 'rte_strlcpy':
rte_string_fns.h:58:9:
warning: conversion to 'size_t' {aka 'long unsigned int'} from
'int' may change the sign of the result [-Wsign-conversion]
return snprintf(dst, size, "%s", src);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 5364de644a ("eal: support strlcpy function")
Signed-off-by: Andy Green <andy@warmcat.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
The intention of the original code was to create runtime data
directory as early as possible, however it was moved too early,
before the arguments were parsed, resulting in --file-prefix
option essentially not working.
Fix this by moving eal_create_runtime_dir() to after command
line arguments parsing.
Fixes: 56236363b4 ("eal: add directory for runtime data")
Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
Fix all calls to functions in eal_filesystem to produce paths
residing inside dedicated DPDK runtime directory. Leaving DPDK
runtime config in place as 3rd-party applications within the
DPDK ecosystem might rely on this path to determine whether
DPDK is running, so moving that will be postponed to the next
release cycle.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, during runtime, DPDK will store a bunch of files here
and there (in /var/run, /tmp or in $HOME). Fix it by creating a
DPDK-specific runtime directory, under which all runtime data
will be placed. The template for creating this runtime directory
is the following:
<base path>/dpdk/<DPDK prefix>/
Where <base path> is set to either "/var/run" if run as root, or
$XDG_RUNTIME_DIR if run as non-root, with a fallback to /tmp if
$XDG_RUNTIME_DIR is not defined. So, for example, if run as root,
by default all runtime data will be stored at /var/run/dpdk/rte/.
There is no equivalent of "mkdir -p", so we will be creating the
path step by step.
Nothing uses this new path yet, changes for that will come in
next commit.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
The original name for this path was not too descriptive and
confusing. Rename it to a more appropriate and descriptive name:
it stores data about hugepages, so name it eal_hugepage_data_path().
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
The define was a leftover from IVSHMEM library.
Fixes: c711ccb309 ("ivshmem: remove library and its EAL integration")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: David Marchand <david.marchand@6wind.com>
Remove version tag from experimental block in linker version scripts
(.map files).
That label is not used by linker and information only. It is useful
for version blocks but not useful for experimental block but confusing.
Removing those labels.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Currently, page deallocation might fail if allocator cannot get page
fd, which will leave VA space still mapped, and will also not mark
page as free.
Fix page deallocation function to always unmap space before trying
to get rid of the page itself, and always mark page as free even if
page deallocation failed.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Return value should be zero for success, but if unlock and unlink
have succeeded, return value was 1, which triggered failure message
in calling code.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Segment index was calculated incorrectly, causing free_seg to
attempt to free segments that do not exist.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Yong Liu <yong.liu@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
If total memory is already bigger than max memory, an underflow
will occur on subtraction. Fix it by simply stopping whenever
we already have amount of memory that is bigger than maximum.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, reserving a memzone with length set to 0 will not trigger
any memory allocations, and memzone will instead be looking through
already allocated memory only. Document this limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Size of malloc heap elements include overhead, which should not
be counted as part of memzone.
Fixes: fafcc11985 ("mem: rework memzone to be allocated by malloc")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Deallocation used the wrong function, which could have resulted in
race conditions because the function does not use locks internally.
Fixes: 1403f87d4f ("malloc: enable memory hotplug support")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
When we ask to reserve virtual areas, we usually include
alignment in the mapping size, and that memory ends up
being wasted. Wasting a gigabyte of VA space while trying to
reserve one gigabyte is pretty expensive on 32-bit, so after
we're done mapping, unmap unneeded space.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Mapping size is a 64-bit integer, but mmap() will accept size_t for
size mappings. A user could request a mapping with an alignment, which
would have overflown size_t, so check if (size + alignment) will
overflow size_t.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
The code aimed to pick and remember the value of
mempool ops name from EAL command line arguments does not
copy the string and remembers the pointer provided
by getopt_long() directly. The latter could be clobbered
later and result in reading wrong mbuf pool ops name
by rte_mempool library.
Typically, this flaw could be avoided by using strdup()
to remember the string value of the option.
Fixes: a103a97e71 ("eal: allow user to override default mempool driver")
Cc: stable@dpdk.org
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
In function 'rte_try_tm':
rte_spinlock.h:82:2:
warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
int retries = RTE_RTM_MAX_RETRIES;
Fixes: ba7468997e ("spinlock: add HTM lock elision for x86")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
rte_lcore.h: In function 'rte_lcore_index':
rte_lcore.h:122:14:
warning: conversion to 'int' from 'unsigned int' may change
the sign of the result [-Wsign-conversion]
lcore_id = rte_lcore_id();
Fixes: 5583037a79 ("eal: get relative core index")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
rte_common.h:416:9:
warning: conversion to 'uint32_t' {aka 'unsigned int'} from
'int' may change the sign of the result [-Wsign-conversion]
return __builtin_ctz(v);
^~~~~~~~~~~~~~~~
The builtin is defined to return int, but we want to
return it as uint32_t. Its only defined valid return
values are positive integers or zero, which is OK for
uint32_t. So just add an explicit cast.
Fixes: 03f6bced5b ("eal: use intrinsic function")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
It may be useful to pass arbitrary data to the callback (such
as device pointers), so add this to the mem event callback API.
Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Currently, when deallocating pages, malloc will fixup other
elements' headers if there is not enough space to store a full
element in leftover space. This leads to race conditions because
there are some functions that check for pad size with an unlocked
heap, expecting pad size to be constant.
Fix it by being more conservative and only freeing pages when
there is enough space before and after the page to store a free
element.
Fixes: 1403f87d4f ("malloc: enable memory hotplug support")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
The pad value is not used unless element is in pad state, but it
will show up in heap dumps and may be confusing.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
After below commit, we encounter some strange issue:
1) Dead lock as described here:
http://dpdk.org/ml/archives/dev/2018-April/099806.html
2) SIGSEGV issue when starting a testpmd in VM.
Considering below commit changes to use dynamic memory instead of
stack for memory barrier, we doubt it's caused by use-after-free.
Fixes: 3d09a6e26d ("eal: fix threads block on barrier")
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reported-by: Lei Yao <lei.a.yao@intel.com>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Suggested-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
params is not freed if pthread_create() fails. The fix is
straight-forward.
Fixes: 3d09a6e26d ("eal: fix threads block on barrier")
Reported-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
When heap initializes, we need to add already allocated segments
onto the heap. However, in doing that, we never increased total
heap size. Fix it by adding segment length to total heap length
when initializing the heap.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
At hugepage info initialization, EAL takes out a write lock on
hugetlbfs directories, and drops it after the memory init is
finished. However, in non-legacy mode, if "-m" or "--socket-mem"
switches are passed, this leads to a deadlock because EAL tries
to allocate pages (and thus take out a write lock on hugedir)
while still holding a separate hugedir write lock in EAL.
Fix it by checking if write lock in hugepage info is active, and
not trying to lock the directory if the hugedir fd is valid.
Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Shahaf Shuler <shahafs@mellanox.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't as useful because we still need to grow and shrink it.
However, according to fcntl()'s Ubuntu manpage [1], semantics of
fcntl() locks have a giant oversight:
This interface follows the completely stupid semantics of System
V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
locks associated with a file for a given process are removed
when any file descriptor for that file is closed by that process.
This semantic means that applications must be aware of any files
that a subroutine library may access.
Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fd's) will drop the lock completely.
So, in this commit, we will be reverting back to using flock() locks
everywhere. However, that still leaves the problem of locking parts
of a memseg list file in single file segments mode, and we will be
solving it with creating separate lock files per each page, and
tracking those with flock().
We will also be removing all of this tailq business and replacing it
with a simple array - saving a few bytes is not worth the extra
hassle of dealing with pointers and potential memory allocation
failures. Also, remove the tailq lock since it is not needed - these
fd lists are per-process, and within a given process, it is always
only one thread handling access to hugetlbfs.
So, first one to allocate a segment will create a lockfile, and put
a shared lock on it. When we're shrinking the page file, we will be
trying to take out a write lock on that lockfile, which would fail if
any other process is holding onto the lockfile as well. This way, we
can know if we can shrink the segment file. Also, if no other locks
are found in the lock list for a given memseg list, the memseg list
fd is automatically closed.
One other thing to note is, according to flock() Ubuntu manpage [2],
upgrading the lock from shared to exclusive is implemented by dropping
and reacquiring the lock, which is not atomic and thus would have
created race conditions. So, on attempting to perform operations in
hugetlbfs, we will take out a writelock on hugetlbfs directory, so
that only one process could perform hugetlbfs operations concurrently.
[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66 ("eal: add single file segments option")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Currently, memseg lists for secondary process are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
For non-legacy mode, we are preallocating space for hugepages, so
we know in advance which pages we will be able to allocate, and
which we won't. However, the init procedure was using hugepage
counts gathered from sysfs and paid no attention to hugepage
sizes that were actually available for reservation, and failed
on attempts to reserve unavailable pages.
Fix this by limiting total page counts by number of pages
actually preallocated.
Also, VA preallocate procedure only looks at mountpoints that are
available, and expects pages to exist if a mountpoint exists. That
might not necessarily be the case, so also check if there are
hugepages available for a particular page size on a particular
NUMA node.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
Previously, if we couldn't preallocate VA space on 32-bit for
one page size, we simply bailed out, even though we could've
tried allocating VA space with other page sizes.
For example, if user had both 1G and 2M pages enabled, and
has asked DPDK to allocate memory on both sockets, DPDK
would've tried to allocate VA space for 1x1G page on both
sockets, failed and never tried again, even though it
could've allocated the same 1G of VA space for 512x2M pages.
Fix this by retrying with different page sizes if VA space
reservation failed.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
32-bit mode has an upper limit on amount of VA space it can preallocate,
but the original implementation used the wrong constant, resulting in
failure to initialize due to integer overflow. Fix it by using the
correct constant.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
Previous code checked for both first/last elements being NULL,
but if they weren't, the expectation was that they're both
non-NULL, which will be the case under normal conditions, but
may not be the case due to heap structure corruption.
Coverity issue: 272566
Fixes: bb372060da ("malloc: make heap a doubly-linked list")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Technically, while the pointer would've been invalid if msl_idx
were invalid, we wouldn't have actually attempted to access the
pointer until verifying the index. Fix it by moving array access
to after we've verified validity of the index.
Coverity issue: 272574
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
If user has specified a flag to unmap the area right after mapping it,
we were passing an already-unmapped pointer to RTE_LOG. This is not an
issue since RTE_LOG doesn't actually dereference the pointer, but fix
it anyway by moving call to RTE_LOG to before unmap.
Coverity issue: 272584
Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Coverity reports these lines as having no effect. Technically, we do
want for those lines to have no effect, however they would've likely
been optimized out. Add volatile qualifiers to ensure the code has
effects.
Coverity issue: 272608
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Previously, if mmap failed to map page address at requested
address, we were attempting to unmap the wrong address. Fix it
by unmapping our actual mapped address, and jump further to
avoid unmapping memory that is not allocated.
Coverity issue: 272602
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Previous code had an old rebase leftover from the time when
oldpolicy was an actual int, instead of a pointer. Fix it to
do comparison with dereferencing the pointer.
Coverity issue: 272589
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Normally, tailq entry should have a valid fd by the time we attempt
to map the segment. However, in case it doesn't, we're leaking fd,
so fix it.
Coverity issue: 272570
Fixes: 2a04139f66 ("eal: add single file segments option")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
We close fd if we managed to find it in the list of allocated
segment lists (which should always be the case under normal
conditions), but if we didn't, the fd was leaking. Close it if
we couldn't find it in the segment list. This is not an issue
as if the segment is zero length, we're getting rid of it
anyway, so there's no harm in not storing the fd anywhere.
Coverity issue: 272568
Fixes: 2a04139f66 ("eal: add single file segments option")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
We were closing descriptor before checking if mapping has
failed, but if it did, we did a second close afterwards. Fix
it by moving closing descriptor to after we've done all error
checks.
Coverity issue: 272560
Fixes: 2a04139f66 ("eal: add single file segments option")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
resize_hugefile() returns either 0 (which indicates success) or -1
(which indicates failure). We failed to check the success as we
use --single-file-segments option.
Fixes: 2a04139f66 ("eal: add single file segments option")
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Below commit introduced pthread barrier for synchronization.
But two IPC threads block on the barrier, and never wake up.
(gdb) bt
#0 futex_wait (private=0, expected=0, futex_word=0x7fffffffcff4)
at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1 futex_wait_simple (private=0, expected=0, futex_word=0x7fffffffcff4)
at ../sysdeps/nptl/futex-internal.h:135
#2 __pthread_barrier_wait (barrier=0x7fffffffcff0) at pthread_barrier_wait.c:184
#3 rte_thread_init (arg=0x7fffffffcfe0)
at ../dpdk/lib/librte_eal/common/eal_common_thread.c:160
#4 start_thread (arg=0x7ffff6ecf700) at pthread_create.c:333
#5 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Through analysis, we find the barrier defined on the stack could be the
root cause. This patch will change to use heap memory as the barrier.
Fixes: d651ee4919 ("eal: set affinity for control threads")
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
This patch adds APIs to support container create/destroy and device
bind/unbind with a container. It also provides API for IOMMU programing
on a specified container.
A driver could use "rte_vfio_container_create" helper to create a new
container from eal, use "rte_vfio_container_group_bind" to bind a device
to the newly created container. During rte_vfio_setup_device the container
bound with the device will be used for IOMMU setup.
Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently eal vfio framework binds vfio group fd to the default
container fd during rte_vfio_setup_device, while in some cases,
e.g. vDPA (vhost data path acceleration), we want to put vfio group
to a separate container and program IOMMU via this container.
This patch extends the vfio_config structure to contain per-container
user_mem_maps and defines an array of vfio_config. The next patch will
base on this to add container API.
Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
The auxiliary vector read is implemented only for Linux.
It could be done with procstat_getauxv() for FreeBSD.
Since the commit below, the auxiliary vector functions
are compiled for every architectures, including x86
which is tested with FreeBSD.
This patch is moving the Linux implementation in Linux directory,
and adding a fake/empty implementation for FreeBSD.
Fixes: 2ed9bf3307 ("eal: abstract away the auxiliary vector")
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
The fake getauxval function does not use its parameter.
So the compiler raised this error:
lib/librte_eal/common/eal_common_cpuflags.c:25:25: error:
unused parameter 'type'
Fixes: 2ed9bf3307 ("eal: abstract away the auxiliary vector")
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
This message looks suspicious and seen on healthy testpmd.
EAL: WARNING: Master core has no memory on local socket!
The message is wrong: the master lcore is 0 and its socket is 0
and there are multiple available memory segments on socket 0.
At that point in the startup process, the count value is zero,
meaning they are not used yet so the check_socket gets confused.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
rte_lcore_has_role() returns 0 if role of lcore matches requested
role. The return value of the API is confusing, and this is a known
problem with a deprecation notice announcing the change to more
intuitive semantics:
Commit 064518f68d ("doc: announce EAL API change to lcore role function")
Implement changes announced in the deprecation notice, and remove it.
Also, fix usages of this API to reflect the change. Control thread patches
expected new behavior and were broken before, now they are fixed as well.
Fixes: d651ee4919 ("eal: set affinity for control threads")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Erik Gabriel Carrillo <erik.g.carrillo@intel.com>
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
This commit removes the experimental tags from the
service cores functions, they now become part of the
main DPDK API/ABI.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Coverity was complaining about not checking result of call to
fcntl() for unlocking the file. Disregarding the fact that error
value returned from fcntl() unlock call is highly unlikely in the
first place, we are subsequently calling close() on that same fd,
which will drop the lock, which makes call to fcntl() unnecessary.
Fix this by removing a call to fcntl() altogether.
Coverity issue: 272607
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Regular expressions are not the best way to match a hierarchical
pattern like dynamic log levels. And the separator for dynamic
log levels is period which is the regex wildcard character.
A better solution is to use filename matching 'globbing' so
that log levels match like file paths. For compatibility,
use colon to separate pattern match style arguments. For
example:
--log-level 'pmd.net.virtio.*:debug'
This also makes the documentation match what really happens
internally.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
We don't want format of eal log level saved values to be visible
in ABI. Move to private storage in eal_common_log.
Includes minor optimization. Compile the regular expression for
each log match once, rather than each time it is used.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Rather than attempting to load the contents of the auxv directly,
prefer to use an exposed API - and if that doesn't exist then attempt
to load the vector. This is because on some systems, when a user
is downgraded, the /proc/self/auxv file retains the old ownership
and permissions. The original method of /proc/self/auxv is retained.
This also removes a potential abort() in the code when compiled with
NDEBUG. A quick parse of the code shows that many (if not all) of
the CPU flag parsing isn't used internally, so it should be okay.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Add the priority RTE_PRIORITY_LAST, used for initialization routines
meant to be run after all other constructors.
This priority becomes the default priority for all DPDK constructors.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Build a central list to quickly see each used priorities for
constructors, allowing to verify that they are both above 100 and in the
proper order.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
The previous symbols were deprecated for two releases.
They are now marked as such and cannot be used anymore.
They are replaced by ones respecting the new namespace that are marked
experimental.
As a result, eth_dev attach and detach are slightly reworked to follow
the changes.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
rte_eal_devargs is useless, rte_devargs is sufficient.
Only experimental functions are changed for now.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
rte_eal_devargs_parse can be used by EAL subsystems, drivers,
applications alike.
Device parameters may be presented with different structure each time;
as a single declaration string or several strings each describing
different parts of the declaration.
To simplify the use of this parsing facility, its parameters are made
variadic.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Initially, rte_devargs was meant to be populated once and sometimes
accessed, then never emptied.
With the new hotplug functionality having better standing, new usage
appeared with repeated addition of devices and their subsequent removal.
Exposing devargs_list pushed bus drivers and libraries to be careless
and inconsistent in their memory management. Making it private will
allow to rationalize this part of the EAL and ensure that fewer memory
leaks occur during operations.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
In preparation to making devargs_list private.
Bus drivers generally need to access rte_devargs pertaining to their
operations. This match is a common operation for bus drivers.
Add a new accessor for the rte_devargs list.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
The management threads must not bother the dataplane or service cores.
Set the affinity of these threads accordingly.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
To avoid code duplication, add a parameter to rte_ctrl_thread_create()
to specify the name of the thread.
This requires to add a wrapper for the thread start routine in
rte_thread_init(), which will first wait that the thread is configured.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
Many parts of dpdk use their own management threads. Introduce a new
wrapper for thread creation that will be extended in next commits to set
the name and affinity.
To be consistent with other DPDK APIs, the return value is negative in
case of error, which was not the case for pthread_create().
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
Only a cosmetic change: the *_LEN defines are already used
when defining the buffer. Using sizeof() ensures that the length
stays consistent, even if the definition is modified.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
Adjust the init sequence: put mp channel init before bus scan
so that we can init the vdev bus through mp channel in the
secondary process before the bus scan.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Reviewed-by: Qi Zhang <qi.z.zhang@intel.com>
In original implementation, timeout event for an async request
will be ignored. As a result, an async request will never
trigger the action if it cannot receive any reply any more.
We fix this by counting timeout as a processed reply.
Fixes: f05e26051c ("eal: add IPC asynchronous request")
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Following below commit, we change some internal function and variable
names:
commit ce3a731235 ("eal: rename IPC request as synchronous one")
Also use calloc to supersede malloc + memset for code clean up.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
gettimeofday() returning a negative value is highly unlikely,
but if it ever happens, we will exit without unlocking the mutex.
Arguably at that point we'll have bigger problems, but fix this
issue anyway.
Coverity issue: 272595
Fixes: f05e26051c ("eal: add IPC asynchronous request")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
This also silences (or should silence) a few Coverity false
positives where we used strcpy before (Coverity complained
about not checking buffer size, but source buffers were
always known to be sized correctly).
Coverity issue: 260407, 272565, 272582
Fixes: bacaa27540 ("eal: add channel for multi-process communication")
Fixes: f05e26051c ("eal: add IPC asynchronous request")
Fixes: 783b6e5497 ("eal: add synchronous multi-process communication")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
We get pointer to mask before we check if fbarray is NULL. Fix
by moving getting mask pointer to until after NULL check.
Coverity issue: 272579
Fixes: c44d09811b ("eal: add shared indexed file-backed array")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
fbarray stores its data in a shared file, which is not hidden.
This leads to polluting user's HOME directory with visible
files when running DPDK as non-root. Change fbarray to always
create hidden files by default.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
A previously mapped region is skipped during the search, leading to
DMA unmap fails.
This patch fixes it and rewords the comment.
Fixes: 73a6390859 ("vfio: allow to map other memory regions")
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Support of strlcpy has recently been added to DPDK.
This replacement has been generated by the coccinelle script:
devtools/cocci.sh devtools/cocci/strlcpy.cocci
Fixes: 0d0f478d04 ("eal/linux: add uevent parse and process")
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
The hugedir returned by get_hugepage_dir is allocated by strdup
but not released. Replace snprintf with a more suitable strlcpy.
Coverity issue: 272585
Fixes: cb97d93e9d ("mem: share hugepage info primary and secondary")
Signed-off-by: Yangchao Zhou <zhouyates@gmail.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Sometimes gcc does not inline the function despite keyword *inline*,
we observe rte_movX is not inline when doing performance profiling,
so use *always_inline* keyword to force gcc to inline the function.
Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Previously, vfio uses its own private channel for the secondary
process to get container fd and group fd from the primary process.
This patch changes to use the generic mp channel.
Test:
1. Bind two NICs to vfio-pci.
2. Start the primary and secondary process.
$ (symmetric_mp) -c 2 -- -p 3 --num-procs=2 --proc-id=0
$ (symmetric_mp) -c 4 --proc-type=auto -- -p 3 \
--num-procs=2 --proc-id=1
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
While debugging startup issues encountered with Clang (see "eal: fix
undefined behavior in fbarray"), I noticed that fbarray stores indices,
sizes and masks on signed integers involved in bitwise operations.
Such operations almost invariably cause undefined behavior with values that
cannot be represented by the result type, as is often the case with
bit-masks and left-shifts.
This patch replaces them with unsigned integers as a safety measure and
promotes a few internal variables to larger types for consistency.
Coverity issue: 272598, 272599
Fixes: c44d09811b ("eal: add shared indexed file-backed array")
Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
According to GCC documentation [1], the __builtin_clz() family of functions
yield undefined behavior when fed a zero value. There is one instance in
the fbarray code where this can occur.
Clang (at least version 3.8.0-2ubuntu4) seems much more sensitive to this
than GCC and yields random results when compiling optimized code, as shown
below:
#include <stdio.h>
int main(void)
{
volatile unsigned long long moo;
int x;
moo = 0;
x = __builtin_clzll(moo);
printf("%d\n", x);
return 0;
}
$ gcc -O3 -o test test.c && ./test
63
$ clang -O3 -o test test.c && ./test
1742715559
$ clang -O0 -o test test.c && ./test
63
Even 63 can be considered an unexpected result given the number of leading
zeroes should be the full width of the underlying type, i.e. 64.
In practice it causes find_next_n() to sometimes return negative values
interpreted as errors by caller functions, which prevents DPDK applications
from starting due to inability to find free memory segments:
# testpmd [...]
EAL: Detected 32 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Multi-process socket /var/run/.rte_unix
EAL: eal_memalloc_alloc_seg_bulk(): couldn't find suitable memseg_list
EAL: FATAL: Cannot init memory
EAL: Cannot init memory
PANIC in main():
Cannot init EAL
4: [./build/app/testpmd(_start+0x29) [0x462289]]
3: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)
[0x7f19d54fc830]]
2: [./build/app/testpmd(main+0x8a3) [0x466193]]
1: [./build/app/testpmd(__rte_panic+0xd6) [0x4efaa6]]
Aborted
This problem appears with commit 66cc45e293 ("mem: replace memseg with
memseg lists") however the root cause is introduced by a prior patch.
[1] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
Fixes: c44d09811b ("eal: add shared indexed file-backed array")
Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
We lock the hotplug during init, but do not unlock it if we couldn't
register multiprocess callbacks. Add the missing unlock.
Fixes: 07dcbfe010 ("malloc: support multiprocess memory hotplug")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Earlier fix for race condition introduced a bug where mutex
wasn't unlocked if message failed to be sent. Fix all of this
by moving locking out of mp_request_sync() altogether.
Fixes: da5957821b ("eal: fix race condition in IPC request")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
We are trying to notify sender that response from current process
should be ignored, but we didn't specify which request this response
was for. Fix by copying request name from the original message.
Fixes: 579a4ccc34 ("eal: ignore IPC messages until init is complete")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
Previously, we were removing request from the list only if we
have succeeded to send it. This resulted in leaving an invalid
pointer in the request list.
Fix this by only adding new requests to the request list if we
have succeeded in sending them.
Fixes: f05e26051c ("eal: add IPC asynchronous request")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
Previously, we were adding synchronous requests to request list, we
were doing it after checking if request existed. However, we only
removed the request from the request list if we have succeeded in
sending the request. In case of failed request send, we left an
invalid pointer in the request list.
Fix this by only adding request to the list once we succeed in
sending it.
Fixes: 783b6e5497 ("eal: add synchronous multi-process communication")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
EAL did not stop processing further asynchronous requests on
encountering a request that should trigger the callback. This
resulted in erasing valid requests but not triggering them.
Fix this by stopping the loop once we have a request that
can trigger the callback. Once triggered, we go back to scanning
the request queue until there are no more callbacks to trigger.
Fixes: f05e26051c ("eal: add IPC asynchronous request")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
Previously, VFIO functions were not compiled in and exported if
VFIO compilation was disabled. Fix this by actually compiling
all of the functions unconditionally, and provide missing
prototypes on Linux.
Fixes: 279b581c89 ("vfio: expose functions")
Fixes: 73a6390859 ("vfio: allow to map other memory regions")
Fixes: 964b2f3bfb ("vfio: export some internal functions")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
In order to handle the uevent which has been detected from the kernel
side, add uevent parse and process function to translate the uevent into
device event, which user has subscribed to monitor.
Signed-off-by: Jeff Guo <jia.guo@intel.com>
Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com>
This patch aims to add a general device event monitor framework at
EAL device layer, for device hotplug awareness and actions adopted
accordingly. It could also expand for all other types of device event
monitor, but not in this scope at the stage.
To get started, users firstly call below new added APIs to enable/disable
the device event monitor mechanism:
- rte_dev_event_monitor_start
- rte_dev_event_monitor_stop
Then users shell register or unregister callbacks through the new added
APIs. Callbacks can be some device specific, or for all devices.
-rte_dev_event_callback_register
-rte_dev_event_callback_unregister
Use hotplug case for example, when device hotplug insertion or hotplug
removal, we will get notified from kernel, then call user's callbacks
accordingly to handle it, such as detach or attach the device from the
bus, and could benefit further fail-safe or live-migration.
Signed-off-by: Jeff Guo <jia.guo@intel.com>
Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com>
Add new interrupt handle type of RTE_INTR_HANDLE_DEV_EVENT, for
device event interrupt monitor.
Signed-off-by: Jeff Guo <jia.guo@intel.com>
Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com>
We only need to perform DMA mapping for first device in first group.
At the time of mapping, we haven't yet added the device into the group,
so the count is expected to be zero.
Fixes: 810bfa64c6 ("vfio: fix index for tracking devices in a group")
Fixes: a9c349e3a1 ("vfio: fix device unplug when several devices per group")
Fixes: 94c0776b1b ("vfio: support hotplug")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
This patch moves some of the internal vfio functions from
eal_vfio.h to rte_vfio.h for common uses with "rte_" prefix.
This patch also change the FSLMC bus usages from the internal
VFIO functions to external ones with "rte_" prefix
Signed-off-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
https://dpdk.org/tracker/show_bug.cgi?id=18
Indicated that several mmap call sites in the [linux|bsd]app eal code
set fd that was not -1 in their calls while using MAP_ANONYMOUS. While
probably not a huge deal, the man page does say the fd should be -1 for
portability, as some implementations don't ignore fd as they should for
MAP_ANONYMOUS.
Suggested-by: Solal Pirelli <solal.pirelli@gmail.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Use __atomic_exchange_n instead of __atomic_exchange_(2/4/8).
The error was:
include/generic/rte_atomic.h:215:9: error:
implicit declaration of function '__atomic_exchange_2'
is invalid in C99
include/generic/rte_atomic.h:494:9: error:
implicit declaration of function '__atomic_exchange_4'
is invalid in C99
include/generic/rte_atomic.h:772:9: error:
implicit declaration of function '__atomic_exchange_8'
is invalid in C99
Fixes: ff2863570f ("eal: introduce atomic exchange operation")
Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
It is common sense to expect for DPDK process to not deallocate any
pages that were preallocated by "-m" or "--socket-mem" flags - yet,
currently, DPDK memory subsystem will do exactly that once it finds
that the pages are unused.
Fix this by marking pages as unfreebale, and preventing malloc from
ever trying to free them.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Before allocating a new page, give a chance to the user to
allow or deny allocation via callbacks.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This API will enable application to register for notifications
on page allocations that are about to happen, giving the application
a chance to allow or deny the allocation when total memory utilization
as a result would be above specified limit on specified socket.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Now that every other piece of the puzzle is in place, enable non-legacy
init mode.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Enable callbacks on first device attach, disable callbacks
on last device attach.
PPC64 IOMMU does memseg walk, which will cause a deadlock on
trying to do it inside a callback, so provide a local,
thread-unsafe copy of memseg walk.
PPC64 IOMMU also may remap the entire memory map for DMA while
adding new elements to it, so change user map list lock to a
recursive lock. That way, we can safely enter rte_vfio_dma_map(),
lock the user map list, enter DMA mapping function and lock the
list again (for reading previously existing maps).
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Callbacks will be triggered just after allocation and just
before deallocation, to ensure that memory address space
referenced in the callback is always valid by the time
callback is called.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Each process will have its own callbacks. Callbacks will indicate
whether it's allocation and deallocation that's happened, and will
also provide start VA address and length of allocated block.
Since memory hotplug isn't supported on FreeBSD and in legacy mem
mode, it will not be possible to register them in either.
Callbacks are called whenever something happens to the memory map of
current process, therefore at those times memory hotplug subsystem
is write-locked, which leads to deadlocks on attempt to use these
functions. Document the limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This enables multiprocess synchronization for memory hotplug
requests at runtime (as opposed to initialization).
Basic workflow is the following. Primary process always does initial
mapping and unmapping, and secondary processes always follow primary
page map. Only one allocation request can be active at any one time.
When primary allocates memory, it ensures that all other processes
have allocated the same set of hugepages successfully, otherwise
any allocations made are being rolled back, and heap is freed back.
Heap is locked throughout the process, and there is also a global
memory hotplug lock, so no race conditions can happen.
When primary frees memory, it frees the heap, deallocates affected
pages, and notifies other processes of deallocations. Since heap is
freed from that memory chunk, the area basically becomes invisible
to other processes even if they happen to fail to unmap that
specific set of pages, so it's completely safe to ignore results of
sync requests.
When secondary allocates memory, it does not do so by itself.
Instead, it sends a request to primary process to try and allocate
pages of specified size and on specified socket, such that a
specified heap allocation request could complete. Primary process
then sends all secondaries (including the requestor) a separate
notification of allocated pages, and expects all secondary
processes to report success before considering pages as "allocated".
Only after primary process ensures that all memory has been
successfully allocated in all secondary process, it will respond
positively to the initial request, and let secondary proceed with
the allocation. Since the heap now has memory that can satisfy
allocation request, and it was locked all this time (so no other
allocations could take place), secondary process will be able to
allocate memory from the heap.
When secondary frees memory, it hides pages to be deallocated from
the heap. Then, it sends a deallocation request to primary process,
so that it deallocates pages itself, and then sends a separate sync
request to all other processes (including the requestor) to unmap
the same pages. This way, even if secondary fails to notify other
processes of this deallocation, that memory will become invisible
to other processes, and will not be allocated from again.
So, to summarize: address space will only become part of the heap
if primary process can ensure that all other processes have
allocated this memory successfully. If anything goes wrong, the
worst thing that could happen is that a page will "leak" and will
not be available to neither DPDK nor the system, as some process
will still hold onto it. It's not an actual leak, as we can account
for the page - it's just that none of the processes will be able
to use this page for anything useful, until it gets allocated from
by the primary.
Due to underlying DPDK IPC implementation being single-threaded,
some asynchronous magic had to be done, as we need to complete
several requests before we can definitively allow secondary process
to use allocated memory (namely, it has to be present in all other
secondary processes before it can be used). Additionally, only
one allocation request is allowed to be submitted at once.
Memory allocation requests are only allowed when there are no
secondary processes currently initializing. To enforce that,
a shared rwlock is used, that is set to read lock on init (so that
several secondaries could initialize concurrently), and write lock
on making allocation requests (so that either secondary init will
have to wait, or allocation request will have to wait until all
processes have initialized).
Any other function that wishes to iterate over memory or prevent
allocations should be using memory hotplug lock.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This set of changes enables rte_malloc to allocate and free memory
as needed. Currently, it is disabled because legacy mem mode is
enabled unconditionally.
The way it works is, first malloc checks if there is enough memory
already allocated to satisfy user's request. If there isn't, we try
and allocate more memory. The reverse happens with free - we free
an element, check its size (including free element merging due to
adjacency) and see if it's bigger than hugepage size and that its
start and end span a hugepage or more. Then we remove the area from
malloc heap (adjusting element lengths where appropriate), and
deallocate the page.
For legacy mode, runtime alloc/free of pages is disabled.
It is worth noting that memseg lists are being sorted by page size,
and that we try our best to satisfy user's request. That is, if
the user requests an element from a 2MB page memory, we will check
if we can satisfy that request from existing memory, if not we try
and allocate more 2MB pages. If that fails and user also specified
a "size is hint" flag, we then check other page sizes and try to
allocate from there. If that fails too, then, depending on flags,
we may try allocating from other sockets. In other words, we try
our best to give the user what they asked for, but going to other
sockets is last resort - first we try to allocate more memory on
the same socket.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Since we are going to need to map hugepages in both primary and
secondary processes, we need to know where we should look for
hugetlbfs mountpoints. So, share those with secondary processes,
and map them on init.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Add a new (non-legacy) memory init path for EAL. It uses the
new memory hotplug facilities.
If no -m or --socket-mem switches were specified, the new init
will not allocate anything, whereas if those switches were passed,
appropriate amounts of pages would be requested, just like for
legacy init.
Allocated pages will be physically discontiguous (or rather, they're
not guaranteed to be physically contiguous - they may still be so by
accident) unless RTE_IOVA_VA mode is used.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
For non-legacy memory init mode, instead of looking at generic
sysfs path, look at sysfs paths pertaining to each NUMA node
for hugepage counts. Note that per-NUMA node path does not
provide information regarding reserved pages, so we might not
get the best info from these paths, but this saves us from the
whole mapping/remapping business before we're actually able to
tell which page is on which socket, because we no longer require
our memory to be physically contiguous.
Legacy memory init will not use this.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
In preparation for implementing multiprocess support, we are adding
a version number to memseg lists. We will not need any locks, because
memory hotplug will have a global lock (so any time memory map and
thus version number might change, we will already be holding a lock).
There are two ways of implementing multiprocess support for memory
hotplug: either all information about mapped memory is shared
between processes, and secondary processes simply attempt to
map/unmap memory based on requests from the primary, or secondary
processes store their own maps and only check if they are in sync
with the primary process' maps.
This implementation will opt for the latter option: primary process
shared mappings will be authoritative, and each secondary process
will use its own interal view of mapped memory, and will attempt
to synchronize on these mappings using versioning.
Under this model, only primary process will decide which pages get
mapped, and secondary processes will only copy primary's page
maps and get notified of the changes via IPC mechanism (coming
in later commits).
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
For now, memory is always contiguous because legacy mem mode is
enabled unconditionally, but this function will be helpful down
the line when we implement support for allocating physically
non-contiguous memory. We can no longer guarantee physically
contiguous memory unless we're in legacy or IOVA_AS_VA mode, but
we can certainly try and see if we succeed.
In addition, this would be useful for e.g. PMD's who may allocate
chunks that are smaller than the pagesize, but they must not cross
the page boundary, in which case we will be able to accommodate
that request. This function will also support non-hugepage memory.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Currently, DPDK stores all pages as separate files in hugetlbfs.
This option will allow storing all pages in one file (one file
per memseg list).
We do this by using fallocate() calls on FreeBSD, however this is
only supported on fairly recent (4.3+) kernels, so ftruncate()
fallback is provided to grow (but not shrink) hugepage files.
Naming scheme is deterministic, so both primary and secondary
processes will be able to easily map needed files and offsets.
For multi-file segments, we can close fd's right away. For
single-file segments, we can reuse the same fd and reduce the
amount of fd's needed to map/use hugepages. However, we need to
store the fd's somewhere, so we add a tailq.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This isn't used anywhere yet, but the support is now there. Also,
adding cleanup to allocation procedures, so that if we fail to
allocate everything we asked for, we can free all of it back.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Nothing uses this code yet. The bulk of it is copied from old
memory allocation code (linuxapp eal_memory.c). We provide an
EAL-internal API to allocate either one page or multiple pages,
guaranteeing that we'll get contiguous VA for all of the pages
that we requested.
Not supported on FreeBSD.
Locking is done via fcntl() because that way, when it comes to
taking out write locks or unlocking on deallocation, we don't
have to keep original fd's around. Plus, using fcntl() gives us
ability to lock parts of a file, which is useful for single-file
segments, which are coming down the line.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
It's there, so we might as well use it. Some operations will be
sped up by that.
Since we have to allocate an fbarray for memzones, we have to do
it before we initialize memory subsystem, because that, in
secondary processes, will (later) allocate more fbarrays than the
primary process, which will result in inability to attach to
memzone fbarray if we do it after the fact.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Before, we were aggregating multiple pages into one memseg, so the
number of memsegs was small. Now, each page gets its own memseg,
so the list of memsegs is huge. To accommodate the new memseg list
size and to keep the under-the-hood workings sane, the memseg list
is now not just a single list, but multiple lists. To be precise,
each hugepage size available on the system gets one or more memseg
lists, per socket.
In order to support dynamic memory allocation, we reserve all
memory in advance (unless we're in 32-bit legacy mode, in which
case we do not preallocate memory). As in, we do an anonymous
mmap() of the entire maximum size of memory per hugepage size, per
socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or
RTE_MAX_MEM_MB_PER_TYPE megabytes worth of memory, whichever is the
smaller one), split over multiple lists (which are limited to
either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_MB_PER_LIST
megabytes per list, whichever is the smaller one). There is also
a global limit of CONFIG_RTE_MAX_MEM_MB megabytes, which is mainly
used for 32-bit targets to limit amounts of preallocated memory,
but can be used to place an upper limit on total amount of VA
memory that can be allocated by DPDK application.
So, for each hugepage size, we get (by default) up to 128G worth
of memory, per socket, split into chunks of up to 32G in size.
The address space is claimed at the start, in eal_common_memory.c.
The actual page allocation code is in eal_memalloc.c (Linux-only),
and largely consists of copied EAL memory init code.
Pages in the list are also indexed by address. That is, in order
to figure out where the page belongs, one can simply look at base
address for a memseg list. Similarly, figuring out IOVA address
of a memzone is a matter of finding the right memseg list, getting
offset and dividing by page size to get the appropriate memseg.
This commit also removes rte_eal_dump_physmem_layout() call,
according to deprecation notice [1], and removes that deprecation
notice as well.
On 32-bit targets due to limited VA space, DPDK will no longer
spread memory to different sockets like before. Instead, it will
(by default) allocate all of the memory on socket where master
lcore is. To override this behavior, --socket-mem must be used.
The rest of the changes are really ripple effects from the memseg
change - heap changes, compile fixes, and rewrites to support
fbarray-backed memseg lists. Due to earlier switch to _walk()
functions, most of the changes are simple fixes, however some
of the _walk() calls were switched to memseg list walk, where
it made sense to do so.
Additionally, we are also switching locks from flock() to fcntl().
Down the line, we will be introducing single-file segments option,
and we cannot use flock() locks to lock parts of the file. Therefore,
we will use fcntl() locks for legacy mem as well, in case someone is
unfortunate enough to accidentally start legacy mem primary process
alongside an already working non-legacy mem-based primary process.
[1] http://dpdk.org/dev/patchwork/patch/34002/
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
rte_fbarray is a simple indexed array stored in shared memory
via mapping files into memory. Rationale for its existence is the
following: since we are going to map memory page-by-page, there
could be quite a lot of memory segments to keep track of (for
smaller page sizes, page count can easily reach thousands). We
can't really make page lists truly dynamic and infinitely expandable,
because that involves reallocating memory (which is a big no-no in
multiprocess). What we can do instead is have a maximum capacity as
something really, really large, and decide at allocation time how
big the array is going to be. We map the entire file into memory,
which makes it possible to use fbarray as shared memory, provided
the structure itself is allocated in shared memory. Per-fbarray
locking is also used to avoid index data races (but not contents
data races - that is up to user application to synchronize).
In addition, in understanding that we will frequently need to scan
this array for free space and iterating over array linearly can
become slow, rte_fbarray provides facilities to index array's
usage. The following use cases are covered:
- find next free/used slot (useful either for adding new elements
to fbarray, or walking the list)
- find starting index for next N free/used slots (useful for when
we want to allocate chunk of VA-contiguous memory composed of
several pages)
- find how many contiguous free/used slots there are, starting
from specified index (useful for when we want to figure out
how many pages we have until next hole in allocated memory, to
speed up some bulk operations where we would otherwise have to
walk the array and add pages one by one)
This is accomplished by storing a usage mask in-memory, right
after the data section of the array, and using some bit-level
magic to figure out the info we need.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This adds a "--legacy-mem" command-line switch. It will be used to
go back to the old memory behavior, one where we can't dynamically
allocate/free memory (the downside), but one where the user can
get physically contiguous memory, like before (the upside).
For now, nothing but the legacy behavior exists, non-legacy
memory init sequence will be added later. For FreeBSD, non-legacy
memory init will never be enabled, while for Linux, it is
disabled in this patch to avoid breaking bisect, but will be
enabled once non-legacy mode will be fully operational.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Currently it is not possible to use memory that is not owned by DPDK to
perform DMA. This scenarion might be used in vhost applications (like
SPDK) where guest send its own memory table. To fill this gap provide
API to allow registering arbitrary address in VFIO container.
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This can be used as a virt2iova function that only looks up
memory that is owned by DPDK (as opposed to doing pagemap walks).
Using this will result in less dependency on internals of mem API.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This is reverse lookup of PA to VA. Using this will make
other code less dependent on internals of mem API.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
This function is meant to walk over first segment of each
VA-contiguous group of memsegs.
For future users of this function, this is done so that
there is less dependency on internals of mem API and less
noise later change sets.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>