If total memory is already bigger than max memory, the subtraction
will underflow. Fix it by simply stopping whenever we already have
more memory than the maximum.
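
A minimal sketch of the guard, assuming both counters are unsigned 64-bit
values. The helper name remaining_mem() is hypothetical and is not taken
from the patch:

```c
#include <stdint.h>

/*
 * Hypothetical helper: how much more memory may still be reserved.
 * Both arguments are unsigned, so max_mem - total_mem would wrap around
 * to a huge value whenever total_mem > max_mem. Return 0 in that case.
 */
static uint64_t
remaining_mem(uint64_t total_mem, uint64_t max_mem)
{
	if (total_mem >= max_mem)
		return 0; /* already at or above the limit: stop here */
	return max_mem - total_mem;
}
```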
Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, reserving a memzone with length set to 0 will not trigger
any memory allocations, and memzone will instead be looking through
already allocated memory only. Document this limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Size of malloc heap elements includes overhead, which should not
be counted as part of the memzone.
Fixes: fafcc11985a2 ("mem: rework memzone to be allocated by malloc")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Deallocation used the wrong function, which could have resulted in
race conditions because the function does not use locks internally.
Fixes: 1403f87d4fb8 ("malloc: enable memory hotplug support")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
When we ask to reserve virtual areas, we usually include
alignment in the mapping size, and that memory ends up
being wasted. Wasting a gigabyte of VA space while trying to
reserve one gigabyte is pretty expensive on 32-bit, so after
we're done mapping, unmap unneeded space.
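
One way to picture the trimming, as a sketch only: map size plus alignment,
round the start address up, then munmap() the padding on both sides. The
helper name reserve_aligned() is hypothetical; it assumes that size and
align are multiples of the system page size and that align is a power of
two.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: reserve `size` bytes of address space aligned to
 * `align`. Returns NULL on failure.
 */
static void *
reserve_aligned(size_t size, size_t align)
{
	size_t map_sz = size + align;
	uint8_t *base, *aligned;
	size_t head, tail;

	if (map_sz < size)
		return NULL; /* size + align overflowed size_t */

	base = mmap(NULL, map_sz, PROT_NONE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* round the start address up to the requested alignment */
	aligned = (uint8_t *)(((uintptr_t)base + align - 1) &
			~((uintptr_t)align - 1));
	head = (size_t)(aligned - base);
	tail = map_sz - head - size;

	/* give back the padding before and after the aligned region */
	if (head > 0)
		munmap(base, head);
	if (tail > 0)
		munmap(aligned + size, tail);

	return aligned;
}
```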
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Mapping size is a 64-bit integer, but mmap() accepts a size_t for the
mapping length. A user could request a mapping with an alignment such
that (size + alignment) overflows size_t, so check for that overflow
before mapping.
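
The check can be sketched as follows. map_size_fits() is a hypothetical
name, and the exact form of the check in the patch may differ:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute size + align as a size_t.
 * Returns false if either input or their sum does not fit in size_t.
 */
static bool
map_size_fits(uint64_t size, uint64_t align, size_t *out)
{
	if (size > SIZE_MAX || align > SIZE_MAX)
		return false;
	if (size > SIZE_MAX - align)
		return false; /* size + align would wrap around */
	*out = (size_t)(size + align);
	return true;
}
```

Comparing against SIZE_MAX - align avoids performing the addition that
might overflow.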
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
In function 'rte_try_tm':
rte_spinlock.h:82:2:
warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
int retries = RTE_RTM_MAX_RETRIES;
Fixes: ba7468997ea6 ("spinlock: add HTM lock elision for x86")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
rte_lcore.h: In function 'rte_lcore_index':
rte_lcore.h:122:14:
warning: conversion to 'int' from 'unsigned int' may change
the sign of the result [-Wsign-conversion]
lcore_id = rte_lcore_id();
Fixes: 5583037a7950 ("eal: get relative core index")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
rte_common.h:416:9:
warning: conversion to 'uint32_t' {aka 'unsigned int'} from
'int' may change the sign of the result [-Wsign-conversion]
return __builtin_ctz(v);
^~~~~~~~~~~~~~~~
The builtin is defined to return int, but we want to
return it as uint32_t. Its only defined valid return
values are positive integers or zero, which is OK for
uint32_t. So just add an explicit cast.
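
As an illustration, with a hypothetical wrapper name (ctz32() is not the
function touched by this patch):

```c
#include <stdint.h>

/*
 * Hypothetical wrapper: count trailing zero bits of v.
 * __builtin_ctz() is a GCC/Clang builtin that returns int, and its result
 * is undefined for v == 0, so callers must pass a non-zero value.
 */
static inline uint32_t
ctz32(uint32_t v)
{
	/* the result is never negative, so the cast cannot change the value */
	return (uint32_t)__builtin_ctz(v);
}
```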
Fixes: 03f6bced5bba ("eal: use intrinsic function")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
It may be useful to pass arbitrary data to the callback (such
as device pointers), so add this to the mem event callback API.
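
A rough sketch of the idea. All names below (mem_event_cb_t,
mem_event_cb_register(), mem_event_notify()) are hypothetical stand-ins and
not the EAL definitions; the point is only that the opaque pointer given at
registration is handed back on every event:

```c
#include <stddef.h>

/* Hypothetical event type and callback signature, not the EAL ones. */
enum mem_event { MEM_EVENT_ALLOC, MEM_EVENT_FREE };
typedef void (*mem_event_cb_t)(enum mem_event ev, const void *addr,
		size_t len, void *arg);

/* A single registration slot keeps the sketch short. */
static mem_event_cb_t registered_cb;
static void *registered_arg;

static void
mem_event_cb_register(mem_event_cb_t cb, void *arg)
{
	registered_cb = cb;
	registered_arg = arg; /* stored as-is, passed back on every event */
}

static void
mem_event_notify(enum mem_event ev, const void *addr, size_t len)
{
	if (registered_cb != NULL)
		registered_cb(ev, addr, len, registered_arg);
}

/* Example callback: arg points to a caller-owned byte counter. */
static void
count_allocated(enum mem_event ev, const void *addr, size_t len, void *arg)
{
	size_t *total = arg;

	(void)addr;
	if (ev == MEM_EVENT_ALLOC)
		*total += len;
}
```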
Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Currently, when deallocating pages, malloc will fixup other
elements' headers if there is not enough space to store a full
element in leftover space. This leads to race conditions because
there are some functions that check for pad size with an unlocked
heap, expecting pad size to be constant.
Fix it by being more conservative and only freeing pages when
there is enough space before and after the page to store a free
element.
Fixes: 1403f87d4fb8 ("malloc: enable memory hotplug support")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
The pad value is not used unless element is in pad state, but it
will show up in heap dumps and may be confusing.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
After the commit below, we encounter two strange issues:
1) Deadlock as described here:
http://dpdk.org/ml/archives/dev/2018-April/099806.html
2) SIGSEGV when starting testpmd in a VM.
Since the commit below switched the barrier from stack memory to
dynamically allocated memory, we suspect a use-after-free.
Fixes: 3d09a6e26d8b ("eal: fix threads block on barrier")
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reported-by: Lei Yao <lei.a.yao@intel.com>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Suggested-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
params is not freed if pthread_create() fails. The fix is
straightforward.
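
The ownership rule can be sketched like this. struct thread_params,
trampoline() and create_thread() are hypothetical and much simpler than the
EAL code; the sketch only shows who frees params on each path:

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical parameter block, not the EAL structure. */
struct thread_params {
	void *(*start_routine)(void *);
	void *arg;
};

static void *
trampoline(void *p)
{
	struct thread_params *params = p;
	void *(*fn)(void *) = params->start_routine;
	void *arg = params->arg;

	free(params); /* the new thread owns params once it is running */
	return fn(arg);
}

static int
create_thread(pthread_t *tid, void *(*fn)(void *), void *arg)
{
	struct thread_params *params = malloc(sizeof(*params));
	int ret;

	if (params == NULL)
		return -ENOMEM;
	params->start_routine = fn;
	params->arg = arg;

	ret = pthread_create(tid, NULL, trampoline, params);
	if (ret != 0) {
		free(params); /* no thread exists, so nobody else frees it */
		return -ret;
	}
	return 0;
}

/* Trivial thread body used to exercise the sketch. */
static void *
echo_arg(void *arg)
{
	return arg;
}
```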
Fixes: 3d09a6e26d8b ("eal: fix threads block on barrier")
Reported-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
When heap initializes, we need to add already allocated segments
onto the heap. However, in doing that, we never increased total
heap size. Fix it by adding segment length to total heap length
when initializing the heap.
Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't as useful because we still need to grow and shrink it.
However, according to fcntl()'s Ubuntu manpage [1], semantics of
fcntl() locks have a giant oversight:
This interface follows the completely stupid semantics of System
V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
locks associated with a file for a given process are removed
when any file descriptor for that file is closed by that process.
This semantic means that applications must be aware of any files
that a subroutine library may access.
Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fd's) will drop the lock completely.
So, in this commit, we revert to using flock() locks everywhere.
However, that still leaves the problem of locking parts of a memseg
list file in single file segments mode, which we solve by creating
a separate lock file for each page and tracking those with flock().
We will also be removing all of this tailq business and replacing it
with a simple array - saving a few bytes is not worth the extra
hassle of dealing with pointers and potential memory allocation
failures. Also, remove the tailq lock since it is not needed - these
fd lists are per-process, and within a given process, it is always
only one thread handling access to hugetlbfs.
So, first one to allocate a segment will create a lockfile, and put
a shared lock on it. When we're shrinking the page file, we will be
trying to take out a write lock on that lockfile, which would fail if
any other process is holding onto the lockfile as well. This way, we
can know if we can shrink the segment file. Also, if no other locks
are found in the lock list for a given memseg list, the memseg list
fd is automatically closed.
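
The shared/exclusive interplay described above can be sketched with plain
flock() calls. The helper names are hypothetical. In this sketch a second
descriptor opened on the same lock file stands in for another process,
which works because flock() locks belong to the open file description:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open (creating if needed) a lock file and take a shared lock on it.
 * Returns the descriptor holding the lock, or -1 on failure. */
static int
lockfile_acquire_shared(const char *path)
{
	int fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_SH | LOCK_NB) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Probe whether any lock is still held on the lock file by trying to take
 * an exclusive lock through a fresh descriptor.
 * Returns 1 if nobody holds a lock, 0 if it is busy, -1 on error.
 */
static int
lockfile_is_unused(const char *path)
{
	int fd = open(path, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) == 0)
		ret = 1;
	else
		ret = (errno == EWOULDBLOCK) ? 0 : -1;
	close(fd); /* closing this descriptor drops the probe lock, if taken */
	return ret;
}
```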
One other thing to note is, according to flock() Ubuntu manpage [2],
upgrading the lock from shared to exclusive is implemented by dropping
and reacquiring the lock, which is not atomic and thus would have
created race conditions. So, on attempting to perform operations in
hugetlbfs, we will take out a write lock on the hugetlbfs directory, so
that only one process can perform hugetlbfs operations at a time.
[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html
Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66b4 ("eal: add single file segments option")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Currently, memseg lists for secondary process are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
For non-legacy mode, we are preallocating space for hugepages, so
we know in advance which pages we will be able to allocate, and
which we won't. However, the init procedure was using hugepage
counts gathered from sysfs and paid no attention to hugepage
sizes that were actually available for reservation, and failed
on attempts to reserve unavailable pages.
Fix this by limiting total page counts by number of pages
actually preallocated.
Also, VA preallocate procedure only looks at mountpoints that are
available, and expects pages to exist if a mountpoint exists. That
might not necessarily be the case, so also check if there are
hugepages available for a particular page size on a particular
NUMA node.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
Previously, if we couldn't preallocate VA space on 32-bit for
one page size, we simply bailed out, even though we could've
tried allocating VA space with other page sizes.
For example, if the user had both 1G and 2M pages enabled, and
had asked DPDK to allocate memory on both sockets, DPDK
would've tried to allocate VA space for 1x1G page on both
sockets, failed and never tried again, even though it
could've allocated the same 1G of VA space for 512x2M pages.
Fix this by retrying with different page sizes if VA space
reservation failed.
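
The retry logic can be pictured roughly as below. reserve_with_fallback()
and the reserve_fn callback are hypothetical, and demo_reserve() is an
artificial stand-in for an address space where only the 2M reservation
succeeds:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: try to reserve VA space for n_pages pages of
 * page_sz bytes each. Returns 0 on success. */
typedef int (*reserve_fn)(uint64_t page_sz, uint64_t n_pages);

/*
 * Try each page size in turn until one reservation succeeds.
 * Returns the index of the page size that worked, or -1 if none did.
 */
static int
reserve_with_fallback(const uint64_t *page_szs, size_t n_szs,
		uint64_t mem_sz, reserve_fn reserve)
{
	size_t i;

	for (i = 0; i < n_szs; i++) {
		uint64_t n_pages = mem_sz / page_szs[i];

		if (n_pages == 0)
			continue; /* request is smaller than one page */
		if (reserve(page_szs[i], n_pages) == 0)
			return (int)i;
		/* reservation failed: retry with the next page size */
	}
	return -1;
}

/* Demo callback: pretend that only 2M-page reservations succeed, and
 * record how many pages were requested. */
static uint64_t demo_last_n_pages;

static int
demo_reserve(uint64_t page_sz, uint64_t n_pages)
{
	if (page_sz != (UINT64_C(2) << 20))
		return -1;
	demo_last_n_pages = n_pages;
	return 0;
}
```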
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
32-bit mode has an upper limit on amount of VA space it can preallocate,
but the original implementation used the wrong constant, resulting in
failure to initialize due to integer overflow. Fix it by using the
correct constant.
Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
Previous code checked for both first/last elements being NULL,
but if they weren't, the expectation was that they're both
non-NULL, which will be the case under normal conditions, but
may not be the case due to heap structure corruption.
Coverity issue: 272566
Fixes: bb372060dad4 ("malloc: make heap a doubly-linked list")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
Technically, while the pointer would've been invalid if msl_idx
were invalid, we wouldn't have actually attempted to access the
pointer until verifying the index. Fix it by moving array access
to after we've verified validity of the index.
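
In sketch form, with hypothetical names and a hypothetical list count (this
is not the memseg list code itself):

```c
#include <stddef.h>

#define MAX_LISTS 4 /* hypothetical number of lists */

struct seg_list {
	int id;
};

static struct seg_list lists[MAX_LISTS];

static struct seg_list *
get_list(int idx)
{
	struct seg_list *msl;

	if (idx < 0 || idx >= MAX_LISTS)
		return NULL;
	msl = &lists[idx]; /* pointer formed only after idx is validated */
	return msl;
}
```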
Coverity issue: 272574
Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
If user has specified a flag to unmap the area right after mapping it,
we were passing an already-unmapped pointer to RTE_LOG. This is not an
issue since RTE_LOG doesn't actually dereference the pointer, but fix
it anyway by moving call to RTE_LOG to before unmap.
Coverity issue: 272584
Fixes: b7cc54187ea4 ("mem: move virtual area function in common directory")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Below commit introduced pthread barrier for synchronization.
But two IPC threads block on the barrier, and never wake up.
(gdb) bt
#0 futex_wait (private=0, expected=0, futex_word=0x7fffffffcff4)
at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1 futex_wait_simple (private=0, expected=0, futex_word=0x7fffffffcff4)
at ../sysdeps/nptl/futex-internal.h:135
#2 __pthread_barrier_wait (barrier=0x7fffffffcff0) at pthread_barrier_wait.c:184
#3 rte_thread_init (arg=0x7fffffffcfe0)
at ../dpdk/lib/librte_eal/common/eal_common_thread.c:160
#4 start_thread (arg=0x7ffff6ecf700) at pthread_create.c:333
#5 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Through analysis, we found that defining the barrier on the stack could
be the root cause. This patch allocates the barrier on the heap instead.
Fixes: d651ee4919cd ("eal: set affinity for control threads")
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
This patch adds APIs to support container create/destroy and device
bind/unbind with a container. It also provides an API for IOMMU
programming on a specified container.
A driver could use "rte_vfio_container_create" helper to create a new
container from eal, use "rte_vfio_container_group_bind" to bind a device
to the newly created container. During rte_vfio_setup_device the container
bound with the device will be used for IOMMU setup.
Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
The auxiliary vector read is implemented only for Linux.
It could be done with procstat_getauxv() for FreeBSD.
Since the commit below, the auxiliary vector functions
are compiled for every architecture, including x86,
which is tested with FreeBSD.
This patch moves the Linux implementation into the Linux directory
and adds a fake/empty implementation for FreeBSD.
Fixes: 2ed9bf330709 ("eal: abstract away the auxiliary vector")
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
The fake getauxval function does not use its parameter.
So the compiler raised this error:
lib/librte_eal/common/eal_common_cpuflags.c:25:25: error:
unused parameter 'type'
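
A common way to silence this warning is to reference the parameter in a
void cast. The stub below is a sketch with a hypothetical name, and the
patch itself may use a different idiom:

```c
/* Stub for platforms without an auxiliary vector (hypothetical name). */
static unsigned long
stub_getauxval(unsigned long type)
{
	(void)type; /* intentionally unused; avoids -Wunused-parameter */
	return 0;
}
```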
Fixes: 2ed9bf330709 ("eal: abstract away the auxiliary vector")
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
rte_lcore_has_role() returns 0 if role of lcore matches requested
role. The return value of the API is confusing, and this is a known
problem with a deprecation notice announcing the change to more
intuitive semantics:
Commit 064518f68d48 ("doc: announce EAL API change to lcore role function")
Implement changes announced in the deprecation notice, and remove it.
Also, fix usages of this API to reflect the change. Control thread patches
expected new behavior and were broken before, now they are fixed as well.
Fixes: d651ee4919cd ("eal: set affinity for control threads")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Signed-off-by: Erik Gabriel Carrillo <erik.g.carrillo@intel.com>
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
This commit removes the experimental tags from the
service cores functions; they now become part of the
main DPDK API/ABI.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Regular expressions are not the best way to match a hierarchical
pattern like dynamic log levels. Moreover, the separator for dynamic
log levels is a period, which is the regex wildcard character.
A better solution is to use filename matching 'globbing' so
that log levels match like file paths. For compatibility,
use colon to separate pattern match style arguments. For
example:
--log-level 'pmd.net.virtio.*:debug'
This also makes the documentation match what really happens
internally.
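
To illustrate the difference, a sketch using POSIX fnmatch(). The helper
name logtype_matches() and the log type names in the usage are made up for
the example:

```c
#include <fnmatch.h>
#include <stdbool.h>

/* Hypothetical helper: does a log type name match a glob pattern? */
static bool
logtype_matches(const char *pattern, const char *name)
{
	/* in a glob, '.' is a literal character and '*' is the wildcard */
	return fnmatch(pattern, name, 0) == 0;
}
```

With this, "pmd.net.virtio.*" matches "pmd.net.virtio.init" but not
"pmd.net.ixgbe.init", and a period in the pattern only matches a period.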
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
We don't want format of eal log level saved values to be visible
in ABI. Move to private storage in eal_common_log.
Includes minor optimization. Compile the regular expression for
each log match once, rather than each time it is used.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Rather than attempting to load the contents of the auxv directly,
prefer to use an exposed API - and if that doesn't exist then attempt
to load the vector. This is because on some systems, when a user
is downgraded, the /proc/self/auxv file retains the old ownership
and permissions. The original method of /proc/self/auxv is retained.
This also removes a potential abort() in the code when compiled with
NDEBUG. A quick parse of the code shows that much (if not all) of
the CPU flag parsing isn't used internally, so it should be okay.
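
On Linux with glibc 2.16 or later, the exposed API is getauxval() from
<sys/auxv.h>. A minimal sketch of preferring it, with a hypothetical
wrapper name and without the /proc/self/auxv fallback:

```c
#include <sys/auxv.h>

/*
 * Hypothetical wrapper: look up one auxiliary vector entry through the
 * C library instead of opening /proc/self/auxv. getauxval() returns 0
 * when the entry is not present.
 */
static unsigned long
auxv_lookup(unsigned long type)
{
	return getauxval(type);
}
```

For CPU flag detection the entries of interest would be AT_HWCAP and
AT_HWCAP2, whose bit meanings are architecture-specific.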
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Add the priority RTE_PRIORITY_LAST, used for initialization routines
meant to be run after all other constructors.
This priority becomes the default priority for all DPDK constructors.
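
How constructor priorities order initialization can be sketched with the
GCC/Clang constructor attribute. The macro names and values below are
assumptions made for the example: 101 is the lowest number not reserved by
the compiler and 65535 is the highest it accepts. Lower numbers run first:

```c
/* Assumed values for illustration only. */
#define PRIO_EARLY 101
#define PRIO_LAST 65535

static int init_order[2];
static int init_count;

/* Declared first, but runs last because of its priority number. */
static void __attribute__((constructor(PRIO_LAST), used))
init_late(void)
{
	init_order[init_count++] = 2;
}

static void __attribute__((constructor(PRIO_EARLY), used))
init_early(void)
{
	init_order[init_count++] = 1;
}
```

Both functions run before main(), so by then init_order holds 1 then 2.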
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Build a central list to quickly see every priority used for
constructors, making it possible to verify that they are all above 100
and in the proper order.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
The previous symbols were deprecated for two releases.
They are now marked as such and cannot be used anymore.
They are replaced by ones respecting the new namespace that are marked
experimental.
As a result, eth_dev attach and detach are slightly reworked to follow
the changes.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
rte_eal_devargs is useless, rte_devargs is sufficient.
Only experimental functions are changed for now.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
rte_eal_devargs_parse can be used by EAL subsystems, drivers,
applications alike.
Device parameters may be presented with different structure each time;
as a single declaration string or several strings each describing
different parts of the declaration.
To simplify the use of this parsing facility, its parameters are made
variadic.
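
As a sketch of what a variadic front end can look like, assuming a
printf-style format string (the helper devstr_format() is hypothetical and
is not the EAL function): the pieces of a declaration are assembled into
one string before parsing.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build a device declaration string from a
 * printf-style format. Returns the string length, or -1 if formatting
 * failed or the result did not fit in buf.
 */
static int
devstr_format(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

A caller holding the device name and its arguments separately could then
pass "%s,%s" with both strings instead of concatenating them by hand.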
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Initially, rte_devargs was meant to be populated once and sometimes
accessed, then never emptied.
With the new hotplug functionality having better standing, new usage
appeared with repeated addition of devices and their subsequent removal.
Exposing devargs_list pushed bus drivers and libraries to be careless
and inconsistent in their memory management. Making it private will
make it possible to rationalize this part of the EAL and ensure that
fewer memory leaks occur during operations.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
In preparation to making devargs_list private.
Bus drivers generally need to access rte_devargs pertaining to their
operations. This match is a common operation for bus drivers.
Add a new accessor for the rte_devargs list.
Signed-off-by: Gaetan Rivet <gaetan.rivet@6wind.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>