The Linux kernel module vfio-pci introduces the VF token to enable
SR-IOV support since 5.7.
The VF token can be set by a vfio-pci based PF driver and must be known
by the vfio-pci based VF driver in order to gain access to the device.
Since the vfio-pci module uses the VF token as internal data to provide
the collaboration between SR-IOV PF and VFs, so DPDK can use the same
VF token for all PF devices by specifying the related EAL option.
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Tested-by: Harman Kalra <hkalra@marvell.com>
The startup of VFIO is too noisy. Logging is expensive on some
systems, and distracting to the user.
It should not be logging at NOTICE level, reduce it to INFO level.
It really should be DEBUG here but that would hide it by default.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
The 'group_status' has never been used and can be removed.
Fixes: 94c0776b1b ("vfio: support hotplug")
Cc: stable@dpdk.org
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Move common functions between Unix and Windows to eal_common_options.c.
Those functions are getter functions for rte_application_usage_hook.
Signed-off-by: Tal Shnaiderman <talshn@mellanox.com>
Move common functions between Unix and Windows to eal_common_config.c.
Those functions are getter functions for IOVA,
configuration, Multi-process.
Move rte_config, internal_config, early_mem_config and runtime_dir
to be defined in the common file with getter functions.
Refactor the users of the config variables above to use
the getter functions.
Signed-off-by: Tal Shnaiderman <talshn@mellanox.com>
Move common functions between Unix and Windows to eal_common_debug.c.
Those functions are rte_exit, __rte_panic and rte_dump_registers
which has the same implementation on Unix and Windows.
Signed-off-by: Tal Shnaiderman <talshn@mellanox.com>
An issue has been observed where epoll file descriptor
list rebuilds every time an interrupt/alarm event is
received.
eal_intr_process_interrupts() should notify pipe fd only
if any source is removed from the source list i.e (rv > 0)
Fixes: 0c7ce182a7 ("eal: add pending interrupt callback unregister")
Cc: stable@dpdk.org
Signed-off-by: Harman Kalra <hkalra@marvell.com>
EAL common timer doesn't compile under Windows.
Compilation log:
error LNK2019:
unresolved external symbol nanosleep referenced in function
rte_delay_us_sleep
error LNK2019:
unresolved external symbol get_tsc_freq referenced in function set_tsc_freq
error LNK2019:
unresolved external symbol sleep referenced in function set_tsc_freq
The reason was that some functions called POSIX functions.
The solution was to move POSIX dependent functions from common to Unix.
Signed-off-by: Fady Bader <fady@mellanox.com>
Reviewed-by: Tal Shnaiderman <talshn@mellanox.com>
Acked-by: Ranjit Menon <ranjit.menon@intel.com>
Code in Linux EAL that supports dynamic memory allocation (as opposed to
static allocation used by FreeBSD) is not OS-dependent and can be reused
by Windows EAL. Move such code to a file compiled only for the OS that
require it. Keep Anatoly Burakov maintainer of extracted code.
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
All supported OS create memory segment lists (MSL) and reserve VA space
for them in a nearly identical way. Move common code into EAL private
functions to reduce duplication.
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Introduce OS-independent wrappers for memory management operations used
across DPDK and specifically in common code of EAL:
* rte_mem_map()
* rte_mem_unmap()
* rte_mem_page_size()
* rte_mem_lock()
Windows uses different APIs for memory mapping and reservation, while
Unices reserve memory by mapping it. Introduce EAL private functions to
support memory reservation in common code:
* eal_mem_reserve()
* eal_mem_free()
* eal_mem_set_dump()
Wrappers follow POSIX semantics limited to DPDK tasks, but their
signatures deliberately differ from POSIX ones to be more safe and
expressive. New symbols are internal. Being thin wrappers, they require
no special maintenance.
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Introduce OS-independent wrappers in order to support common EAL code
on Unix and Windows:
* eal_file_open: open or create a file.
* eal_file_lock: lock or unlock an open file.
* eal_file_truncate: enforce a given size for an open file.
Implementation for Linux and FreeBSD is placed in "unix" subdirectory,
which is intended for common code between the two. These thin wrappers
require no special maintenance.
Common code supporting multi-process doesn't use the new wrappers,
because it is inherently Unix-specific and would impose excessive
requirements on the wrappers.
Signed-off-by: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Initially, printf was used to indicate and error/warning resulting from
telemetry initialisation. This is now fixed to use EAL logs for
notices, and the unnecessary printf for an error is removed.
Fixes: eeb486f3ba ("eal: add telemetry as dependency")
Fixes: dd6275a424 ("telemetry: fix error log output")
Signed-off-by: Ciara Power <ciara.power@intel.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
The threads for listening on the telemetry sockets are control threads
and should be separated from those on the data plane. Since telemetry
cannot use the rte_ctrl_thread_create() API, as it does not depend on
EAL, we pass the ctrl thread cpu_set to telemetry init and use it
directly to ensure that telemetry cannot interfere with the data plane
threads.
Signed-off-by: Ciara Power <ciara.power@intel.com>
Acked-by: Kevin Laatz <kevin.laatz@intel.com>
As Telemetry no longer uses rte_option, and was the only user of this
infrastructure, it can now be removed.
Signed-off-by: Ciara Power <ciara.power@intel.com>
Reviewed-by: Keith Wiles <keith.wiles@intel.com>
This patch moves telemetry further down the build, and adds it as a
dependency for EAL. Telemetry V2 is now configured to build by default,
and the legacy support is built when the telemetry config flag is set.
Telemetry now has EAL flags, shown below:
"--telemetry" = Enables telemetry (this is default if no flags given)
"--no-telemetry" = Disables telemetry
When telemetry is enabled, it will attempt to open the new socket
version, and also the legacy support socket (this will depend on Jansson
external dependency and telemetry config flag, as before).
Signed-off-by: Ciara Power <ciara.power@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Reviewed-by: Keith Wiles <keith.wiles@intel.com>
Currently, even though memory is mapped with PROT_NONE, this does not
cause it to be excluded from core dumps. This is counter-productive,
because in a lot of cases, this memory will go unused (e.g. when the
memory subsystem preallocates VA space but hasn't yet mapped physical
pages into it).
Use `madvise()` call with MADV_DONTDUMP/MADV_NOCORE to exclude the
unused memory from being dumped.
Signed-off-by: Li Feng <fengli@smartx.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Commit 8a4baf06c1 ("mem: mark pages as not accessed when reserving VA")
has mapped the initialized memory with PROT_NONE, and when it's unmapped,
eal_memalloc.c should remmap the anonymous memory with PROT_NONE too.
Fixes: 8a4baf06c1 ("mem: mark pages as not accessed when reserving VA")
Cc: stable@dpdk.org
Signed-off-by: Li Feng <fengli@smartx.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Add the following interrupt related tracepoints.
- rte_eal_trace_intr_callback_register()
- rte_eal_trace_intr_callback_unregister()
- rte_eal_trace_intr_enable()
- rte_eal_trace_intr_disable()
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Add the following thread related tracepoints.
- rte_eal_trace_thread_remote_launch()
- rte_eal_trace_thread_lcore_ready()
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Add following alarm related trace points.
- rte_eal_trace_alarm_set()
- rte_eal_trace_alarm_cancel()
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
This patch creates the following generic tracepoint for
generic tracing when there is no dedicated tracepoint is
available.
- rte_eal_trace_generic_void()
- rte_eal_trace_generic_u64()
- rte_eal_trace_generic_u32()
- rte_eal_trace_generic_u16()
- rte_eal_trace_generic_u8()
- rte_eal_trace_generic_i64()
- rte_eal_trace_generic_i32()
- rte_eal_trace_generic_i16()
- rte_eal_trace_generic_i8()
- rte_eal_trace_generic_int()
- rte_eal_trace_generic_long()
- rte_eal_trace_generic_float()
- rte_eal_trace_generic_double()
- rte_eal_trace_generic_ptr()
- rte_eal_trace_generic_str()
For example, if an application wishes to emit an int datatype,
it can call rte_eal_trace_generic_int(val) to emit the trace.
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Connect the internal trace interface API to Linux EAL.
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Common trace format(CTF) defines the metadata[1][2] for trace events,
This patch creates the metadata for the DPDK events in memory and
later this will be saved to trace directory on rte_trace_save()
invocation.
[1] https://diamon.org/ctf/#specification
[2] https://diamon.org/ctf/#examples
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Define eal_trace_init() and eal_trace_fini() EAL interface
functions that rte_eal_init() and rte_eal_cleanup() function can
use to initialize and finalize the trace subsystem.
eal_trace_init() function will add the following functionality if
trace is enabled through EAL command line param.
- Test for trace registration failure.
- Test for duplicate trace name registration.
- Generate UUID ver 4.
- Create a trace directory.
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Define the public API for trace support.
This patch also adds support for the build infrastructure and
update the MAINTAINERS file for the trace subsystem.
The 8 bytes tracepoint object is a global variable, and can be used in
fast path. Created a new __rte_trace_point section to store the
tracepoint objects as,
- It is a mostly read-only data and not to mix with other "write"
global variables.
- Chances that the same subsystem fast path variables come in the same
fast path cache line. i.e, it will enable a more predictable
performance number from build to build.
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Introduce rte_thread_getname() API to get the thread name
and implement it for Linux and FreeBSD.
FreeBSD does not support getting the thread name.
One of the consumers of this API will be the trace subsystem where
it used as an informative purpose.
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: David Marchand <david.marchand@redhat.com>
This patch fixes the heap-use-after-free bug which was found by ASAN
(Address-Sanitizer) in the vfio_get_default_container_fd function.
Fixes: 6bcb7c95fe ("vfio: share default container in multi-process")
Cc: stable@dpdk.org
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
This fix treats a 0 return value from vfio_open_group_fd
in vfio_get_group_fd as the intended error condition instead
of putting an incorrect 0 file descriptor in the vfio_group table.
Sometimes, the creation of device files in sysfs is not
instantaneously causing vfio_open_groupfd to return 0.
This has been observed when hot removing/adding multiple
NVMe devices (>=4).
Fixes: 340b7bb8d5 ("vfio: extend data structure for multi container")
Cc: stable@dpdk.org
Signed-off-by: Michael Haeuptle <michael.haeuptle@hpe.com>
Acked-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
The new macro __rte_noreturn, for compiler hinting,
is now used where appropriate for consistency.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
There is a common macro __rte_unused, avoiding warnings,
which is now used where appropriate for consistency.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Remove setting ALLOW_EXPERIMENTAL_API individually for each Makefile and
meson.build. Instead, enable ALLOW_EXPERIMENTAL_API flag across app, lib
and drivers.
This changes reduces the clutter across the project while still
maintaining the functionality of ALLOW_EXPERIMENTAL_API i.e. warning
external applications about experimental API usage.
Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Clean up indent and line ordering in Makefile and meson.build
for consistency in linux/ and freebsd/ directories.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: David Marchand <david.marchand@redhat.com>
Since the kernel modules are moved to kernel/ directory,
there is no need anymore for the sub-directory eal/ in
linux/, freebsd/ and windows/.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: David Marchand <david.marchand@redhat.com>
The EAL API (with doxygen documentation) is moved from
common/include/ to include/, which makes more clear that
it is the global API for all environments and architectures.
Note that the arch-specific and OS-specific include files are not
in this global include directory, but include/generic/ should
cover the doxygen documentation for them.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: David Marchand <david.marchand@redhat.com>
The arch-specific directories arm, ppc and x86 in common/arch/
are moved at the same level as the OS-specific directories.
It makes more clear that EAL is covering a matrix combining OS and arch.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: David Marchand <david.marchand@redhat.com>
When moving files to the directory kernel/,
the file BSDmakefile.meson was left in eal/.
Also the intermediate makefiles in linux/ and freebsd/ became useless.
Fixes: acaa9ee991 ("move kernel modules directories")
Cc: stable@dpdk.org
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: David Marchand <david.marchand@redhat.com>
When --no-huge mode is used, the memory is currently allocated with
mmap(NULL, ...). This is fine in most cases, but can fail in cases
where DPDK is run on a machine with an IOMMU that is of more limited
address width than that of a VA, because we're not specifying the
address hint for mmap() call.
Fix it by preallocating VA space before mapping it.
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: David Marchand <david.marchand@redhat.com>
Tested-by: Jun W Zhou <junx.w.zhou@intel.com>
Currently, when we are creating DMA mappings for memory that's
either external or is backed by hugepages in IOVA as PA mode, we
assume that each page is necessarily discontiguous. This may not
actually be the case, especially for external memory, where the
user is able to create their own IOVA table and make it
contiguous. This is a problem because VFIO has a limited number
of DMA mappings, and it does not appear to concatenate them and
treats each mapping as separate, even when they cover adjacent
areas.
Fix this so that we always map contiguous memory in a single
chunk, as opposed to mapping each segment separately.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Ray Kinsella <ray.kinsella@intel.com>
Added an API to check if current execution is in interrupt
context. This will be helpful to handle nested interrupt cases.
Signed-off-by: Harman Kalra <hkalra@marvell.com>
Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
The loop to unwind existing mmaps was only unmapping the
first segment and the error paths after mmap() were not
doing munmap of the current segment.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Cc: stable@dpdk.org
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
ppc64le failed when using large physical memory. I found problems in my two
commits in the past.
In commit e072d16f89 ("vfio: fix expanding DMA area in ppc64le"), I added
a sanity check using a mapped address to resolve an issue around expanding
IOMMU window, but this was not enough, since memory allocation can return
memory anywhere dependent on memory fragmentation. DPDK may still skip DMA
mapping and attempts to unmap non-mapped DMA during expanding IOMMU window.
As a result, SPDK apps using large physical memory frequently failed to
proceed the communication with NVMe and/or went into an infinite loop.
The root cause of the bug was in a gap between memory segments managed by
DPDK and firmware-level DMA mapping. DPDK's memory segments don't contain
the state of DMA mapping, and so, the memesg_walk cannot determine if an
iterated memory segment is mapped or not. This resulted in incorrect DMA
maps and unmaps.
At this time, I added the code to avoid iterating non-mapped memory
segments during DMA mapping. The memseg_walk iterates over memory segments
marked as "used", and so, the code sets memory segments that will be
mapped or unmapped as "free" transiently.
The commit db90b4969e ("vfio: retry creating sPAPR DMA window") allows
retring different page levels and sizes to create DMA window. However, this
allows page sizes different from hugepage sizes. This inconsistency caused
failures at the time of DMA mapping after the window creation. This patch
fixes to retry only different page levels.
Fixes: e072d16f89 ("vfio: fix expanding DMA area in ppc64le")
Fixes: db90b4969e ("vfio: retry creating sPAPR DMA window")
Cc: stable@dpdk.org
Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>
The header linux/version.h isn't included when CONFIG_RTE_EAL_VFIO
is explicitly disabled. LINUX_VERSION_CODE and KERNEL_VERSION are
therefore undefined, causing the build failure:
lib/librte_eal/linux/eal/eal.c: In function ‘rte_eal_init’:
lib/librte_eal/linux/eal/eal.c:1076:32: error: "LINUX_VERSION_CODE" is
not defined, evaluates to 0 [-Werror=undef]
Fixes: a0dede62a5 ("eal/linux: remove KNI restriction on IOVA")
Cc: stable@dpdk.org
Signed-off-by: Ali Alnubani <alialnu@mellanox.com>
Previous fix gives hiccups to gcc on RHEL 7.6:
== Build lib/librte_eal/linux/eal
CC eal_interrupts.o
...lib/librte_eal/linux/eal/eal_interrupts.c: In function
‘eal_intr_thread_main’:
...lib/librte_eal/linux/eal/eal_interrupts.c:1048:9: error: missing
initializer for field ‘events’ of ‘struct epoll_event’
[-Werror=missing-field-initializers]
struct epoll_event ev = { };
^
In file included from ...lib/librte_eal/linux/eal/eal_interrupts.c:15:0:
/usr/include/sys/epoll.h:89:12: note: ‘events’ declared here
uint32_t events; /* Epoll events */
^
...lib/librte_eal/linux/eal/eal_interrupts.c: At top level:
cc1: error: unrecognized command line option
"-Wno-address-of-packed-member" [-Werror]
cc1: all warnings being treated as errors
Fixes: e0ab8020ac ("eal/linux: fix uninitialized data valgrind warning")
Cc: stable@dpdk.org
Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Valgrind reports that eal interrupt thread is calling epoll_ctl
with uninitialized data.
This is a false positive, because the kernel is not going to care about
the unused bits in the union but trivial to fix by initializing it.
Fixes: af75078fec ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Marchand <david.marchand@redhat.com>
The 'get_user_pages_remote()' API is updated in kernel 4.10.0 [1],
but the check added as > 4.9.0,
this logic is broken for kernels 4.9.x, because they justify
> 4.9.0 check but have the old API.
Fixing the check as >= 4.10.0
[1]
commit 5b56d49fc31d ("mm: add locked parameter to get_user_pages_remote()")
Fixes: d965af9e8a ("kni: increase kernel version requirement for VA")
Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Suggested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
A build error reported related to the selected 'get_user_pages_remote()'
kernel API:
.../kernel/linux/kni/kni_dev.h:113:8:
error: too few arguments to function ‘get_user_pages_remote’
ret = get_user_pages_remote(tsk, tsk->mm, iova, 1
^~~~~~~~~~~~~~~~~~~~~
Currently there are three versions of the 'get_user_pages_remote()'
supported, based on kernel version < 4.9, = 4.9, > 4.9.
These version based checks are not working fine with the distro kernels
which is the cause of reported build error. The error reported by the
kernel version 4.8, but it is using API defined in > 4.9.
To be able to take control of this, and possible more, related build
error, increasing the minimum supported kernel version for iova=va with
KNI to kernel version 4.9.
This leaves us with single version of the kernel API and more manageable.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>