numam-dpdk

Author	SHA1	Message	Date
Anatoly Burakov	4236694f0a	mem: preallocate VA space in no-huge mode When --no-huge mode is used, the memory is currently allocated with mmap(NULL, ...). This is fine in most cases, but can fail in cases where DPDK is run on a machine with an IOMMU that is of more limited address width than that of a VA, because we're not specifying the address hint for mmap() call. Fix it by preallocating VA space before mapping it. Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: David Marchand <david.marchand@redhat.com> Tested-by: Jun W Zhou <junx.w.zhou@intel.com>	2020-03-27 11:04:09 +01:00
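
The "preallocate VA space, then map into it" idea from the commit above can be sketched roughly as follows. This is a minimal Linux illustration, not the EAL code: `reserve_va` and `map_over_reservation` are hypothetical helper names, and the sketch assumes anonymous memory stands in for whatever no-huge mode actually maps.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve a range of address space without backing it with memory.
 * hint may be NULL; the kernel is free to ignore a non-NULL hint.
 * (Hypothetical helper, not a DPDK function.)
 */
static void *
reserve_va(void *hint, size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(hint, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Replace the reservation with usable anonymous memory at the same
 * address. MAP_FIXED is safe here only because we own the whole range.
 */
static void *
map_over_reservation(void *reserved, size_t len)
{
	void *p;

	assert(reserved != NULL && len > 0);
	p = mmap(reserved, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

Reserving first lets the caller choose (or at least check) the address before committing memory to it, instead of accepting whatever a bare `mmap(NULL, ...)` returns.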
Anatoly Burakov	d1c7c0cdf7	vfio: map contiguous areas in one go Currently, when we are creating DMA mappings for memory that's either external or is backed by hugepages in IOVA as PA mode, we assume that each page is necessarily discontiguous. This may not actually be the case, especially for external memory, where the user is able to create their own IOVA table and make it contiguous. This is a problem because VFIO has a limited number of DMA mappings, and it does not appear to concatenate them and treats each mapping as separate, even when they cover adjacent areas. Fix this so that we always map contiguous memory in a single chunk, as opposed to mapping each segment separately. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Ray Kinsella <ray.kinsella@intel.com>	2020-03-27 10:09:22 +01:00
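
The coalescing described in the commit above might look something like the sketch below. The `struct seg` descriptor and the `map_fn` callback are hypothetical stand-ins (they are not DPDK's memseg type or the VFIO ioctl); the point is only that adjacent segments are merged before a mapping is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor; not DPDK's struct rte_memseg. */
struct seg {
	uint64_t va;   /* virtual address of the segment */
	uint64_t iova; /* IO virtual address of the segment */
	uint64_t len;  /* length in bytes */
};

/* Called once per contiguous chunk; stands in for a DMA map request. */
typedef int (*map_fn)(uint64_t va, uint64_t iova, uint64_t len, void *arg);

/* Example callback: accumulates the number of bytes mapped into *arg. */
static int
sum_len(uint64_t va, uint64_t iova, uint64_t len, void *arg)
{
	(void)va;
	(void)iova;
	*(uint64_t *)arg += len;
	return 0;
}

/*
 * Walk segments in order and merge neighbours that are adjacent in both
 * VA and IOVA space, so each contiguous run costs a single map call.
 * Returns the number of map calls made, or -1 if a callback fails.
 */
static int
map_contiguous(const struct seg *segs, size_t n, map_fn map, void *arg)
{
	size_t i = 0;
	int calls = 0;

	assert(map != NULL);
	while (i < n) {
		uint64_t va = segs[i].va;
		uint64_t iova = segs[i].iova;
		uint64_t len = segs[i].len;
		size_t j = i + 1;

		while (j < n && segs[j].va == va + len &&
		       segs[j].iova == iova + len) {
			len += segs[j].len;
			j++;
		}
		if (map(va, iova, len, arg) != 0)
			return -1;
		calls++;
		i = j;
	}
	return calls;
}
```

With three 4 KiB segments where only the first two are adjacent, this issues two map calls instead of three, which matters when the number of DMA mappings is limited.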
Harman Kalra	9058afaa26	eal: check if running in interrupt context Added an API to check if current execution is in interrupt context. This will be helpful to handle nested interrupt cases. Signed-off-by: Harman Kalra <hkalra@marvell.com> Signed-off-by: Sunil Kumar Kori <skori@marvell.com> Reviewed-by: Jerin Jacob <jerinj@marvell.com>	2020-02-06 16:35:40 +01:00
Stephen Hemminger	292f02b58c	mem: fix munmap in error unwind The loop to unwind existing mmaps was only unmapping the first segment and the error paths after mmap() were not doing munmap of the current segment. Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists") Cc: stable@dpdk.org Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2020-02-06 15:39:30 +01:00
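
As an illustration of the error-unwind pattern this fix is about, here is a small sketch that maps several anonymous regions and, on failure, unmaps every region mapped so far. `map_all` is a hypothetical helper, not the EAL function that was patched.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map n anonymous regions of the given lengths into addrs[].
 * On failure, every mapping made so far is undone and all entries
 * touched so far are set to NULL. Returns 0 on success, -1 on failure.
 */
static int
map_all(void **addrs, const size_t *lens, size_t n)
{
	size_t i;

	assert(addrs != NULL && lens != NULL);
	for (i = 0; i < n; i++) {
		addrs[i] = mmap(NULL, lens[i], PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (addrs[i] == MAP_FAILED) {
			addrs[i] = NULL;
			goto unwind;
		}
	}
	return 0;

unwind:
	/* Undo all earlier mappings, not only the first one. */
	while (i-- > 0) {
		munmap(addrs[i], lens[i]);
		addrs[i] = NULL;
	}
	return -1;
}
```

A zero length makes `mmap()` fail with `EINVAL` on Linux, which is a convenient way to exercise the unwind path.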
Takeshi Yoshimura	986f2134c3	vfio: fix mapping failures in ppc64le ppc64le failed when using large physical memory. I found problems in my two commits in the past. In commit e072d16f8920 ("vfio: fix expanding DMA area in ppc64le"), I added a sanity check using a mapped address to resolve an issue around expanding IOMMU window, but this was not enough, since memory allocation can return memory anywhere dependent on memory fragmentation. DPDK may still skip DMA mapping and attempts to unmap non-mapped DMA during expanding IOMMU window. As a result, SPDK apps using large physical memory frequently failed to proceed the communication with NVMe and/or went into an infinite loop. The root cause of the bug was in a gap between memory segments managed by DPDK and firmware-level DMA mapping. DPDK's memory segments don't contain the state of DMA mapping, and so, the memseg_walk cannot determine if an iterated memory segment is mapped or not. This resulted in incorrect DMA maps and unmaps. At this time, I added the code to avoid iterating non-mapped memory segments during DMA mapping. The memseg_walk iterates over memory segments marked as "used", and so, the code sets memory segments that will be mapped or unmapped as "free" transiently. The commit db90b4969e2e ("vfio: retry creating sPAPR DMA window") allows retrying different page levels and sizes to create DMA window. However, this allows page sizes different from hugepage sizes. This inconsistency caused failures at the time of DMA mapping after the window creation. This patch fixes to retry only different page levels. Fixes: e072d16f8920 ("vfio: fix expanding DMA area in ppc64le") Fixes: db90b4969e2e ("vfio: retry creating sPAPR DMA window") Cc: stable@dpdk.org Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com> Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>	2020-02-05 21:57:21 +01:00
Ali Alnubani	e8a17faa5e	eal/linux: fix build when VFIO is disabled The header linux/version.h isn't included when CONFIG_RTE_EAL_VFIO is explicitly disabled. LINUX_VERSION_CODE and KERNEL_VERSION are therefore undefined, causing the build failure: lib/librte_eal/linux/eal/eal.c: In function ‘rte_eal_init’: lib/librte_eal/linux/eal/eal.c:1076:32: error: "LINUX_VERSION_CODE" is not defined, evaluates to 0 [-Werror=undef] Fixes: a0dede62a537 ("eal/linux: remove KNI restriction on IOVA") Cc: stable@dpdk.org Signed-off-by: Ali Alnubani <alialnu@mellanox.com>	2020-01-20 00:08:53 +01:00
David Marchand	aef1d07331	eal/linux: fix build error on RHEL 7.6 Previous fix gives hiccups to gcc on RHEL 7.6: == Build lib/librte_eal/linux/eal CC eal_interrupts.o ...lib/librte_eal/linux/eal/eal_interrupts.c: In function ‘eal_intr_thread_main’: ...lib/librte_eal/linux/eal/eal_interrupts.c:1048:9: error: missing initializer for field ‘events’ of ‘struct epoll_event’ [-Werror=missing-field-initializers] struct epoll_event ev = { }; ^ In file included from ...lib/librte_eal/linux/eal/eal_interrupts.c:15:0: /usr/include/sys/epoll.h:89:12: note: ‘events’ declared here uint32_t events; /* Epoll events */ ^ ...lib/librte_eal/linux/eal/eal_interrupts.c: At top level: cc1: error: unrecognized command line option "-Wno-address-of-packed-member" [-Werror] cc1: all warnings being treated as errors Fixes: e0ab8020ac2a ("eal/linux: fix uninitialized data valgrind warning") Cc: stable@dpdk.org Reported-by: Andrew Rybchenko <arybchenko@solarflare.com> Signed-off-by: David Marchand <david.marchand@redhat.com>	2019-12-04 14:22:02 +01:00
Stephen Hemminger	e0ab8020ac	eal/linux: fix uninitialized data valgrind warning Valgrind reports that eal interrupt thread is calling epoll_ctl with uninitialized data. This is a false positive, because the kernel is not going to care about the unused bits in the union but trivial to fix by initializing it. Fixes: af75078fece3 ("first public release") Cc: stable@dpdk.org Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Marchand <david.marchand@redhat.com>	2019-12-04 10:05:05 +01:00
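
One portable way to hand `epoll_ctl()` fully initialized data, while avoiding the empty-brace initializer that older compilers reject (see the RHEL 7.6 follow-up above), is to zero the structure with `memset()`. The sketch below is an assumption about how such code could be written; `watch_fd` is a hypothetical helper and this is not necessarily what the EAL interrupt thread does.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Register fd for read events on the epoll instance epfd.
 * The event structure is zeroed first so that no uninitialized bytes
 * of the data union are passed to the kernel.
 * Returns 0 on success, -1 on error (as epoll_ctl does).
 */
static int
watch_fd(int epfd, int fd)
{
	struct epoll_event ev;

	assert(epfd >= 0 && fd >= 0);
	memset(&ev, 0, sizeof(ev));
	ev.events = EPOLLIN;
	ev.data.fd = fd;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```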
Ferruh Yigit	de480bbf13	kni: fix build with Linux 4.9.x The 'get_user_pages_remote()' API is updated in kernel 4.10.0 [1], but the check added as > 4.9.0, this logic is broken for kernels 4.9.x, because they justify > 4.9.0 check but have the old API. Fixing the check as >= 4.10.0 [1] commit 5b56d49fc31d ("mm: add locked parameter to get_user_pages_remote()") Fixes: d965af9e8ae1 ("kni: increase kernel version requirement for VA") Reported-by: Andrew Rybchenko <arybchenko@solarflare.com> Suggested-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com> Tested-by: Andrew Rybchenko <arybchenko@solarflare.com> Reviewed-by: David Marchand <david.marchand@redhat.com>	2019-11-28 14:48:24 +01:00
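
The off-by-one in the version check is easy to see with the arithmetic written out. The sketch below uses a local `KVER` macro that mirrors the classic `KERNEL_VERSION(a, b, c)` encoding from `linux/version.h`; `old_check` and `new_check` are hypothetical names for the two comparisons discussed in the commit.

```c
#include <assert.h>

/* Same shape as the classic KERNEL_VERSION macro: a.b.c packed into an int. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Broken check: any 4.9.x with x > 0 compares greater than 4.9.0. */
static int
old_check(int running)
{
	return running > KVER(4, 9, 0);
}

/* Fixed check: only 4.10.0 and later select the new API. */
static int
new_check(int running)
{
	return running >= KVER(4, 10, 0);
}
```

For a 4.9.30 kernel, `old_check` returns 1 (wrongly selecting the newer API) while `new_check` returns 0.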
Ferruh Yigit	d965af9e8a	kni: increase kernel version requirement for VA A build error reported related to the selected 'get_user_pages_remote()' kernel API: .../kernel/linux/kni/kni_dev.h:113:8: error: too few arguments to function ‘get_user_pages_remote’ ret = get_user_pages_remote(tsk, tsk->mm, iova, 1 ^~~~~~~~~~~~~~~~~~~~~ Currently there are three versions of the 'get_user_pages_remote()' supported, based on kernel version < 4.9, = 4.9, > 4.9. These version based checks are not working fine with the distro kernels which is the cause of reported build error. The error reported by the kernel version 4.8, but it is using API defined in > 4.9. To be able to take control of this, and possible more, related build error, increasing the minimum supported kernel version for iova=va with KNI to kernel version 4.9. This leaves us with single version of the kernel API and more manageable. Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com> Reviewed-by: David Marchand <david.marchand@redhat.com>	2019-11-21 00:18:02 +01:00
Anatoly Burakov	fbaf943887	build: remove individual library versions Since the library versioning for both stable and experimental ABI's is now managed globally, the LIBABIVER and version variables no longer serve any useful purpose, and can be removed. The replacement in Makefiles was done using the following regex: ^(#.\n)?LIBABIVER\s:=\s\d+\n(\s\n)? (LIBABIVER := numbers, optionally preceded by a comment and optionally succeeded by an empty line) The replacement for meson files was done using the following regex: ^(#.\n)?version\s=\s\d+\n(\s\n)? (version = numbers, optionally preceded by a comment and optionally succeeded by an empty line) [David]: those variables are manually removed for the files: - drivers/common/qat/Makefile - lib/librte_eal/meson.build [David]: the LIBABIVER is restored for the external ethtool example library. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2019-11-20 23:05:39 +01:00
Kevin Traynor	0411d61fa9	lib: fix log typos Fix these as they are user visible. Found with codespell. Fixes: bacaa2754017 ("eal: add channel for multi-process communication") Fixes: f05e26051c15 ("eal: add IPC asynchronous request") Fixes: 0cbce3a167f1 ("vfio: skip DMA map failure if already mapped") Fixes: 445c6528b55f ("power: common interface for guest and host") Fixes: e6c6dc0f96c8 ("power: add p-state driver compatibility") Fixes: 8f972312b8f4 ("vhost: support vhost-user") Cc: stable@dpdk.org Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com>	2019-11-19 22:03:27 +01:00
Michael Pfeiffer	b8a0415008	kni: reduce interface name size The name in rte_kni_device_info is passed to the kernel, which allows interface names with at most 16 bytes (IFNAMSIZ). rte_kni_alloc with a longer name currently trigger a kernel BUG in alloc_netdev_mqs in net/core/dev.c. Reduce RTE_KNI_NAMESIZE to prevent this situation. Signed-off-by: Michael Pfeiffer <michael.pfeiffer@tu-ilmenau.de> Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>	2019-11-19 22:00:32 +01:00
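
A caller-side guard for the limit mentioned above could look like the following sketch. It assumes `IFNAMSIZ` from `<net/if.h>` (16 on Linux, including the terminating NUL); `ifname_fits` is a hypothetical helper and is not part of the KNI API.

```c
#define _DEFAULT_SOURCE /* glibc exposes IFNAMSIZ under this feature macro */
#include <assert.h>
#include <net/if.h> /* IFNAMSIZ */
#include <string.h>

/*
 * Return 1 if name, including its terminating NUL, fits in a kernel
 * interface name buffer of IFNAMSIZ bytes, otherwise 0.
 */
static int
ifname_fits(const char *name)
{
	assert(name != NULL);
	return strlen(name) < IFNAMSIZ;
}
```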
Vamsi Attunuru	a0dede62a5	eal/linux: remove KNI restriction on IOVA Now that KNI supports VA (with kernel versions starting 4.6.0), we can accept IOVA as VA, but KNI must be configured for this. Pass iova_mode when creating KNI netdevs. So far, IOVA detection policy forced IOVA as PA when KNI is loaded, whatever the buses IOVA requirements were. We can now use IOVA as VA, but this comes with a cost in KNI. When no constraint is expressed by the buses, keep the current behavior of choosing PA. Note: this change supposes that dpdk is built on the same kernel than the target system kernel; no objection has been expressed on this topic. Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Reviewed-by: Jerin Jacob <jerinj@marvell.com>	2019-11-18 16:00:51 +01:00
Vamsi Attunuru	e73831dc6c	kni: support userspace VA Patch adds support for kernel module to work in IOVA = VA mode by providing address translation routines to convert userspace VA to kernel VA. KNI performance using PA is not changed by this patch. But comparing KNI using PA to KNI using VA, the latter will have lower performance due to the cost of the added translation. This translation is implemented only with kernel versions starting 4.6.0. Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com> Reviewed-by: Jerin Jacob <jerinj@marvell.com>	2019-11-18 16:00:51 +01:00
Anatoly Burakov	47c45a4df6	vfio: fix DMA mapping of external heaps Currently, externally created heaps are supposed to be automatically mapped for VFIO DMA by EAL, however they only do so if, at the time of heap creation, VFIO is initialized and has at least one device available. If no devices are available at the time of heap creation (or if devices were available, but were since hot-unplugged, thus dropping all VFIO container mappings), then VFIO mapping code would have skipped over externally allocated heaps. The fix is two-fold. First, we allow externally allocated memory segments to be marked as "heap" segments. This allows us to distinguish between external memory segments that were created via heap API, from those that were created via rte_extmem_register() API. Then, we fix the VFIO code to only skip non-heap external segments. Also, since external heaps are not guaranteed to have valid IOVA addresses, we will skip those which have invalid IOVA addresses as well. Fixes: 0f526d674f8e ("malloc: separate creating memseg list and malloc heap") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Rajesh Ravi <rajesh.ravi@broadcom.com> Acked-by: David Marchand <david.marchand@redhat.com>	2019-11-07 17:46:43 +01:00
Anatoly Burakov	b14d192ca1	vfio: remove deprecated DMA mapping functions The rte_vfio_dma_map/unmap API's have been marked as deprecated in release 19.05. Remove them. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2019-11-07 17:46:43 +01:00
Anatoly Burakov	9362945d7e	vfio: fix DMA mapping with default container When requesting DMA mapping to default container, we are meant to supply the RTE_VFIO_DEFAULT_CONTAINER_FD value, however this is not handled correctly by get_vfio_cfg_by_container_fd(), because it only looks at actual fd values and does not check for this special case. Fix it to return default container if the fd requested is the special RTE_VFIO_DEFAULT_CONTAINER_FD value. Fixes: 4106d89a18f8 ("vfio: allow DMA map to the default container") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Acked-by: Shahaf Shuler <shahafs@mellanox.com>	2019-11-07 17:46:43 +01:00
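
The special-casing described above amounts to checking for the sentinel before searching by real fd values. The sketch below is illustrative only: `struct container_cfg`, `find_cfg`, and the sentinel value `-1` are assumptions made for the example, not the actual `get_vfio_cfg_by_container_fd()` implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel playing the role of RTE_VFIO_DEFAULT_CONTAINER_FD (value assumed). */
#define DEFAULT_CONTAINER_FD (-1)

/* Hypothetical per-container state. */
struct container_cfg {
	int fd;      /* real container file descriptor */
	int enabled; /* non-zero if this slot is in use */
};

/*
 * Look up a container config by fd. The sentinel (or the default
 * container's own fd) resolves to the default config; anything else
 * is searched among the enabled non-default configs.
 */
static struct container_cfg *
find_cfg(struct container_cfg *def, struct container_cfg *cfgs, size_t n,
	 int fd)
{
	size_t i;

	assert(def != NULL);
	if (fd == DEFAULT_CONTAINER_FD || fd == def->fd)
		return def;
	for (i = 0; i < n; i++) {
		if (cfgs[i].enabled && cfgs[i].fd == fd)
			return &cfgs[i];
	}
	return NULL;
}
```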
Igor Ryzhov	49e7e2dee3	kni: add ability to set min/max MTU Starting with kernel version 4.10, there are new min/max MTU values in net_device structure, which are set to ETH_MIN_MTU and ETH_DATA_LEN by default. We should be able to change these values to allow MTU more than 1500 to be set on KNI. Signed-off-by: Igor Ryzhov <iryzhov@nfware.com> Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>	2019-10-27 11:07:43 +01:00
David Marchand	6614072791	eal: factorize lcore role code This code belongs to the lcore API, move the prototype to the right header, then factorize the code into the common code. Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Thomas Monjalon <thomas@monjalon.net>	2019-10-27 10:41:08 +01:00
Stephen Hemminger	65661351ca	eal: make lcore config private The internal structure of lcore_config does not need to be part of visible API/ABI. Make it private to EAL. Rearrange the structure so it takes less memory (and cache footprint). Since we change the ABI, bump the library version. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Thomas Monjalon <thomas@monjalon.net> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-10-27 10:35:11 +01:00
Anatoly Burakov	6d3f9917ff	eal: fix memory config allocation for multi-process Currently, mem config will be mapped without using the virtual area reservation infrastructure, which means it will be mapped at an arbitrary location. This may cause failures to map the shared config in secondary process due to things like PCI whitelist arguments allocating memory in a space where the primary has allocated the shared mem config. Fix this by using virtual area reservation to reserve space for the mem config, thereby avoiding the problem and reserving the shared config (hopefully) far away from any normal memory allocations. Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-10-26 18:03:26 +02:00
Anatoly Burakov	6080796f65	mem: make base address hint OS specific Not all OS's follow Linux's memory layout, which may lead to problems following the suggested common address hint absent of a base-virtaddr flag. Make this address hint OS-specific. Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-10-26 18:03:24 +02:00
Pallavi Kadam	7e708cd8c6	eal: move CPU operations to OS specific headers Moving RTE_CPU* definitions from the common code to the Linux and FreeBSD rte_os.h file to avoid #ifdef clutter. Signed-off-by: Pallavi Kadam <pallavi.kadam@intel.com> Signed-off-by: Antara Ganesh Kolar <antara.ganesh.kolar@intel.com> Reviewed-by: Ranjit Menon <ranjit.menon@intel.com> Reviewed-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: David Marchand <david.marchand@redhat.com>	2019-10-26 17:06:41 +02:00
Anatoly Burakov	3fe4bced1b	eal: use define instead of raw option name We are using '--base-virtaddr' in a few places. We have a define for that, so use it instead. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: David Marchand <david.marchand@redhat.com>	2019-10-25 11:35:10 +02:00
Anatoly Burakov	8f29a60764	eal/freebsd: support option --base-virtaddr According to our docs, only Linuxapp supports base-virtaddr option. That is, strictly speaking, not true because most of the things that are attempting to respect base-virtaddr are in common files, so FreeBSD already mostly supports this option in practice. This commit fixes the remaining bits to explicitly support base-virtaddr option, and moves the arg parsing from EAL to common options parsing code. Documentation is also updated to reflect that all platforms now support base-virtaddr. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: David Marchand <david.marchand@redhat.com>	2019-10-25 11:17:29 +02:00
David Christensen	ed5d3d5cdb	eal/linux: restore specific hugepage ordering for ppc An ifdef present in eal_memory.c references "RTE_ARCH_PPC64" when it should actually use "RTE_ARCH_PPC_64". Simple testing revealed that both the PPC_64 and non-PPC_64 versions of the code involved work, but the PPC_64 version of the code is retained to be consistent with other instances in the same file where mmapped memory is accessed in reverse order on Power platforms. Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists") Cc: stable@dpdk.org Signed-off-by: David Christensen <drc@linux.vnet.ibm.com> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>	2019-10-24 14:15:10 +02:00
Jim Harris	c1077933d4	timer: remove useless check on x86 TSC reliability This code was added 7+ years ago in commit fb022b85bae4 ("timer: check TSC reliability") presumably when variant TSCs were still somewhat common. But this code doesn't do anything except print a warning, and the warning doesn't give any kind of advice to the user, so let's just remove it. While the warning has no functional meaning, the /proc/cpuinfo parsing consumes a non-trivial amount of time which is especially noticeable in secondary processes. On my test system, it consumes 21ms out of the 66ms total execution time for rte_eal_init() in a secondary process. Signed-off-by: Jim Harris <james.r.harris@intel.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2019-10-17 09:47:42 +02:00
Xiaolong Ye	b34801d1aa	kni: support allmulticast mode set This patch adds support to allow users enable/disable allmulticast mode for kni interface. This requirement comes from bugzilla 312, more details can refer to: https://bugs.dpdk.org/show_bug.cgi?id=312 Bugzilla ID: 312 Signed-off-by: Xiaolong Ye <xiaolong.ye@intel.com> Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>	2019-10-15 21:16:32 +02:00
Arnon Warshavsky	75dbb45f28	eal: fix mapping leak in secondary process Have rte_eal_config_reattach clean up the mapped address which is a valid address but not the one intended. Coverity issue: 343439 Fixes: 4e8854ae89fa ("eal: do not panic on shared memory init") Fixes: b149a7064261 ("eal/freebsd: add config reattach in secondary process") Cc: stable@dpdk.org Signed-off-by: Arnon Warshavsky <arnon@qwilt.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com> Reviewed-by: David Marchand <david.marchand@redhat.com>	2019-10-15 20:37:11 +02:00
Jim Harris	773a860aef	vfio: fix leak with multiprocess The code checks both rte_mp_request_sync() return code and that the number of messages in the reply equals 1. If rte_mp_request_sync() succeeds but there was more than one message, those messages would get leaked. Found via code review by Anatoly Burakov of patches that used the vhost code as a template for using rte_mp_request_sync(). Fixes: 83a73c5fef66 ("vfio: use generic multi-process channel") Cc: stable@dpdk.org Reported-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Jim Harris <james.r.harris@intel.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-10-15 20:36:58 +02:00
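
The leak pattern here is generic: if the reply buffer is only freed on the "exactly one message" path, any other count leaks it. A minimal sketch of the fixed shape follows, using a hypothetical `struct reply` with `int` payloads in place of the real multi-process reply type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reply container: nb_received messages in a heap array. */
struct reply {
	int nb_received;
	int *msgs;
};

/*
 * Accept the reply only if it holds exactly one message, but release
 * the message array on every path so nothing leaks when the count is
 * unexpected. Returns 0 and stores the message in *out on success,
 * -1 otherwise.
 */
static int
consume_single_reply(struct reply *r, int *out)
{
	int ret = -1;

	assert(r != NULL && out != NULL);
	if (r->nb_received == 1 && r->msgs != NULL) {
		*out = r->msgs[0];
		ret = 0;
	}
	free(r->msgs);
	r->msgs = NULL;
	return ret;
}
```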
David Marchand	8ac3591694	remove useless include of EAL memory config header Restrict this header inclusion to its real users. Fixes: 028669bc9f0d ("eal: hide shared memory config") Cc: stable@dpdk.org Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-10-09 10:22:24 +02:00
Anatoly Burakov	79a0bbe5b6	eal: pick IOVA as PA if IOMMU is not available When IOMMU is not available, /sys/kernel/iommu_groups will not be populated. This is happening since at least 3.6 when VFIO support was added. If the directory is empty, EAL should not pick IOVA as VA as the default IOVA mode. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com> Tested-by: Jerin Jacob <jerinj@marvell.com> Reviewed-by: Jerin Jacob <jerinj@marvell.com> Reviewed-by: David Marchand <david.marchand@redhat.com>	2019-07-30 10:09:13 +02:00
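
Detecting an unpopulated directory such as `/sys/kernel/iommu_groups` comes down to checking for any entry other than `.` and `..`. The sketch below shows one way to do that with POSIX `opendir()`/`readdir()`; `dir_has_entries` is a hypothetical helper, and how EAL actually performs the check is not shown in the commit message.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/*
 * Return 1 if the directory has at least one entry besides "." and "..",
 * 0 if it is empty, and -1 if it cannot be opened.
 */
static int
dir_has_entries(const char *path)
{
	DIR *d;
	struct dirent *e;
	int found = 0;

	assert(path != NULL);
	d = opendir(path);
	if (d == NULL)
		return -1;
	while ((e = readdir(d)) != NULL) {
		if (strcmp(e->d_name, ".") != 0 &&
		    strcmp(e->d_name, "..") != 0) {
			found = 1;
			break;
		}
	}
	closedir(d);
	return found;
}
```

A caller could treat both 0 and -1 for `/sys/kernel/iommu_groups` as "no IOMMU available" and fall back to IOVA as PA.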
Anatoly Burakov	78a6d7ed19	vfio: use contiguous mapping for IOVA as VA mode When using IOVA as VA mode, there is no need to map segments page by page. This normally isn't a problem, but it becomes one when attempting to use DPDK in no-huge mode, where VFIO subsystem simply runs out of space to store mappings. Fix this for x86 by triggering different callbacks based on whether IOVA as VA mode is enabled. Fixes: 73a639085938 ("vfio: allow to map other memory regions") Cc: stable@dpdk.org Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Tested-by: Andrius Sirvys <andrius.sirvys@intel.com>	2019-07-23 20:47:14 +02:00
Nithin Dabilpuram	a159730c2f	eal: add ack interrupt API Add new ack interrupt API to avoid using VFIO_IRQ_SET_ACTION_TRIGGER(rte_intr_enable()) for acking interrupt purpose for VFIO based interrupt handlers. This implementation is specific to Linux. Using rte_intr_enable() for acking interrupt has below issues * Time consuming to do for every interrupt received as it will free_irq() followed by request_irq() and all other initializations * A race condition because of a window between free_irq() and request_irq() with packet reception still on and device still enabled and would throw warning messages like below. [158764.159833] do_IRQ: 9.34 No irq handler for vector In this patch, rte_intr_ack() is a no-op for VFIO_MSIX/VFIO_MSI interrupts as they are edge triggered and kernel would not mask the interrupt before delivering the event to userspace and we don't need to ack. Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Tested-by: Shahed Shaikh <shshaikh@marvell.com> Signed-off-by: David Marchand <david.marchand@redhat.com>	2019-07-23 12:00:22 +02:00
Nithin Dabilpuram	33543fb3b6	vfio: revert interrupt eventfd setup at probe This reverts commit 89aac60e0be9ed95a87b16e3595f102f9faaffb4. "vfio: fix interrupts race condition" The above mentioned commit moves the interrupt's eventfd setup to probe time but only enables one interrupt for all types of interrupt handles i.e VFIO_MSI, VFIO_LEGACY, VFIO_MSIX, UIO. It works fine with default case but breaks below cases specifically for MSIX based interrupt handles. * Applications like l3fwd-power that request rxq interrupts while ethdev setup. * Drivers that need > 1 MSIx interrupts to be configured for functionality to work. VFIO PCI for MSIx expects all the possible vectors to be setup up when using VFIO_IRQ_SET_ACTION_TRIGGER so that they can be allocated from kernel pci subsystem. Only way to increase the number of vectors later is first free all by using VFIO_IRQ_SET_DATA_NONE with action trigger and then enable new vector count. Above commit changes the behavior of rte_intr_[enable|disable] to only mask and unmask unlike earlier behavior and thereby breaking above two scenarios. Fixes: 89aac60e0be9 ("vfio: fix interrupts race condition") Cc: stable@dpdk.org Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Tested-by: Stephen Hemminger <stephen@networkplumber.org> Tested-by: Shahed Shaikh <shshaikh@marvell.com> Tested-by: Lei Yao <lei.a.yao@intel.com> Acked-by: David Marchand <david.marchand@redhat.com>	2019-07-23 12:00:14 +02:00
Jerin Jacob	bbe29a9bd7	eal/linux: select IOVA as VA mode for default case When bus layer reports the preferred mode as RTE_IOVA_DC then select the RTE_IOVA_VA mode: - All drivers work in RTE_IOVA_VA mode, irrespective of physical address availability. - By default, a mempool asks for IOVA-contiguous memory using RTE_MEMZONE_IOVA_CONTIG. This is slow in RTE_IOVA_PA mode and it may affect the application boot time. Signed-off-by: Jerin Jacob <jerinj@marvell.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: David Marchand <david.marchand@redhat.com>	2019-07-22 17:47:27 +02:00
Takeshi Yoshimura	e072d16f89	vfio: fix expanding DMA area in ppc64le In ppc64le, expanding DMA areas always fail because we cannot remove a DMA window. As a result, we cannot allocate more than one memseg in ppc64le. This is because vfio_spapr_dma_mem_map() doesn't unmap all the mapped DMA before removing the window. This patch fixes this incorrect behavior. I also fixed the order of ioctl for unregister and unmap. The ioctl for unregister sometimes report device busy errors due to the existence of mapped area. Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com> Acked-by: David Christensen <drc@linux.vnet.ibm.com>	2019-07-16 12:56:03 +02:00
Yangchao Zhou	5eb1708ec1	kni: fix kernel crash with multi-segments va2pa depends on the physical address and virtual address offset of current mbuf. It may get the wrong physical address of next mbuf which allocated in another hugepage segment. In rte_mempool_populate_default(), trying to allocate whole block of contiguous memory could be failed. Then, it would reserve memory in several memzones that have different physical address and virtual address offsets. The rte_mempool_populate_default() is used by rte_pktmbuf_pool_create(). Fixes: 8451269e6d7b ("kni: remove continuous memory restriction") Cc: stable@dpdk.org Signed-off-by: Yangchao Zhou <zhouyates@gmail.com> Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>	2019-07-15 22:48:20 +02:00
Takeshi Yoshimura	22a55d2eb6	vfio: fix build on Linux < 4.2 The commit db90b4969e2e ("vfio: retry creating sPAPR DMA window") introduced a build breakage on old Linux. Linux <4.2 does not define ddw in struct vfio_iommu_spapr_tce_info. Without ddw, we cannot change window size and so should give up the creation. I just excluded the retrying code if ddw is not supported. Fixes: db90b4969e2e ("vfio: retry creating sPAPR DMA window") Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com> Tested-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-07-11 11:28:20 +02:00
David Marchand	89aac60e0b	vfio: fix interrupts race condition Populating the eventfd in rte_intr_enable in each request to vfio triggers a reconfiguration of the interrupt handler on the kernel side. The problem is that rte_intr_enable is often used to re-enable masked interrupts from drivers interrupt handlers. This reconfiguration leaves a window during which a device could send an interrupt and then the kernel logs this (unsolicited from the kernel point of view) interrupt: [158764.159833] do_IRQ: 9.34 No irq handler for vector VFIO api makes it possible to set the fd at setup time. Make use of this and then we only need to ask for masking/unmasking legacy interrupts and we have nothing to do for MSI/MSIX. "rxtx" interrupts are left untouched but are most likely subject to the same issue. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1654824 Fixes: 5c782b3928b8 ("vfio: interrupts") Cc: stable@dpdk.org Signed-off-by: David Marchand <david.marchand@redhat.com> Tested-by: Shahed Shaikh <shshaikh@marvell.com>	2019-07-10 18:53:47 +02:00
Takeshi Yoshimura	db90b4969e	vfio: retry creating sPAPR DMA window sPAPR allows only page_shift from VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl. However, Linux 4.17 or before returns incorrect page_shift for Power9. I added the code for retrying creation of sPAPR DMA window. Signed-off-by: Takeshi Yoshimura <tyos@jp.ibm.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-07-07 23:20:23 +02:00
Anatoly Burakov	ae3b4bc4fb	eal: prevent different primary/secondary process versions Currently, nothing stops DPDK to attempt to run primary and secondary processes while having different versions. This can lead to all sorts of weird behavior and makes it harder to maintain compatibility without breaking ABI every once in a while. Fix it by explicitly disallowing running different DPDK versions as primary and secondary processes. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: David Marchand <david.marchand@redhat.com>	2019-07-06 10:32:40 +02:00
Anatoly Burakov	b39fcd9569	eal: unify internal config init Currently, each EAL will update internal/shared config in their own way at init, resulting in needless duplication of code and OS-dependent behavior. Move the functions to a common file and add missing FreeBSD steps. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Marchand <david.marchand@redhat.com>	2019-07-06 10:32:40 +02:00
Anatoly Burakov	00299d3960	eal: unify wait for complete init Currently, mcfg completion function exists in two independent implementations doing the same thing, which is bug prone. Unify the two functions and move them into one place. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Marchand <david.marchand@redhat.com>	2019-07-06 10:32:40 +02:00
Anatoly Burakov	a08a5dd20e	eal: uninline wait for complete init Currently, the function to wait until config completion is static inline for no reason. Move its implementation to an EAL common file. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Marchand <david.marchand@redhat.com>	2019-07-06 10:32:40 +02:00
Anatoly Burakov	028669bc9f	eal: hide shared memory config Now that everything that has ever accessed the shared memory config is doing so through the public API's, we can make it internal. Since we're removing quite a few headers from rte_eal_memconfig.h, we need to add them back in places where this header is used. This bumps the ABI, so also change all build files and make update documentation. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Marchand <david.marchand@redhat.com>	2019-07-06 10:32:34 +02:00
Anatoly Burakov	76f80881ef	mem: add API to lock/unlock memory hotplug Currently, the memory hotplug is locked automatically by all memory-related _walk() functions, but sometimes locking the memory subsystem outside of them is needed. There is no public API to do that, so it creates a dependency on shared memory config to be public. Fix this by introducing a new API to lock/unlock the memory hotplug subsystem. Create a new common file for all things mem config, and a new API namespace rte_mcfg_*, and search-and-replace all usages of the locks with the new API. Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Marchand <david.marchand@redhat.com>	2019-07-05 22:12:40 +02:00
Ben Walker	c2361bab70	eal: compute IOVA mode based on PA availability Currently, if the bus selects IOVA as PA, the memory init can fail when lacking access to physical addresses. This can be quite hard for normal users to understand what is wrong since this is the default behavior. Catch this situation earlier in eal init by validating physical addresses availability, or select IOVA when no clear preference had been expressed. The bus code is changed so that it reports when it does not care about the IOVA mode and let the eal init decide. In Linux implementation, rework rte_eal_using_phys_addrs() so that it can be called earlier but still avoid a circular dependency with rte_mem_virt2phys(). In FreeBSD implementation, rte_eal_using_phys_addrs() always returns false, so the detection part is left as is. If librte_kni is compiled in and the KNI kmod is loaded, - if the buses requested VA, force to PA if physical addresses are available as it was done before, - else, keep iova as VA, KNI init will fail later. Signed-off-by: Ben Walker <benjamin.walker@intel.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>	2019-07-05 16:55:44 +02:00
David Marchand	cfe3aeb170	remove experimental tags from all symbol definitions We had some inconsistencies between functions prototypes and actual definitions. Let's avoid this by only adding the experimental tag to the prototypes. Tests with gcc and clang show it is enough. git grep -l __rte_experimental |grep \.c$ |while read file; do sed -i -e '/^__rte_experimental$/d' $file; sed -i -e 's/ __rte_experimental//' $file; sed -i -e 's/__rte_experimental //' $file; done Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com> Acked-by: Neil Horman <nhorman@tuxdriver.com>	2019-06-29 19:04:43 +02:00

85 Commits