This patch postpones ring address translations and checks, as
addresses sent by the master should not be interpreted as long as
the ring is not started and enabled [0].
When protocol features are not negotiated, the ring is started in
the enabled state, so the address translations are postponed to
vhost_user_set_vring_kick().
Otherwise, they are postponed to the time the ring is enabled, in
vhost_user_set_vring_enable().
[0]: http://lists.nongnu.org/archive/html/qemu-devel/2017-05/msg04355.html
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
numa_realloc() reallocates the virtio_net device structure and
updates the vhost_devices[] table with the new pointer if the rings
are allocated on a different NUMA node.
The problem is that vhost_user_msg_handler() still dereferences the
old pointer afterward.
This patch prevents this by fetching the dev pointer again from
vhost_devices[] after the messages have been handled.
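A sketch of the idea (get_device() and vhost_devices[] are existing
names; the surrounding handler code is simplified):

    /* At the end of vhost_user_msg_handler(), after the request has
     * been processed: numa_realloc(), triggered while handling vring
     * messages, may have freed the old virtio_net and stored a new
     * pointer in vhost_devices[], so reload it before any further
     * dereference. */
    dev = get_device(vid);
    if (dev == NULL)
            return -1;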
Fixes: af295ad469 ("vhost: realloc device and queues to same numa node as vring desc")
Cc: stable@dpdk.org
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
When VHOST_USER_F_PROTOCOL_FEATURES is negotiated, the ring is not
enabled when started, but is enabled later through the dedicated
VHOST_USER_SET_VRING_ENABLE request.
When it is not negotiated, the ring is started in the enabled state,
at VHOST_USER_SET_VRING_KICK request time.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Replace rte_vhost_gpa_to_vva() calls with vhost_iova_to_vva(), which
also requires passing the length to map and the required access
permissions.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
This patch introduces the vhost_iova_to_vva() function to translate
guest I/O virtual addresses into backend virtual addresses.
When IOMMU is enabled, the IOTLB cache is queried to get the
translation. If it is missing from the IOTLB cache, an IOTLB_MISS
request is sent to QEMU, and the IOTLB cache is queried again on
IOTLB event notification.
When IOMMU is disabled, the passed address is a guest physical
address, so the legacy rte_vhost_gpa_to_vva() API is used.
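A simplified sketch of the lookup flow (helper names such as
vhost_user_iotlb_cache_find() and vhost_user_iotlb_miss() follow the
new IOTLB code; error handling is omitted):

    static uint64_t
    iova_to_vva_sketch(struct virtio_net *dev, struct vhost_virtqueue *vq,
            uint64_t iova, uint64_t size, uint8_t perm)
    {
        uint64_t vva;

        /* Without a vIOMMU, the IOVA is a guest physical address. */
        if (!(dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
            return rte_vhost_gpa_to_vva(dev->mem, iova);

        /* Query the per-virtqueue IOTLB cache (the real code also checks
         * that the whole requested size is covered). */
        vva = vhost_user_iotlb_cache_find(vq, iova, &size, perm);
        if (vva)
            return vva;

        /* Miss: ask QEMU for the translation; the caller retries once
         * the IOTLB update notification is received. */
        vhost_user_iotlb_miss(dev, iova, perm);
        return 0;
    }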
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
The vhost-user device IOTLB protocol extension introduces the
VHOST_USER_IOTLB message type. The associated payload is the
vhost_iotlb_msg struct defined in the kernel, which in this case can
be either an IOTLB update or an invalidate message.
On IOTLB update, the virtqueues get notified of a new entry.
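A sketch of how the vhost_iotlb_msg payload could be dispatched (the
cache helpers are the IOTLB cache names; this is not the verbatim
handler):

    struct vhost_iotlb_msg *imsg = &msg->payload.iotlb;
    uint32_t i;

    switch (imsg->type) {
    case VHOST_IOTLB_UPDATE:
        /* The real handler first converts imsg->uaddr (a QEMU virtual
         * address) into a backend virtual address before inserting. */
        for (i = 0; i < dev->nr_vring; i++)
            vhost_user_iotlb_cache_insert(dev->virtqueue[i], imsg->iova,
                    imsg->uaddr, imsg->size, imsg->perm);
        break;
    case VHOST_IOTLB_INVALIDATE:
        for (i = 0; i < dev->nr_vring; i++)
            vhost_user_iotlb_cache_remove(dev->virtqueue[i],
                    imsg->iova, imsg->size);
        break;
    default:
        return -1;
    }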
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
The per-virtqueue IOTLB cache init is done at virtqueue
init time. init_vring_queue() now takes the vring id as a parameter,
so that the IOTLB cache mempool name can be generated.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
In order to be able to handle other ports or queues while waiting
for an IOTLB miss reply, a pending list is created so that the waiter
can return and restart later by sending the miss request again.
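A sketch of the pending-list usage on the miss path (function names
follow the IOTLB code; the real flow also handles partial hits):

    /* Only send a miss request if none is already pending for this
     * IOVA/permission, then return so other queues can be serviced. */
    if (!vhost_user_iotlb_pending_miss(vq, iova, perm)) {
        vhost_user_iotlb_pending_insert(vq, iova, perm);
        vhost_user_iotlb_miss(dev, iova, perm);
    }
    return 0; /* the waiter restarts later and resends the miss if needed */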
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
These defines and enums have been introduced in upstream kernel v4.8,
and backported to RHEL 7.4.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Currently, only QEMU sends requests and the backend sends
replies. In some cases, the backend may need to send
requests to QEMU, such as IOTLB miss events when IOMMU is
supported.
This patch introduces a new channel for such requests.
QEMU sends a file descriptor of a new socket using
VHOST_USER_SET_SLAVE_REQ_FD.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
send_vhost_message() is currently only used to send
replies, so it modifies the message flags to prepare the
reply.
With the upcoming channel for backend-initiated requests,
this function can also be used to send requests.
This patch introduces a new send_vhost_reply() that
performs the message flag modifications, and makes
send_vhost_message() generic.
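A sketch of the split (flag names follow vhost_user.h; the exact body
is assumed):

    static int
    send_vhost_reply(int sockfd, struct VhostUserMsg *msg)
    {
        /* Only replies carry the version and reply flags; requests sent
         * through the generic helper keep their flags untouched. */
        msg->flags &= ~VHOST_USER_VERSION_MASK;
        msg->flags &= ~VHOST_USER_NEED_REPLY;
        msg->flags |= VHOST_USER_VERSION;
        msg->flags |= VHOST_USER_REPLY_MASK;

        return send_vhost_message(sockfd, msg);
    }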
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
In the non-mergeable receive case, when the copy_mbuf_to_desc()
call fails, the packet is skipped, the corresponding used element's
len field is set to the vnet header size, and processing continues
with the next packet/descriptor. This can be a problem because the
reason for the failure is unknown, and it assumes the descriptor
buffer is large enough.
In the mergeable receive case, when copy_mbuf_to_desc_mergeable()
fails, the packet burst is simply stopped.
This patch makes the non-mergeable error path behave like the
mergeable one, as it seems the safest way. Doing so will also
simplify the handling of pending IOTLB miss requests.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
This reverts commit 04d8122796 ("vhost: workaround MQ fails to
startup").
As agreed when this workaround was introduced, it can be reverted
now that QEMU v2.10, which fixes the issue, is out.
The reply-ack feature is required for vhost-user IOMMU support.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
The unscrutinized value may be incorrectly assumed to be within a certain
range by later operations.
In vhost_user_read, an unscrutinized value from an untrusted source is
used in a trusted context: the value of sz_payload may be harmful, and we
need to limit it to the maximum payload size.
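The check boils down to bounding the peer-controlled size before it is
used to read into the fixed-size payload (a sketch; field names assumed
from the message layout):

    sz_payload = msg->size;

    /* msg->size comes from the peer; never trust it to be in range. */
    if ((size_t)sz_payload > sizeof(msg->payload))
        goto fail;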
Coverity issue: 139601
Fixes: 6a84c37e39 ("net/virtio-user: add vhost-user adapter layer")
Cc: stable@dpdk.org
Signed-off-by: Daniel Mrzyglod <danielx.t.mrzyglod@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
The simple Rx handler is selected even if Rx checksum offload is
requested by the application, but this handler does not support
offloads. This results in broken received packets (no checksum flag but
invalid checksum in the mbuf data).
Disable the simple Rx handler in that case.
Fixes: 96cb671193 ("net/virtio: support Rx checksum offload")
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Split use_simple_rxtx into use_simple_rx and use_simple_tx,
and ensure that only use_simple_tx is updated when the txq flags
force the use of the standard Tx handler.
This change is also useful for the next commit (disabling the simple
Rx path when Rx checksum is requested).
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Since commit f27769f796 ("mk: require SSE4.2 support on all x86
platforms"), SSE4.2 is a requirement when compiling on x86 platforms.
We can remove this check in the virtio driver.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
The selection of Rx/Tx handlers is done in several places;
group them in one function, set_rxtx_funcs(), sketched below.
The update of hw->use_simple_rxtx is also rationalized:
- initialized to 1 (prefer the simple path)
- in dev configure or rx/tx queue setup, if something prevents using
the simple path, change it to 0
- in dev start, set the handlers according to hw->use_simple_rxtx
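A simplified sketch of the consolidated selection (the handler names
exist in the driver, but the real function also handles the
mergeable-buffer case):

    static void
    set_rxtx_funcs(struct rte_eth_dev *eth_dev)
    {
        struct virtio_hw *hw = eth_dev->data->dev_private;

        if (hw->use_simple_rxtx) {
            eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;
            eth_dev->tx_pkt_burst = virtio_xmit_pkts_simple;
        } else {
            eth_dev->rx_pkt_burst = virtio_recv_pkts;
            eth_dev->tx_pkt_burst = virtio_xmit_pkts;
        }
    }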
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
In the rx/tx queue setup functions, some code is executed only if
use_simple_rxtx == 1. The value of this variable can change depending on
the offload flags or SSE support. If Rx queue setup is called before Tx
queue setup, it can result in an invalid configuration:
- dev_configure is called: use_simple_rxtx is initialized to 0
- rx queue setup is called: queues are initialized without simple path
support
- tx queue setup is called: use_simple_rxtx switches to 1, and the simple
Rx/Tx handlers are selected
Fix this by postponing part of the Rx/Tx queue initialization to
dev_start(), as was the case in the initial implementation.
Fixes: 48cec290a3 ("net/virtio: move queue configure code to proper place")
Cc: stable@dpdk.org
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
The mbuf->port field was not properly set for the first received
mbufs. Fix this by setting it in virtqueue_enqueue_recv_refill_simple(),
which is used to enqueue the first mbuf in the ring.
The function virtio_rxq_rearm_vec(), which is used to rearm the ring
with new mbufs, is correct and does not need to be updated.
Fixes: cab0461234 ("virtio: fill Rx avail ring with blank mbufs")
Cc: stable@dpdk.org
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
On error, we should log with error level.
Fixes: 9f4f2846ef ("virtio: support vlan filtering")
Fixes: 86d59b2146 ("net/virtio: support LRO")
Fixes: 96cb671193 ("net/virtio: support Rx checksum offload")
Cc: stable@dpdk.org
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
As described in API documentation, the field hw_ip_checksum
requests both L3 and L4 offload.
Fixes: dad1ec72a3 ("doc: document NIC features")
Cc: stable@dpdk.org
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
This reverts
commit 4dab342b75 ("net/virtio: do not falsely claim to do IP checksum").
The description of rxmode->hw_ip_checksum is:
  hw_ip_checksum : 1, /**< IP/UDP/TCP checksum offload enable. */
Despite its name, this field can be set by an application to enable L3
and L4 checksums. In the case of virtio, only the L4 checksum is
supported, and the L3 checksum flags will always be set to "unknown".
Fixes: 4dab342b75 ("net/virtio: do not falsely claim to do IP checksum")
Cc: stable@dpdk.org
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
This reverts
commit 701a64622c ("net/virtio: do not claim to support LRO").
Setting rxmode->enable_lro is a way to tell the host that the guest is
OK to receive TSO packets. From the guest's point of view, it is like
enabling LRO on a physical driver.
Fixes: 701a64622c ("net/virtio: do not claim to support LRO")
Cc: stable@dpdk.org
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
According to the vhost-user spec [0], the client must start the ring
upon receiving a kick (that is, detecting that the file descriptor
is readable) on the descriptor specified by VHOST_USER_SET_VRING_KICK.
The code sends a kick to the rx queue but misses sending a kick for
the tx queue. This patch adds the missing code to comply with the
spec.
[0]: https://fossies.org/linux/qemu/docs/specs/vhost-user.txt
Signed-off-by: Steven Luong <sluong@cisco.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
This patch adaptively batches the small guest memory copies.
By batching the small copies, the efficiency of executing the
memory LOAD instructions can be improved greatly, because the
memory LOAD latency can be effectively hidden by the pipeline.
We saw great performance boosts for small packets in PVP tests.
This patch improves the performance for small packets and
distinguishes packets by size. Although the performance for big
packets does not change, this makes it relatively easy to apply
special optimizations for big packets too.
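A sketch of the batching idea (struct and helper names assumed; the
real code also logs dirty pages for live migration):

    #include <stdint.h>
    #include <rte_memcpy.h>

    struct batch_copy_elem {
        void     *dst;
        void     *src;
        uint32_t  len;
    };

    /* Small copies are queued during descriptor processing and flushed
     * in one tight loop, so the pipeline can overlap the memory LOAD
     * latencies of consecutive copies. */
    static inline void
    do_batch_copy(struct batch_copy_elem *elems, uint16_t count)
    {
        uint16_t i;

        for (i = 0; i < count; i++)
            rte_memcpy(elems[i].dst, elems[i].src, elems[i].len);
    }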
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Use a macro instead of a magic number in order to enhance code
readability.
Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yliu@fridaylinux.org>
Split pci_vfio_map_resource for primary and secondary processes.
Save all relevant mapping data in the primary process to allow
the secondary process to perform mappings.
Signed-off-by: Jonas Pfefferle <jpf@zurich.ibm.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
The DMA window size needs to be big enough to span all memory segments'
physical addresses. We do not need multiple levels of IOMMU tables,
as we already span ~70TB of physical memory with 16MB hugepages.
Signed-off-by: Jonas Pfefferle <jpf@zurich.ibm.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
With the IOVA auto detection changes, bus scan is performed before
memory initialization. DPAA bus scan must not use rte_malloc in
its path.
Fixes: cf408c2247 ("eal: auto detect IOVA mode")
Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
When I was adding mlockall() to the testpmd application, it was
suggested to add a reference to the use case of mlockall(). This patch
adds it.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: John McNamara <john.mcnamara@intel.com>
Call the mlockall() function to attempt to lock all of the process
memory into physical RAM, preventing the kernel from paging any
of it to disk.
When using testpmd for performance testing, depending on the code path
taken, we see a couple of page faults in a row. These faults affect
the overall drop rate of testpmd. On Linux, the mlockall() call will
prefault all the pages of testpmd (and the DPDK libraries if linked
dynamically), even without LD_BIND_NOW.
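The call boils down to the following (a sketch; the actual testpmd
change may differ in how the error is reported):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static void
    lock_process_memory(void)
    {
        /* Lock current and future pages into RAM so the fast path does
         * not take page faults. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            fprintf(stderr, "mlockall() failed with error \"%s\"\n",
                    strerror(errno));
    }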
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Normally, command line argument strings are considered immutable, but
SPDK [1] and urdma [2] construct argv arrays to pass to rte_eal_init().
These strings are allocated using malloc() and freed after DPDK
initialization with free(). However, in the case of --file-prefix and
--huge-dir, DPDK takes the pointer to these strings in argv directly. If
a secondary process calls rte_eal_pci_probe() after rte_eal_init()
returns, as is done by SPDK, this causes a use-after-free error because
the strings have been freed by the calling code immediately after
rte_eal_init() returns.
This problem was observed when running SPDK example programs as a
secondary process and causes the secondary processes to fail:
Starting DPDK 16.11.1 initialization...
[ DPDK EAL parameters: identify -c 4 --file-prefix=spdk3260 --base-virtaddr=0x1000000000 --proc-type=auto ]
EAL: Detected 40 lcore(s)
EAL: Auto-detected process type: SECONDARY
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:81:00.0 on NUMA socket 1
EAL: probe driver: 8086:953 spdk_nvme
EAL: cannot connect to primary process!
EAL: Error - exiting with code: 1
Cause: Requested device 0000:81:00.0 cannot be used
Running strace shows that the file prefix has been zero'd out by the
time that the secondary process attempts to probe the NVMe device.
The use-after-free errors can be easily detected with valgrind:
==8489== Invalid read of size 1
==8489== at 0x4C30D22: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8489== by 0x58DB955: vfprintf (vfprintf.c:1637)
==8489== by 0x59A4685: __vsnprintf_chk (vsnprintf_chk.c:63)
==8489== by 0x59A45E7: __snprintf_chk (snprintf_chk.c:34)
==8489== by 0x1246AB: get_socket_path.constprop.0 (in /home/pmacarth/src/spdk/examples/nvme/identify/identify)
==8489== by 0x124B09: vfio_mp_sync_connect_to_primary (in /home/pmacarth/src/spdk/examples/nvme/identify/identify)
==8489== by 0x123BE4: vfio_get_group_fd.part.1 (in /home/pmacarth/src/spdk/examples/nvme/identify/identify)
==8489== by 0x124366: vfio_setup_device (in /home/pmacarth/src/spdk/examples/nvme/identify/identify)
==8489== by 0x126C8A: pci_vfio_map_resource (in /home/pmacarth/src/spdk/examples/nvme/identify/identify)
==8489== by 0x12B115: pci_probe_all_drivers.part.0 (in /home/pmacarth/src/spdk/examples/nvme/identify/identify)
==8489== by 0x12B596: rte_eal_pci_probe (in /home/pmacarth/src/spdk/examples/nvme/identify/identify)
==8489== by 0x11D5B5: spdk_pci_enumerate (pci.c:147)
==8489== Address 0x63f362e is 14 bytes inside a block of size 32 free'd
==8489== at 0x4C2ED5B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8489== by 0x11E6FB: spdk_free_args (init.c:136)
==8489== by 0x11EBF5: spdk_env_init (init.c:309)
==8489== by 0x10D2AA: main (identify.c:976)
==8489== Block was alloc'd at
==8489== at 0x4C2DB2F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8489== by 0x11E7D7: _sprintf_alloc (init.c:76)
==8489== by 0x11EA78: spdk_build_eal_cmdline (init.c:251)
==8489== by 0x11EA78: spdk_env_init (init.c:282)
==8489== by 0x10D2AA: main (identify.c:976)
==8489==
Fix this by using strdup() to create separate memory buffers for these
strings. Note that this patch will cause valgrind to report memory
leaks of these buffers as there is nowhere to free them. Using static
buffers is an option but would make these strings have a fixed maximum
length whereas there is currently no limit defined by the API.
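The fix boils down to duplicating the option string instead of storing
the caller's pointer (a sketch of the option parsing; constant names
assumed):

    case OPT_FILE_PREFIX_NUM:
        internal_config.hugefile_prefix = strdup(optarg);
        break;
    case OPT_HUGE_DIR_NUM:
        internal_config.hugepage_dir = strdup(optarg);
        break;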
[1] http://spdk.io
[2] https://github.com/zrlio/urdma
Fixes: af75078fec ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Patrick MacArthur <patrick@patrickmacarthur.net>
Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
If mmap fails, it will return the value MAP_FAILED. Checking for this
return code allows us to properly identify mmap failures and report
them as such to the calling function.
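A sketch of the pattern (the mapping parameters here are illustrative):

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        /* mmap() signals failure with MAP_FAILED, not NULL. */
        RTE_LOG(ERR, EAL, "mmap failed: %s\n", strerror(errno));
        return -1;
    }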
Signed-off-by: Seth Howell <seth.howell@intel.com>
Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
malloc_elem_free() clears (sets to 0) the trailer cookie when
RTE_MALLOC_DEBUG is enabled. When joining a free neighbor element,
part of the joined memory is not cleared because the length of the
trailer cookie in the middle is not accounted for.
This patch fixes the calculation of the free memory length to be
cleared in malloc_elem_free() by including the trailer cookie.
Fixes: af75078fec ("first public release")
Cc: stable@dpdk.org
Signed-off-by: Xueming Li <xuemingl@mellanox.com>
Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
Currently, enabling assertions requires setting CONFIG_RTE_LOG_LEVEL to
RTE_LOG_DEBUG. CONFIG_RTE_LOG_LEVEL is the default log level of the
control path, while RTE_LOG_DP_LEVEL is the log level of the data path.
It is a little hard to understand that assertions are decided by the
control-path log level, especially for assertions used on the data path.
On the other hand, DPDK needs a switch to enable assertions without
impacting the log output level, assuming "--log-level" is not specified.
Assertion is an important API to balance DPDK's high performance and
robustness. To promote assertion usage, it is valuable to decouple
assertions from CONFIG_RTE_LOG_LEVEL.
In one word: log is log, assertion is assertion, debug is hot pot :)
The rationale of this patch is to introduce a dedicated switch for
assertions: RTE_ENABLE_ASSERT
Signed-off-by: Xueming Li <xuemingl@mellanox.com>
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
The CPUs which support AVX512 have been released. Add support for
checking AVX512F instruction set.
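Consumers can then test for the flag at runtime, for example
(a usage sketch; rte_cpu_get_flag_enabled() and RTE_CPUFLAG_AVX512F are
existing EAL names, use_avx512_path is illustrative):

    #include <rte_cpuflags.h>

    if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
        use_avx512_path = 1;   /* select an AVX512 code path */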
Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
drivers/mempool/octeontx/octeontx_fpavf.c(789):
error #592: variable "fpa" is used before its value is set
RTE_SET_USED(fpa);
Fixes: 1c842786fe ("mempool/octeontx: probe fpavf PCIe devices")
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
We remove the xen-specific code in EAL, including the --xen-dom0
option, memory initialization code, compile-time dependencies, etc.
Related documents are removed or updated, and the EAL library
version is bumped.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Previously, this API was a wrapper used to obtain the "physical
address", i.e. the MFN address, in dom0.
As xen dom0 support is being removed, this API is no longer necessary.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Load a huge realloc_sections.ini file to check the malloc/realloc
ability of the cfgfile library.
Signed-off-by: Jacek Piasecki <jacekx.piasecki@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
New functions added to the cfgfile library make it possible
to significantly simplify the code of rte_cfgfile_load_with_params().
This patch provides the new body of this function.
Signed-off-by: Jacek Piasecki <jacekx.piasecki@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>