rte_mbuf struct is something more likely will be used only in vhost-user
net driver, while we have made vhost-user generic enough that it can
be used for implementing other drivers (such as vhost-user SCSI), they
have also include <rte_mbuf.h>. Otherwise, the build will be broken.
We could workaround it by using forward declaration, so that other
non-net drivers won't need include <rte_mbuf.h>.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Rename "rte_virtio_net.h" to "rte_vhost.h", to not let it be virtio
net specific.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
We used to use rte_vhost_driver_session_start() to trigger the vhost-user
session. It takes no argument, thus it's a global trigger. And it could
be problematic.
The issue is, currently, rte_vhost_driver_register(path, flags) actually
tries to put it into the session loop (by fdset_add). However, it needs
a set of APIs to set a vhost-user driver properly:
* rte_vhost_driver_register(path, flags);
* rte_vhost_driver_set_features(path, features);
* rte_vhost_driver_callback_register(path, vhost_device_ops);
If a new vhost-user driver is registered after the trigger (think OVS-DPDK
that could add a port dynamically from cmdline), the current code will
effectively starts the session for the new driver just after the first
API rte_vhost_driver_register() is invoked, leaving later calls taking
no effect at all.
To handle the case properly, this patch introduce a new API,
rte_vhost_driver_start(path), to trigger a specific vhost-user driver.
To do that, the rte_vhost_driver_register(path, flags) is simplified
to create the socket only and let rte_vhost_driver_start(path) to
actually put it into the session loop.
Meanwhile, the rte_vhost_driver_session_start is removed: we could hide
the session thread internally (create the thread if it has not been
created). This would also simplify the application.
NOTE: the API order in prog guide is slightly adjusted for showing the
correct invoke order.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Export few APIs for the vhost-user driver to log the guest memory writes,
which is a must for live migration support.
This patch basically moves vhost_log_write() and vhost_log_used_vring()
into vhost.h and then add an wrapper (the public API) to them.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Features could be changed after the feature negotiation. For example,
VHOST_F_LOG_ALL will be set/cleared at the start/end of live migration,
respecitively. Thus, we need a new callback to inform the application
on such change.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Rename "virtio-net" to "vhost" in the API comments and vhost prog guide.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
rename "virtio_net_device_ops" to "vhost_device_ops", to not let it
be virtio-net specific.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
They are virtio-net specific and should be defined inside the virtio-net
driver.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Currently, we check vq->desc, vq->kickfd and vq->callfd to know whether
a virtio device is ready or not. However, we only do it when handling
SET_VRING_KICK message, which could be wrong if a vhost-user frontend
send SET_VRING_KICK first and SET_VRING_CALL later.
To work for all possible vhost-user frontend implementations, we could
move the ready check at the end of vhost-user message handler.
Meanwhile, since we do the check more often than before, the "virtio
not ready" message is dropped, to not flood the screen.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
We used to use rte_vhost_get_queue_num() for telling how many vrings.
However, the return value is the number of "queue pairs", which is
very virtio-net specific. To make it generic, we should return the
number of vrings instead, and let the driver do the proper translation.
Say, virtio-net driver could turn it to the number of queue pairs by
dividing 2.
Meanwhile, mark rte_vhost_get_queue_num as deprecated.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
The queue pair is very virtio-net specific, other devices don't have
such concept. To make it generic, we should log the number of vrings
instead of the number of queue pairs.
This patch just does a simple convert, a later patch would export the
number of vrings to applications.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Some vhost-user driver may need this info to setup its own page tables
for GPA (guest physical addr) to HPA (host physical addr) translation.
SPDK (Storage Performance Development Kit) is one example.
Besides, by exporting this memory info, we could also export the
gpa_to_vva() as an inline function, which helps for performance.
Otherwise, it has to be referenced indirectly by a "vid".
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Assume there is an application both support vhost-user net and
vhost-user scsi, the callback should be different. Making notify
ops per vhost driver allow application define different set of
callbacks for different driver.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Introduce few APIs to set/get/enable/disable driver features.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
A vhost-user server socket could have many connections, thus many connfd.
However, we currently just use one single int var to store it. Meaning,
it will get overwritten every time a new connection is created.
While this will not create fatal issue as it sounds (since the correct
connfd is closured to the event loop thread by fdset_add), it may cause
fd leaks if a user invokes rte_vhost_driver_unregister before shutting
down all connections: it just closes the recent connfd.
A simple example that should be able to reproduce this leaks issues is,
del the ovs vhost-user port while the connected VMs are still alive. (Note
that it's suggested to use one socket for one VM, which makes the issue
not that fatal as it sounds again).
Since we already use a struct "vhost_user_connection" to track all info
about one connection, it's obvious that we should put the connfd there.
Then we could build a connection list inside the vhost_user_socket struct,
to represent all connections belong that socket file.
Fixes: 164fd396788d ("vhost: fix unregistering in client mode")
Cc: stable@dpdk.org
Cc: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
A new interrupt type, RTE_INTR_HANDLE_VDEV, is added to support lsc and rxq
interrupt for vdev.
For lsc interrupt, except from original EPOLLIN events, we also listen for
socket peer closed connection event (EPOLLRDHUP and EPOLLHUP).
For rxq interrupt, add a precondition to avoid invoking any vfio and uio
code.
For intr_handle initialization, let each vdev driver to do that.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
The broadcast_rarp field in the virtio_net struct is checked in the
dequeue datapath regardless of whether descriptors are available or not.
As it is checked with cmpset leading to a write, false sharing on the
virtio_net struct can happen between enqueue and dequeue datapaths
regardless of whether a RARP is requested. In OVS, the issue can cause
a uni-directional performance drop of up to 15%.
Fix that by only performing the cmpset if a read of broadcast_rarp
indicates that the cmpset is likely to succeed.
Fixes: a66bcad32240 ("vhost: arrange struct fields for better cache sharing")
Cc: stable@dpdk.org
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch implements the function for the application to
get the MTU value.
rte_vhost_get_mtu() fills the mtu parameter with the MTU value
set in QEMU if VIRTIO_NET_F_MTU has been negotiated and returns 0,
-ENOTSUP otherwise.
The function returns -EAGAIN if Virtio feature negotiation
didn't happened yet.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch adds a new status flag indicating the Virtio device
is ready to operate.
This is required to be able to call rte_vhost_mtu_get() in the
.new_device() callback, as rte_vhost_mtu_get needs that the
negotiation is done, but it is too early to rely on running status
flag, which is set just after .new_device() returns.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch implements the vhost-user MTU protocol feature support.
When VIRTIO_NET_F_MTU is negotiated, QEMU notifies the vhost-user
backend with the configured MTU if dedicated protocol feature is
supported.
The value can be used by the application to ensure consistency with
value set by the user.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch enables the new VIRTIO_NET_F_MTU feature,
which makes possible for the host to advise the guest
with its maximum supported MTU.
MTU value is set via QEMU parameters, either via Libvirt XML, or
directly in virtio-net device command line arguments.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
We used to allocate queues based on the index from SET_VRING_CALL
request: if corresponding queue hasn't been allocated, allocate it.
Though it's pratically right (it's the first per-vring request we
will get from QEMU for vhost-user negotiation), but it's not technically
right: it's not documented in the vhost-user spec that it will always
be the first per-vring request. For example, SET_VRING_ADDR could also
be the first per-vring request.
Thus, we should not depend the SET_VRING_CALL on queue allocation.
Instead, we could catch all the per-vring messages at the entrance of
request handler, and allocate one if it hasn't been allocated before.
By that, we could remove a hack.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
0x8000 is the max virito-net queue pairs the virtio 1.0 spec claims to
support. While for vhost-user, it's a different story: the max vring
index could be passed by the vhost-user spec is 0xff, masked by the
VHOST_USER_VRING_IDX_MASK.
That said, the max queue pairs could vhost-user could supported is 0x80.
If user are asking more, I think the vhost-user need be extended.
Fixes: b09b198bfb5c ("vhost-user: announce queue number in message")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Some macros (say VIRTIO_NET_F_MQ) are needed for enabling multiple queue,
however they are introduced since kernel v3.8, meaning build error happens
if we build DPDK vhost on those platforms.
71dfdbe66a66 ("vhost: fix build with kernel < 3.8") meant to fix it, but
in a wrong way: it completely disables the MQ features for those kernels.
However, the MQ feature doesn't depend on the kernel at all (except the
macros dependency stated above), that we could still enable the MQ feature
even the host kernel has no such support.
The right fix is to define the macro if it's not defined.
Fixes: 71dfdbe66a66 ("vhost: fix build with kernel < 3.8")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Inability to connect to socket is a normal situation
in client mode because, in common case, server isn't
started yet. RTE_LOG_WARNING should be suitable for
the case of some unusual errors.
Message about reconnection is not an error at all.
Fixes: e623e0c6d8a5 ("vhost: add reconnect ability")
Cc: stable@dpdk.org
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
fdset_add increments pfdset->num, but fdset_del doesn't decrement
pfdset->num, so if we call fdset_add then fdset_del in a loop without
calling fdset_shrink, we can easily exceed MAX_FDS with only a few
number of fds used.
So my solution is simply to call fdset_shrink in fdset_add when it
exceeds MAX_FDS.
Because fdset_shrink and fdset_add locks pfdset->fd_mutex we can't
call fdset_shrink inside fdset_add because that would cause a dead
lock, so this patch split fdset_shrink in two, fdset_shrink and
fdset_shrink_nolock.
Fixes: 59317cef249c ("vhost: allow many vhost-user ports")
Cc: stable@dpdk.org
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
In rte_eth_check_reta_mask(), it is required to align the size of the RETA
table to RTE_RETA_GROUP_SIZE but as the size can be less than the limit,
this should be removed. The change is also applied to a command of testpmd.
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
This patch adds MPLS and GRE items to generic rte flow.
Signed-off-by: Beilei Xing <beilei.xing@intel.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Prior to this patch only UIO/VFIO interrupt handlers types were supported.
This patch adds support for the external interrupt handler type, allowing
external drivers to set their own fds with specific interrupt handlers.
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Add support for SLES12SP3, which uses kernel 4.4,
but backported features from newer kernels.
Signed-off-by: Nirmoy Das <ndas@suse.de>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
This commit adds support to the cfgfile library for parsing a key=value
line that has no value string specified (e.g., "key="). This can be used
to override a configuration attribute that has a default value or default
list of values to set it back to an undefined value to disable
functionality.
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Acked-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
When parsing a ini file with a "key = value" line that has both "key" and
"value" sized to the maximum allowed length causes a parsing failure. The
internal "buffer" variable should be sized at least as large as the maximum
for both fields. This commit updates the local array to be sized to hold
the max name, max value, " = ", and the nul terminator.
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Acked-by: Keith Wiles <keith.wiles@intel.com>
Acked-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
The call to memchr() uses the absolute length of the string buffer instead
of the actual length of the string returned by fgets(). This causes the
search to go beyond the '\n' character and find ';' characters in random
garbage on the stack. This then causes the 'len' variable to be updated
and the subsequent search for the '=' character to potentially find one
beyond the first newline character.
Since this bug relies on ';' and '=' characters appearing in random places
in the 'buffer' variable it is intermittently reproducible at best.
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Acked-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
The current cfgfile comment character is hardcoded to ';'. This commit a
new API to allow the user to specify which comment character to use while
parsing the file.
This is to ease adoption by applications that have an existing
configuration file which may use a different comment character. For
instance, an application may already have a configuration file that uses
the '#' as the comment character.
The approach of using a new API with an extensible parameters structure was
used rather than simply adding a new argument to the existing API to allow
for additional arguments to be introduced in the future.
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Acked-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
The current implementation of the cfgfile library requires that all
key=value pairs be within [SECTION] definitions. The ini file standard
allows for key=value pairs in an unnamed section.
https://en.wikipedia.org/wiki/INI_file#Global_properties
This commit adds the capability of parsing key=value pairs from such an
unnamed section. The CFG_FLAG_GLOBAL_SECTION flag must be passed to the
rte_cfgfile_load() API to enable this functionality. Any key=value pairs
found before the first section can be accessed in the section named
"GLOBAL".
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Acked-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
glibc 2.25 is warning about if applications depend on
sys/types.h for makedev macro, it expects to be included
from <sys/sysmacros.h>
Found this error while testing with GCC 6.3.1 on archlinux.
lib/librte_eal/linuxapp/eal/eal_pci_uio.c: In function ‘pci_mknod_uio_dev’:
lib/librte_eal/linuxapp/eal/eal_pci_uio.c:134:13:
error: In the GNU C Library, "makedev" is defined
by <sys/sysmacros.h>. For historical compatibility, it is
currently defined by <sys/types.h> as well, but we plan to
remove this soon. To use "makedev", include <sys/sysmacros.h>
directly. If you did not intend to use a system-defined macro
"makedev", you should undefine it after including <sys/types.h>. [-Werror]
dev = makedev(major, minor);
^~~~~~~~~~~~~~~~~
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
When loading nic_uio from /boot/loader.conf as specified in the Getting
Started Guide doc, the NIC devices were not bound at boot. Unloading the
nic_uio driver and reloading it would cause them to be bound, however.
The root cause appears to be the fact that when the module is loaded at
boot, the call to find the pci device when parsing the b:d:f parameter
fails to return the device. That means that later on when the device
is probed as part of a PCI scan, no action is taken as it's not recorded
as a device to be used.
We fix this by having the b:d:f string parsed again on probe if the
initial check to see if it's an already-known device fails. In my tests,
this causes the NIC devices to be successfully bound at boot time, as
well as leaving things working as before in the case the module is loaded
post-boot.
Fixes: 764bf26873b9 ("add FreeBSD support")
Cc: stable@dpdk.org
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
When binding with vfio-pci, secondary process cannot be started with
an error message:
cannot find TAILQ entry for PCI device.
It's due to: struct rte_pci_addr is padded with 1 byte for alignment
by compiler. Then below comparison in commit 2f4adfad0a69
("vfio: add multiprocess support") will fail if the last byte is not
initialized.
memcmp(&vfio_res->pci_addr, &dev->addr, sizeof(dev->addr)
And commit cdc242f260e7 ("eal/linux: support running as unprivileged user")
just triggers this bug by using a stack un-initialized variable.
The fix is to use rte_eal_compare_pci_addr() for pci addr comparison.
Fixes: 2f4adfad0a69 ("vfio: add multiprocess support")
Fixes: cdc242f260e7 ("eal/linux: support running as unprivileged user")
Cc: stable@dpdk.org
Reported-by: Pawel Rutkowski <pawelx.rutkowski@intel.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
Some compilers require definition of vfio_iommu_spapr_tce_ddw_info
before its use in vfio_iommu_spapr_tce_info, so move tce_info
definition below tce_ddw_info.
Fixes: 468f42cc2645 ("vfio: fix build on old kernel")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Recently added "dma_zalloc_coherent()" call is causing build error
for Linux kernels < 3.2.
compile error:
lib/librte_eal/linuxapp/igb_uio/igb_uio.c:
In function ‘igbuio_pci_probe’:
lib/librte_eal/linuxapp/igb_uio/igb_uio.c:434:2:
error: implicit declaration of function ‘dma_zalloc_coherent’
[-Werror=implicit-function-declaration]
map_addr = dma_zalloc_coherent(&dev->dev, 1024,
^
dma_zalloc_coherent() introduced with Linux kernel 3.2, with commit
Linux: 842fa69f3e0c ("include/linux/dma-mapping.h: add dma_zalloc_coherent()")
Since it does not exist for older kernels, causing a build error.
Switched to dma_alloc_coherent() API to prevent build error.
Fixes: d287e4d41be0 ("igb_uio: map dummy DMA forcing IOMMU domain attachment")
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Moved from lib/librte_mempool, stack mempool handler is an independent
driver.
Shared builds would now require to link in librte_mempool_stack for
"stack" mempool handler.
Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
Moved from lib/librte_mempool, ring mempool is now an independent
driver.
Shared builds would now need to add librte_mempool_ring for:
* ring_mp_mc
* ring_sp_sc
* ring_sp_mc
* ring_mp_sc
Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
In case the stack or ring mempool handler are compiled as shared
library and not linked in with test binary, segfault is reported.
This is because return value of rte_mempool_set_ops_byname is not
being checked in rte_mempool_ops_alloc.
This patch handles error returned from rte_mempool_set_ops_byname
when a mempool is not found.
Fixes: 449c49b93a6b ("mempool: support handler operations")
Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>