Since vhost_user_set_features failure is not handled in any way, a
single error log has been added to at least to let the user know that
something has gone wrong.
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
The queue allocation was changed, from allocating one queue-pair at a
time to one queue at a time. Most of the changes have been done, but
just with one being missed: the size of copying the old queue is still
based on queue-pair at numa_realloc(), which leads to overwritten issue.
As a result, crash may happen.
Fix it by specifying the right copy size. Also, the net queue macros
are not used any more. Remove them.
Fixes: ab4d7b9f1afc ("vhost: turn queue pair to vring")
Cc: stable@dpdk.org
Reported-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Jens Freimann <jfreiman@redhat.com>
Tested-by: Ciara Loftus <ciara.loftus@intel.com>
Accessing fields of a packed struct through unaligned pointers is
undefined behavior. Instead of passing pointers to particular fields,
a pointer to the root struct should be used. This patch does exactly
that.
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch fixes a memory leak.
virtio_net::guest_pages is allocated in vhost_setup_mem_table(),
reallocated in add_one_guest_page(), but never freed.
Fixes: e246896178e6 ("vhost: get guest/host physical address mappings")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Reviewed-by: Jens Freimann <jfreiman@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch implements the ops rx_queue_count for vhost PMD by adding
a helper function rte_vhost_rx_queue_count in vhost lib.
The ops rx_queue_count gets vhost RX queue avail count and helps to
understand the queue fill level.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
Acked-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
When we try to allocate guest pages we need to check the return value of
malloc(). Print an error message and return when it fails.
Signed-off-by: Jens Freimann <jfreiman@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Fixing typos across dpdk source code using codespell utility.
Skipped the ethdev driver's base code fixes to keep the base
code intact.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: John McNamara <john.mcnamara@intel.com>
Different drivers use internal macros like force_inline for compiler
always inline feature.
Standardizing it through __rte_always_inline macro.
Verified the change by comparing the output binary file.
No difference found in the output binary file with this change.
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
vhost since dpdk17.02 + qemu2.7 and above will cause failures of
new connection when negotiating to set MQ. (one queue pair works
well).
Because there exist some bugs in qemu code when introducing
VHOST_USER_PROTOCOL_F_REPLY_ACK to qemu. when dealing with the vhost
message VHOST_USER_SET_MEM_TABLE for the second time, qemu indeed
doesn't send the messge (The message needs to be sent only once)but
still will be waiting for dpdk's reply ack, then, qemu is always
freezing. DPDK code indeed works in the right way.
The feature VHOST_USER_PROTOCOL_F_REPLY_ACK has to be disabled
by default at the dpdk side in order to avoid the feature support of
DPDK + qemu at the same time. if doing like that, MQ can works well.
Cc: stable@dpdk.org
Reported-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
Tested-by: Ciara Loftus <ciara.loftus@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Exported headers must allow compilation with the strictest flags. This
commit addresses the following errors:
In file included from /tmp/check-includes.sh.20132.c:1:0:
build/include/rte_vhost.h:73:30: error: ISO C forbids zero-size array
'regions' [-Werror=pedantic]
[...]
Also:
- Add C++ awareness to rte_vhost.h for consistency with rte_eth_vhost.h.
- Move Linux includes into C++ block to prevent linking issues with
exported symbols.
- Update check-includes.sh following the removal of rte_virtio_net.h.
Finally, update check-includes.sh to ignore rte_vhost.h and rte_eth_vhost.h
from now on since the Linux headers they depend on are not clean enough:
In file included from /usr/include/linux/vhost.h:17:0,
from build/include/rte_vhost.h:43,
from build/include/rte_eth_vhost.h:44,
from /tmp/check-includes.sh.20132.c:1:
/usr/include/linux/virtio_ring.h: In function 'vring_init':
/usr/include/linux/virtio_ring.h:146:16: error: pointer of type 'void *'
used in arithmetic [-Werror=pointer-arith]
[...]
In file included from build/include/rte_vhost.h:43:0,
from build/include/rte_eth_vhost.h:44,
from /tmp/check-includes.sh.20132.c:1:
/usr/include/linux/vhost.h: At top level:
/usr/include/linux/vhost.h:73:3: error: ISO C99 doesn't support unnamed
structs/unions [-Werror=pedantic]
[...]
Fixes: eb32247457fe ("vhost: export guest memory regions")
Fixes: a798beb47c8e ("vhost: rename header file")
Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
For zero copy mode, we need pin the mbuf to not let the underlaying PMD
driver (or the app) free the mbuf. Currently, only the heading mbuf is
pinned. However, the mbuf free function would try to free all mbufs
in the mbuf chain (-1 to the refcnt). This may lead the head mbuf being
still pinned, while the other subsequent mbufs are actually freed. Which
is wrong.
It becomes more fatal after the mbuf refactor, more specificly, after
the commit 8f094a9ac5d7 ("mbuf: set mbuf fields while in pool"). The
refcnt resets to 1 after the last real reference. OTOH, it leads to a
situtation that we never know one mbuf is actually freed or not. This
would result the mbuf __just__ after the heading mbuf being freed twice:
it's firstly freed (and put back to mempool) when the underlaying PMD
finishes the DMA. Later, it will then be freed again when vhost unpins
it. Meaning, one mbuf may be returned to the mempool twice, while in
turn, being allocated twice later. Something uncertain may happen then.
For example, the VM2VM case becomes broken.
Fixes: b0a985d1f340 ("vhost: add dequeue zero copy")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Like what we did for virtio PMD driver [0][1], we could also apply such
trick to vhost, to avoid the memory write on net header when necessary.
[0]: c9ea670c1dc7 ("net/virtio: fix performance regression due to TSO")
[1]: 16994abee215 ("net/virtio: optimize header reset on any layout")
With this, the cache issue of the mergeable path is again greatly reduced:
even the write of "num_buffers" could be avoided. A quick PVP test shows
the gap between the mergeable Rx and non-mergeable Rx is pretty small now:
they are basically the same in my test.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
A "return" is missing on error, which could lead to a "use after free"
issue (about var "conn").
Coverity issue: 143476
Fixes: 65388b43f592 ("vhost: fix fd leaks for vhost-user server mode")
Reported-by: John McNamara <john.mcnamara@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
It doesn't make any sense to invoke destroy_device() callback at
while handling SET_MEM_TABLE message.
From the vhost-user spec, it's the GET_VRING_BASE message indicates
the end of a vhost device: the destroy_device() should be invoked
from there (luckily, we already did that).
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
rte_mbuf struct is something more likely will be used only in vhost-user
net driver, while we have made vhost-user generic enough that it can
be used for implementing other drivers (such as vhost-user SCSI), they
have also include <rte_mbuf.h>. Otherwise, the build will be broken.
We could workaround it by using forward declaration, so that other
non-net drivers won't need include <rte_mbuf.h>.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Rename "rte_virtio_net.h" to "rte_vhost.h", to not let it be virtio
net specific.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
We used to use rte_vhost_driver_session_start() to trigger the vhost-user
session. It takes no argument, thus it's a global trigger. And it could
be problematic.
The issue is, currently, rte_vhost_driver_register(path, flags) actually
tries to put it into the session loop (by fdset_add). However, it needs
a set of APIs to set a vhost-user driver properly:
* rte_vhost_driver_register(path, flags);
* rte_vhost_driver_set_features(path, features);
* rte_vhost_driver_callback_register(path, vhost_device_ops);
If a new vhost-user driver is registered after the trigger (think OVS-DPDK
that could add a port dynamically from cmdline), the current code will
effectively starts the session for the new driver just after the first
API rte_vhost_driver_register() is invoked, leaving later calls taking
no effect at all.
To handle the case properly, this patch introduce a new API,
rte_vhost_driver_start(path), to trigger a specific vhost-user driver.
To do that, the rte_vhost_driver_register(path, flags) is simplified
to create the socket only and let rte_vhost_driver_start(path) to
actually put it into the session loop.
Meanwhile, the rte_vhost_driver_session_start is removed: we could hide
the session thread internally (create the thread if it has not been
created). This would also simplify the application.
NOTE: the API order in prog guide is slightly adjusted for showing the
correct invoke order.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Export few APIs for the vhost-user driver to log the guest memory writes,
which is a must for live migration support.
This patch basically moves vhost_log_write() and vhost_log_used_vring()
into vhost.h and then add an wrapper (the public API) to them.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Features could be changed after the feature negotiation. For example,
VHOST_F_LOG_ALL will be set/cleared at the start/end of live migration,
respecitively. Thus, we need a new callback to inform the application
on such change.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Rename "virtio-net" to "vhost" in the API comments and vhost prog guide.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
rename "virtio_net_device_ops" to "vhost_device_ops", to not let it
be virtio-net specific.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
They are virtio-net specific and should be defined inside the virtio-net
driver.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Currently, we check vq->desc, vq->kickfd and vq->callfd to know whether
a virtio device is ready or not. However, we only do it when handling
SET_VRING_KICK message, which could be wrong if a vhost-user frontend
send SET_VRING_KICK first and SET_VRING_CALL later.
To work for all possible vhost-user frontend implementations, we could
move the ready check at the end of vhost-user message handler.
Meanwhile, since we do the check more often than before, the "virtio
not ready" message is dropped, to not flood the screen.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
We used to use rte_vhost_get_queue_num() for telling how many vrings.
However, the return value is the number of "queue pairs", which is
very virtio-net specific. To make it generic, we should return the
number of vrings instead, and let the driver do the proper translation.
Say, virtio-net driver could turn it to the number of queue pairs by
dividing 2.
Meanwhile, mark rte_vhost_get_queue_num as deprecated.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
The queue pair is very virtio-net specific, other devices don't have
such concept. To make it generic, we should log the number of vrings
instead of the number of queue pairs.
This patch just does a simple convert, a later patch would export the
number of vrings to applications.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Some vhost-user driver may need this info to setup its own page tables
for GPA (guest physical addr) to HPA (host physical addr) translation.
SPDK (Storage Performance Development Kit) is one example.
Besides, by exporting this memory info, we could also export the
gpa_to_vva() as an inline function, which helps for performance.
Otherwise, it has to be referenced indirectly by a "vid".
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Assume there is an application both support vhost-user net and
vhost-user scsi, the callback should be different. Making notify
ops per vhost driver allow application define different set of
callbacks for different driver.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Introduce few APIs to set/get/enable/disable driver features.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
A vhost-user server socket could have many connections, thus many connfd.
However, we currently just use one single int var to store it. Meaning,
it will get overwritten every time a new connection is created.
While this will not create fatal issue as it sounds (since the correct
connfd is closured to the event loop thread by fdset_add), it may cause
fd leaks if a user invokes rte_vhost_driver_unregister before shutting
down all connections: it just closes the recent connfd.
A simple example that should be able to reproduce this leaks issues is,
del the ovs vhost-user port while the connected VMs are still alive. (Note
that it's suggested to use one socket for one VM, which makes the issue
not that fatal as it sounds again).
Since we already use a struct "vhost_user_connection" to track all info
about one connection, it's obvious that we should put the connfd there.
Then we could build a connection list inside the vhost_user_socket struct,
to represent all connections belong that socket file.
Fixes: 164fd396788d ("vhost: fix unregistering in client mode")
Cc: stable@dpdk.org
Cc: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
The broadcast_rarp field in the virtio_net struct is checked in the
dequeue datapath regardless of whether descriptors are available or not.
As it is checked with cmpset leading to a write, false sharing on the
virtio_net struct can happen between enqueue and dequeue datapaths
regardless of whether a RARP is requested. In OVS, the issue can cause
a uni-directional performance drop of up to 15%.
Fix that by only performing the cmpset if a read of broadcast_rarp
indicates that the cmpset is likely to succeed.
Fixes: a66bcad32240 ("vhost: arrange struct fields for better cache sharing")
Cc: stable@dpdk.org
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch implements the function for the application to
get the MTU value.
rte_vhost_get_mtu() fills the mtu parameter with the MTU value
set in QEMU if VIRTIO_NET_F_MTU has been negotiated and returns 0,
-ENOTSUP otherwise.
The function returns -EAGAIN if Virtio feature negotiation
didn't happened yet.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch adds a new status flag indicating the Virtio device
is ready to operate.
This is required to be able to call rte_vhost_mtu_get() in the
.new_device() callback, as rte_vhost_mtu_get needs that the
negotiation is done, but it is too early to rely on running status
flag, which is set just after .new_device() returns.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch implements the vhost-user MTU protocol feature support.
When VIRTIO_NET_F_MTU is negotiated, QEMU notifies the vhost-user
backend with the configured MTU if dedicated protocol feature is
supported.
The value can be used by the application to ensure consistency with
value set by the user.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch enables the new VIRTIO_NET_F_MTU feature,
which makes possible for the host to advise the guest
with its maximum supported MTU.
MTU value is set via QEMU parameters, either via Libvirt XML, or
directly in virtio-net device command line arguments.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
We used to allocate queues based on the index from SET_VRING_CALL
request: if corresponding queue hasn't been allocated, allocate it.
Though it's pratically right (it's the first per-vring request we
will get from QEMU for vhost-user negotiation), but it's not technically
right: it's not documented in the vhost-user spec that it will always
be the first per-vring request. For example, SET_VRING_ADDR could also
be the first per-vring request.
Thus, we should not depend the SET_VRING_CALL on queue allocation.
Instead, we could catch all the per-vring messages at the entrance of
request handler, and allocate one if it hasn't been allocated before.
By that, we could remove a hack.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
0x8000 is the max virito-net queue pairs the virtio 1.0 spec claims to
support. While for vhost-user, it's a different story: the max vring
index could be passed by the vhost-user spec is 0xff, masked by the
VHOST_USER_VRING_IDX_MASK.
That said, the max queue pairs could vhost-user could supported is 0x80.
If user are asking more, I think the vhost-user need be extended.
Fixes: b09b198bfb5c ("vhost-user: announce queue number in message")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Some macros (say VIRTIO_NET_F_MQ) are needed for enabling multiple queue,
however they are introduced since kernel v3.8, meaning build error happens
if we build DPDK vhost on those platforms.
71dfdbe66a66 ("vhost: fix build with kernel < 3.8") meant to fix it, but
in a wrong way: it completely disables the MQ features for those kernels.
However, the MQ feature doesn't depend on the kernel at all (except the
macros dependency stated above), that we could still enable the MQ feature
even the host kernel has no such support.
The right fix is to define the macro if it's not defined.
Fixes: 71dfdbe66a66 ("vhost: fix build with kernel < 3.8")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Inability to connect to socket is a normal situation
in client mode because, in common case, server isn't
started yet. RTE_LOG_WARNING should be suitable for
the case of some unusual errors.
Message about reconnection is not an error at all.
Fixes: e623e0c6d8a5 ("vhost: add reconnect ability")
Cc: stable@dpdk.org
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
fdset_add increments pfdset->num, but fdset_del doesn't decrement
pfdset->num, so if we call fdset_add then fdset_del in a loop without
calling fdset_shrink, we can easily exceed MAX_FDS with only a few
number of fds used.
So my solution is simply to call fdset_shrink in fdset_add when it
exceeds MAX_FDS.
Because fdset_shrink and fdset_add locks pfdset->fd_mutex we can't
call fdset_shrink inside fdset_add because that would cause a dead
lock, so this patch split fdset_shrink in two, fdset_shrink and
fdset_shrink_nolock.
Fixes: 59317cef249c ("vhost: allow many vhost-user ports")
Cc: stable@dpdk.org
Signed-off-by: Matthias Gatto <matthias.gatto@outscale.com>
Before this patch, the management of dependencies between directories
had several issues:
- the generation of .depdirs, done at configuration is slow: it can take
more than one minute on some slow targets (usually ~10s on a standard
PC without -j).
- for instance, it is possible to express a dependency like:
- app/foo depends on lib/librte_foo
- and lib/librte_foo depends on app/bar
But this won't work because the directories are traversed with a
depth-first algorithm, so we have to choose between doing 'app' before
or after 'lib'.
- the script depdirs-rule.sh is too complex.
- we cannot use "make -d" for debug, because the output of make is used for
the generation of .depdirs.
This patch moves the DEPDIRS-* variables in the upper Makefile, making
the dependencies much easier to calculate. A DEPDIRS variable is still
used to process library dependencies in LDLIBS.
After this commit, "make config" is almost immediate.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Tested-by: Robin Jarry <robin.jarry@6wind.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Found with clang static analysis:
lib/librte_vhost/vhost_user.c:996:3: warning:
Value stored to 'ret' is never read
ret = vhost_user_get_vring_base(dev, &msg.payload.state);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Emmanuel Roullit <emmanuel.roullit@gmail.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Found with clang static analysis:
lib/librte_vhost/virtio_net.c:723:17: warning:
Access to field 'data_off' results in a dereference of a null pointer
(loaded from variable 'tcp_hdr')
m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
^~~~~~~~~~~~~~~~~
Fixes: d0cf91303d73 ("vhost: add Tx offload capabilities")
Cc: stable@dpdk.org
Signed-off-by: Emmanuel Roullit <emmanuel.roullit@gmail.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Setting up the mapping from GPA (guest physical address) to HPA (guest
physical address) could be very time consuming when the guest memory is
backened with small pages (4K). The bigger the guest memory, the longer
it takes. This could lead a very long vhost-user negotiation.
Since the mapping is only needed in zero copy mode so far, we could
avoid such time consuming settup when zero copy is turned off (which is
the default case).
It's actually a workaround, a right fix might be to start a new thread,
and hide the big latency there.
Fixes: e246896178e6 ("vhost: get guest/host physical address mappings")
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>