Fix issue reported by Coverity.
Coverity ID 124556
If the buffer is treated as a null terminated string in later operations,
a buffer overflow or over-read may occur.
In vhost_set_ifname: The string buffer may not have a null terminator if
the source string's length is equal to the buffer size
Fixes: 54292e9520 ("vhost: support ifname for vhost-user")
Signed-off-by: Daniel Mrzyglod <danielx.t.mrzyglod@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
We have to reset the virtio net hdr at virtio_enqueue_offload()
before, due to all mbufs share a single virtio_hdr structure:
struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, }, 0};
foreach (mbuf) {
virtio_enqueue_offload(mbuf, &virtio_hdr.hdr);
copy net hdr and mbuf to desc buf
}
However, after the vhost rxtx refactor, the code looks like:
copy_mbuf_to_desc(mbuf)
{
struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, }, 0}
virtio_enqueue_offload(mbuf, &virtio_hdr.hdr);
copy net hdr and mbuf to desc buf
}
foreach (mbuf) {
copy_mbuf_to_desc(mbuf);
}
Therefore, the memset at virtio_enqueue_offload() is not necessary
any more; remove it.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
Currently, default values of kickfd and callfd are -1.
If the values are -1, current code guesses kickfd and callfd haven't
been initialized yet. Then vhost library will guess the virtqueue isn't
ready for processing.
But callfd and kickfd will be set as -1 when "--enable-kvm"
isn't specified in QEMU command line. It means we cannot treat -1 as
uninitialized state.
The patch defines -1 and -2 as VIRTIO_INVALID_EVENTFD and
VIRTIO_UNINITIALIZED_EVENTFD, and uses VIRTIO_UNINITIALIZED_EVENTFD for
the default values of kickfd and callfd.
Signed-off-by: Tetsuya Mukawa <mukawa@igel.co.jp>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
If a malicious guest forges a dead loop chain, it could lead to a dead
loop of copying the desc buf to mbuf, which results to all mbuf being
exhausted.
Add a var nr_desc to avoid such case.
Suggested-by: Huawei Xie <huawei.xie@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
A malicious guest may easily forge some illegal vring desc buf.
To make our vhost robust, we need make sure desc->next will not
go beyond the vq->desc[] array.
Suggested-by: Rich Lane <rich.lane@bigswitch.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
We need make sure that desc->len is bigger than the size of virtio net
header, otherwise, unexpected behaviour might happen due to "desc_avail"
would become a huge number with for following code:
desc_avail = desc->len - vq->vhost_hlen;
For dequeue code path, it will try to allocate enough mbuf to hold such
size of desc buf, which ends up with consuming all mbufs, leading to no
free mbuf is available. Therefore, you might see an error message:
Failed to allocate memory for mbuf.
Also, for both dequeue/enqueue code path, while it copies data from/to
desc buf, the big "desc_avail" would result to access memory not belong
the desc buf, which could lead to some potential memory access errors.
A malicious guest could easily forge such malformed vring desc buf. Every
time we restart an interrupted DPDK application inside guest would also
trigger this issue, as all huge pages are reset to 0 during DPDK re-init,
leading to desc->len being 0.
Therefore, this patch does a sanity check for desc->len, to make vhost
robust.
Reported-by: Rich Lane <rich.lane@bigswitch.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
VIRTIO_NET_F_MRG_RXBUF is a default feature supported by vhost.
Adding unlikely for VIRTIO_NET_F_MRG_RXBUF detection doesn't
make sense to me at all.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
First of all, rte_memcpy() is mostly useful for copying big packets
by leveraging hardware advanced instructions like AVX. But for virtio
net hdr, which is 12 bytes at most, invoking rte_memcpy() will not
introduce any performance boost.
And, to my suprise, rte_memcpy() is VERY huge. Since rte_memcpy()
is inlined, it increases the binary code size linearly every time
we call it at a different place. Replacing the two rte_memcpy()
with directly copy saves nearly 12K bytes of code size!
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Current virtio_dev_merge_rx() implementation just looks like the
old rte_vhost_dequeue_burst(), full of twisted logic, that you
can see same code block in quite many different places.
However, the logic of virtio_dev_merge_rx() is quite similar to
virtio_dev_rx(). The big difference is that the mergeable one
could allocate more than one available entries to hold the data.
Fetching all available entries to vec_buf at once makes the
difference a bit bigger then.
The refactored code looks like below:
while (mbuf_has_not_drained_totally || mbuf_has_next) {
if (this_desc_has_no_room) {
this_desc = fetch_next_from_vec_buf();
if (it is the last of a desc chain)
update_used_ring();
}
if (this_mbuf_has_drained_totally)
mbuf = fetch_next_mbuf();
COPY(this_desc, this_mbuf);
}
This patch reduces quite many lines of code, therefore, make it much
more readable.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This is a simple refactor, as there isn't any twisted logic in old
code. Here I just broke the code and introduced two helper functions,
reserve_avail_buf() and copy_mbuf_to_desc() to make the code more
readable.
Also, it saves nearly 1K bytes of binary code size.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
The current rte_vhost_dequeue_burst() implementation is a bit messy
and logic twisted. And you could see repeat code here and there.
However, rte_vhost_dequeue_burst() acutally does a simple job: copy
the packet data from vring desc to mbuf. What's tricky here is:
- desc buff could be chained (by desc->next field), so that you need
fetch next one if current is wholly drained.
- One mbuf could not be big enough to hold all desc buff, hence you
need to chain the mbuf as well, by the mbuf->next field.
The simplified code looks like following:
while (this_desc_is_not_drained_totally || has_next_desc) {
if (this_desc_has_drained_totally) {
this_desc = next_desc();
}
if (mbuf_has_no_room) {
mbuf = allocate_a_new_mbuf();
}
COPY(mbuf, desc);
}
Note that the old patch does a special handling for skipping virtio
header. However, that could be simply done by adjusting desc_avail
and desc_offset var:
desc_avail = desc->len - vq->vhost_hlen;
desc_offset = vq->vhost_hlen;
This refactor makes the code much more readable (IMO), yet it reduces
binary code size.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Add DT_NEEDED entries for external library dependencies which
are the most critical ones for sane operation.
Clean up vhost_cuse CFLAGS/LDFLAGS confusion while at it.
Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
vq is allocated on pairs, hence we should do pair reallocation
at numa_realloc() as well, otherwise an error like following
occurs while do numa reallocation:
VHOST_CONFIG: reallocate vq from 0 to 1 node
PANIC in rte_free():
Fatal error: Invalid memory
The reason we don't catch it is because numa_realloc() will
not take effect when RTE_LIBRTE_VHOST_NUMA is not enabled,
which is the default case.
Fixes: e049ca6d10 ("vhost-user: prepare multiple queue setup")
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
Tested-by: Ciara Loftus <ciara.loftus@intel.com>
We could first check if we need realloc vq or not, if so,
reallocate it. We then do similar to vhost dev realloc.
This could get rid of the tons of repeated "if (realloc_dev)"
and "if (realloc_vq)" statements, therefore, makes code
a bit more readable.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
While we use a single linked list to maintain all devices, we could
use a static array to achieve the same goal, just like what we did
to maintain the eth devices with rte_eth_devices array. This could
simplifies the code a bit.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
VIRTIO_NET_F_GUEST_ANNOUNCE is a new feature introduced since kernel
v3.5. For older kernels (or more precisely, old distributions), we
could simply define it manually, to fix the "macro not defined" error.
Fixes: d293dac8f3 ("vhost: claim support of guest announce")
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Broadcast RARP packet by injecting it to receiving mbuf array at
rte_vhost_dequeue_burst().
Commit 33226236a3 ("vhost: handle request to send RARP") iterates
all host interfaces and then broadcast it by all of them. It did
notify the switches about the new location of the migrated VM, however,
the mac learning table in the target host is wrong (at least in my
test with OVS):
$ ovs-appctl fdb/show ovsbr0
port VLAN MAC Age
1 0 b6:3c:72:71:cd:4d 10
LOCAL 0 b6:3c:72:71:cd:4e 10
LOCAL 0 52:54:00:12:34:68 9
1 0 56:f6:64:2c:bc:c0 1
Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
above, the port learned is "LOCAL", which is the "ovsbr0" port. That
is reasonable, since we indeed send the pkt by the "ovsbr0" interface.
The wrong mac table lead all the packets to the VM go to the "ovsbr0"
in the end, which ends up with all packets being lost, until the guest
send a ARP quest (or reply) to refresh the mac learning table.
Jianfeng then came up with a solution I have thought of firstly but NAKed
by myself, concerning it has potential issues [0]. The solution is as title
stated: broadcast the RARP packet by injecting it to the receiving mbuf
arrays at rte_vhost_dequeue_burst(). The re-bring of that idea made me
think it twice; it looked like a false concern to me then. And I had done
a rough verification: it worked as expected.
[0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html
Another note is that while preparing this version, I found that DPDK has
some ARP related structures and macros defined. So, use them instead of
the one from standard header files here.
Cc: Thibaut Collet <thibaut.collet@6wind.com>
Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Malfunctioning virtio clients may not send VHOST_USER_SET_MEM_TABLE for
some reason. This causes NULL dereference in qva_to_vva().
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
The vhost_net_device_ops indirection is unnecessary because there is only
one implementation of the vhost common code.
Removing it makes the code more readable.
Signed-off-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
The common vhost code only supported a single mmap per device. vhost-user
worked around this by saving the address/length/fd of each mmap after the end
of the rte_virtio_memory struct. This only works if the vhost-user code frees
dev->mem, since the common code is unaware of the extra info. The
VHOST_USER_RESET_OWNER message is one situation where the common code frees
dev->mem and leaks the fds and mappings. This happens every time I shut down a
VM.
The new code calls back into the implementation (vhost-user or vhost-cuse) to
clean up these resources.
The vhost-cuse changes are only compile tested.
Signed-off-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
To claim that we support vhost-user live migration support:
SET_LOG_BASE request will be send only when this feature flag
is set.
Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL.
Which, however, has been enabled long time ago.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
While in former patch we enabled GUEST_ANNOUNCE feature, so that the
guest OS will broadcast a GARP message after migration to notify the
switch about the new location of migrated VM, the thing is that
GUEST_ANNOUNCE is enabled since kernel v3.5 only. For older kernel,
VHOST_USER_SEND_RARP request comes to rescue.
The payload of this new request is the mac address of the migrated VM,
with that, we could construct a RARP message, and then broadcast it
to host interfaces.
That's how this patch works:
- list all interfaces, with the help of SIOCGIFCONF ioctl command
- construct an RARP message and broadcast it
Cc: Thibaut Collet <thibaut.collet@6wind.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
It's actually a feature already enabled in Linux kernel (since v3.5).
What we need to do is simply to claim that we support such feature,
and nothing else.
With that, the guest will send an ARP message after live migration
to notify the switches about the new location of migrated VM.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
Every time we copy a buf to vring desc, we need to log it.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.
Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.
Every time we update virtio used ring, we need to log it. And it's
been done by a new vhost_log_write() wrapper, vhost_log_used_vring().
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.
This request introduces a new payload:
typedef struct VhostUserLog {
uint64_t mmap_size;
uint64_t mmap_offset;
} VhostUserLog;
Also, a fd is delivered from QEMU by ancillary data.
With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
Commit d0cf91303d added dependency on librte_net headers to vhost
but did not add this to the Makefile, which makes builds
non-deterministic. Curiously it is non-parallel build that is
consistently broken by this missing dependency, usually it's the other
way around, but trying to build without -j(n) fails with:
lib/librte_vhost/vhost_rxtx.c:41:20:
fatal error: rte_ip.h: No such file or directory
Fixes: d0cf91303d ("vhost: add Tx offload capabilities")
Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Add guest offload setting in vhost lib.
Virtio 1.0 spec (5.1.6.4 Processing of Incoming Packets) says:
1. If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
VIRTIO_NET_HDR_F_NEEDS_CSUM bit in flags can be set: if so,
the packet checksum at offset csum_offset from csum_start
and any preceding checksums have been validated. The checksum
on the packet is incomplete and csum_start and csum_offset
indicate how to calculate it (see Packet Transmission point 1).
2. If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
negotiated, then gso_type MAY be something other than
VIRTIO_NET_HDR_GSO_NONE, and gso_size field indicates the
desired MSS (see Packet Transmission point 2).
In order to support these features, the following changes are added,
1. Extend 'VHOST_SUPPORTED_FEATURES' macro to add the offload features negotiation.
2. Enqueue these offloads: convert some fields in mbuf to the fields in virtio_net_hdr.
There are more explanations for the implementation.
For VM2VM case, there is no need to do checksum, for we think the
data should be reliable enough, and setting VIRTIO_NET_HDR_F_NEEDS_CSUM
at RX side will let the TCP layer to bypass the checksum validation,
so that the RX side could receive the packet in the end.
In terms of us-vhost, at vhost RX side, the offload information is
inherited from mbuf, which is in turn inherited from TX side. If we
can still get those info at RX side, it means the packet is from
another VM at same host. So, it's safe to set the
VIRTIO_NET_HDR_F_NEEDS_CSUM, to skip checksum validation.
Signed-off-by: Jijiang Liu <jijiang.liu@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Add vhost TX offload (CSUM and TSO) support capabilities in vhost lib.
In order to support these features, and the following changes are added,
1. Extend 'VHOST_SUPPORTED_FEATURES' macro to add the offload features
negotiation.
2. Dequeue TX offload: convert the fileds in virtio_net_hdr to the
related fileds in mbuf.
Signed-off-by: Jijiang Liu <jijiang.liu@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
When guest is booting (or any othertime guest is busy) it is possible
for the small receive ring (256) to get full. If this happens the
vhost library should just return normally. It's current behavior
of logging just creates massive log spew/overflow which could even
act as a DoS attack against host.
Reported-by: Nathan Law <nlaw@brocade.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
So that we will not break ABI in future extension by adding few more
fields.
Struct vhost_virtqueue is reserved with 16 qwords (the later vhost-live
migration support would at least consume 3 of them), and struct virtio_net
is reserved with a bit more, 64 qwords, as there is only one instance for
a virtio nic instance.
Note that both reservation are not placed at the end of the struct, but
instead before the last field, since both the last field at the two struct
take a lot spaces. Putting the reservation after it would divide those
reserved fields to another cacheline. (we might need fix them in future, btw)
Suggested-by: Panu Matilainen <pmatilai@redhat.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Problem:if I firstly insert my kmod_test.ko, then insert eventfd_link.ko,
error will happen with hint "Device or resource busy". This is because
the default minor device number, 0, has been occupied by my kmod_test.ko .
root@distro:~/test$ lsmod
Module Size Used by
kmod_test 927 0
vboxsf 35930 4
vboxguest 222130 1 vboxsf
microcode 10315 0
autofs4 25051 0
root@distro:~/test$ insmod ./eventfd_link.ko
insmod: ERROR: could not insert module ./eventfd_link.ko: Device or
resource busy
Explanation: For miscdevices, the major device_no is same, so the minor
device_no should be set to ditinguish different misc devices; if not set
the minor, it may fail while insmod due to the default minor value, 0, has
been used by other miscdevice. MISC_DYNAMIC_MINOR means to let Linux
kernel dynamically assign one minor devide number while loading.
Signed-off-by: Xiaobo Chi <xiaobo.chi@nokia.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
The VHOST_USER_SET_VRING_ENABLE request was sent for each queue-pair.
However, it's changed to be sent per queue in the queue-pair by QEMU
commit dc3db6ad ("vhost-user: start/stop all rings"). The change
is reasonable, as we send all other request per queue, instead of
queue-pair.
Hence we should do proper changes to adapt to the QEMU change here.
Otherwise, a segfault will be triggered when last TX queue was enabled.
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
This patch fixes a bug under lower version linux kernel, mmap()
fails when length is not aligned with hugepage size. mmap()
without flag of MAP_ANONYMOUS, should be called with length
argument aligned with hugepagesz at older longterm version
Linux, like 2.6.32 and 3.2.72, or mmap() will fail with EINVAL.
This bug was fixed in Linux kernel by commit:
dab2d3dc45ae7343216635d981d43637e1cb7d45
To avoid failure, make sure in caller to keep length aligned.
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
The patch fixes reset_owner message handling not to clear callfd,
because callfd will be valid while connection is established.
Signed-off-by: Tetsuya Mukawa <mukawa@igel.co.jp>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Currently, we reset all fields of a device to zero when reset
happens, which is wrong, since for some fields like device_fh,
ifname, and virt_qp_nb, they should be same and be kept after
reset until the device is removed. And this is what's the new
helper function reset_device() for.
And use rte_zmalloc() instead of rte_malloc, so that we could
avoid init_device(), which basically dose zero reset only so far.
Hence, init_device() is dropped in this patch.
This patch also removes a hack of using the offset a specific
field (which is virtqueue now) inside of `virtio_net' structure
to do reset, which could be broken easily if someone changed the
field order without caution.
Cc: Tetsuya Mukawa <mukawa@igel.co.jp>
Cc: Huawei Xie <huawei.xie@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
QEMU sends VHOST_RESET_OWNER first when shutting down.
There was previously no way for the dataplane to know that the
virtio_net instance had become unusable and it would segfault
when trying to do RX/TX.
Signed-off-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Commit 15e9ee6982
uses the VIRTIO_F_VERSION_1 macro existing only in newer kernels.
Fixed it by manually defining it for older kernels.
Fixes: 15e9ee6982 ("vhost: enable virtio 1.0")
Reported-by: Qian Xu <qian.q.xu@intel.com>
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Make vhost-user virtio 1.0 compatible by adding it to the
supported features and keeping the header length
the same as for mergeable RX buffers.
Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>