170 Commits

Author SHA1 Message Date
Thomas Monjalon
f8e9cbe2aa mk: fix internal dependencies
Some libraries were missing their dependency on eal, mbuf, mempool,
ring and kvargs.
It is revealed by the linker option "-z defs".

Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>
2016-06-29 13:33:01 +02:00
Huawei Xie
2cdc118eef vhost: check hugepage fstat error
Value returned from fstat is not checked for errors before being used.
This patch fixes following coverity issue.

    static uint64_t
    get_blk_size(int fd)
    {
    	struct stat stat;

    	fstat(fd, &stat);
    	return (uint64_t)stat.st_blksize;
    >>>  CID 107103 (#1 of 1): Unchecked return value from library
         (CHECKED_RETURN)
    >>>  check_return: Calling fstat(fd, &stat) without checking
         return value.
    >>>  This library function may fail and return an error code.

Fixes: 8f972312b8f4 ("vhost: support vhost-user")

Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:47:12 +02:00
Ilya Maximets
428261b461 vhost: unmap log memory on cleanup
Fixes memory leak on QEMU migration.

Fixes: 54f9e32305d4 ("vhost: handle dirty pages logging request")

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:47:12 +02:00
Ilya Maximets
53af5b1e0a vhost: fix leak of file descriptors
While migration of vhost-user device QEMU allocates memfd
to store information about dirty pages and sends fd to
vhost-user process.

File descriptor for this memory should be closed to prevent
"Too many open files" error for vhost-user process after
some amount of migrations.

Ex.:
 # ls /proc/<ovs-vswitchd pid>/fd/ -alh
 total 0
 root qemu  .
 root qemu  ..
 root qemu  0 -> /dev/pts/0
 root qemu  1 -> pipe:[1804353]
 root qemu  10 -> socket:[1782240]
 root qemu  100 -> /memfd:vhost-log (deleted)
 root qemu  1000 -> /memfd:vhost-log (deleted)
 root qemu  1001 -> /memfd:vhost-log (deleted)
 root qemu  1004 -> /memfd:vhost-log (deleted)
 [...]
 root qemu  996 -> /memfd:vhost-log (deleted)
 root qemu  997 -> /memfd:vhost-log (deleted)

 ovs-vswitchd.log:
 |WARN|punix:ovs-vswitchd.ctl: accept failed: Too many open files

Fixes: 54f9e32305d4 ("vhost: handle dirty pages logging request")

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:47:12 +02:00
Marcin Kerlin
b497724652 vhost: fix null pointer dereference
Return value of function get_device() is not checking before
dereference. Fix this problem by adding checking condition.

Coverity issue: 119262

Fixes: 77d20126b4c2 ("vhost-user: handle message to enable vring")

Signed-off-by: Marcin Kerlin <marcinx.kerlin@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:47:12 +02:00
Huawei Xie
39449e7429 vhost: remove concurrent enqueue
All other DPDK PMDs doesn't support concurrent receiving or sending
packets to the same queue. The upper application should deal with
this, normally through queue and core bindings.

Due to historical reason, vhost internally supports concurrent lockless
enqueuing packets to the same virtio queue through costly cmpset operation.
This patch removes this internal lockless implementation and should improve
performance a bit.

Luckily DPDK OVS doesn't rely on this behavior.

Signed-off-by: Huawei Xie <huawei.xie@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:47:12 +02:00
Yuanhan Liu
a66bcad322 vhost: arrange struct fields for better cache sharing
The ifname[] field takes so much space, that it seperates some frequently
used fields into different caches, say, features and broadcast_rarp.

This patch moves all those fields that will be accessed frequently in Rx/Tx
together (before the ifname[] field) to let them share one cache line.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:47:12 +02:00
Yuanhan Liu
1d41d77cf8 vhost: optimize dequeue for small packets
A virtio driver normally uses at least 2 desc buffers for Tx: the
first for storing the header, and the others for storing the data.

Therefore, we could fetch the first data desc buf before the main
loop, and do the copy first before the check of "are we done yet?".
This could save one check for small packets that just have one data
desc buffer and need one mbuf to store it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:47:12 +02:00
Yuanhan Liu
7f74b95c44 vhost: pre update used ring for Tx and Rx
Pre update and update used ring in batch for Tx and Rx at the stage
while fetching all avail desc idx. This would reduce some cache misses
and hence, increase the performance a bit.

Pre update would be feasible as guest driver will not start processing
those entries as far as we don't update "used->idx". (I'm not 100%
certain I don't miss anything, though).

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:47:12 +02:00
Yuanhan Liu
0823c1cb0a vhost: workaround stale vring base
When DPDK app crashes (or quits, or gets killed), a restart of DPDK
app would get stale vring base from QEMU. That would break the kernel
virtio net completely, making it non-work any more, unless a driver
reset is done.

So, instead of getting the stale vring base from QEMU, Huawei suggested
we could get a much saner (and may not the most accurate) vring base
from used->idx. That would work because:

- there is a memory barrier between updating used ring entries and
  used->idx. So, even though we crashed at updating the used ring
  entries, it will not cause any issue, as the guest driver will not
  process those stale used entries, for used-idx is not updated yet.

- DPDK process vring in order, that means a crash may just lead some
  packet retransmission for Tx and drop for Rx.

Suggested-by: Huawei Xie <huawei.xie@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
2016-06-22 09:47:12 +02:00
Yuanhan Liu
e623e0c6d8 vhost: add reconnect ability
Allow reconnecting on failure by default when:

- DPDK app starts first and QEMU (as the server) is not started yet.
  Without reconnecting, DPDK app would simply fail on vhost-user
  registration.

- QEMU restarts, say due to OS reboot.
  Without reconnecting, you can't re-establish the connection without
  restarting DPDK app.

This patch make it work well for both above cases. It simply creates
a new thread, and keep trying calling "connect()", until it succeeds.

The reconnect could be disabled when RTE_VHOST_USER_NO_RECONNECT flag
is set.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:47:12 +02:00
Yuanhan Liu
64ab701c3d vhost: add vhost-user client mode
Add a new paramter (flags) to rte_vhost_driver_register(). DPDK
vhost-user acts as client mode when RTE_VHOST_USER_CLIENT flag
is set.  The flags would also allow future extensions without
breaking the API (again).

The rest is straingfoward then: allocate a unix socket, and
bind/listen for server, connect for client.

This extension is for vhost-user only, therefore we simply quit
and report error when any flags are given for vhost-cuse.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:47:07 +02:00
Yuanhan Liu
9ebcd4f9c7 vhost: rename structs for enabling client mode
DPDK vhost-user just acts as server so far, so, using a struct named
as "vhost_server" is okay. However, if we add client mode, it doesn't
make sense any more. Here renames it to "vhost_user_socket".

There was no obvious wrong about "connfd_ctx", but I think it's obviously
better to rename it to "vhost_user_connection", as it does represent
a connection, a connection between the backend (DPDK) and the frontend
(QEMU).

Similarly, few more renames are taken, such as "vserver_new_vq_conn"
to "vhost_user_new_connection".

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-06-22 09:44:21 +02:00
Ilya Maximets
7fd5dde987 vhost: make buffer vector for scatter Rx local
Array of buf_vector's is just an array for temporary storing information
about available descriptors. It used only locally in virtio_dev_merge_rx()
and there is no reason for that array to be shared.

Fix that by allocating local buf_vec inside virtio_dev_merge_rx().

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:44:21 +02:00
Yuanhan Liu
e197bd630a vhost: make virtio header length per device
Virtio net header length is set per device, but not per queue. So, there
is no reason to store it in vhost_virtqueue struct, instead, we should
store it in virtio_net struct, to make one copy only.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:44:20 +02:00
Yuanhan Liu
004b8ca8b5 vhost: reserve few more space for future extension
"virtio_net_device_ops" is the only left open struct that an application
can access, therefore, it's the only place that might introduce potential
ABI break in future for extension.

So, do some reservation for it. 5 should be pretty enough, considering
that we have barely touched it for a long while. Another reason to
choose 5 is for cache alignment: 5 makes the struct 64 bytes for 64 bit
machine.

With this, it's confidence to say that we might be able to be free from
the ABI violation forever.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:43:01 +02:00
Yuanhan Liu
f0fa04e35e vhost: remove virtio-net.h
It barely has anything useful there, just 2 functions prototype. Here
move them to vhost-net.h, and delete it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:43:01 +02:00
Yuanhan Liu
758e3471b4 vhost: remove unnecessary fields
The "reserved" field in virtio_net and vhost_virtqueue struct is not
necessary any more. We now expose virtio_net device with a number "vid".

This patch also removes the "priv" field: all fields are priviate now:
application can't access it now. The only way that we could still access
it is to expose it by a function, but I doubt that's needed or worthwhile.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:43:01 +02:00
Yuanhan Liu
db69be54b6 vhost: hide internal code
We are now safe to move all those internal structs/macros/functions to
vhost-net.h, to hide them from external access.

This patch also breaks long lines and removes some redundant comments.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:43:01 +02:00
Yuanhan Liu
4ecf22e356 vhost: export device id as the interface to applications
With all the previous prepare works, we are just one step away from
the final ABI refactoring. That is, to change current API to let them
stick to vid instead of the old virtio_net dev.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:42:57 +02:00
Yuanhan Liu
a67f286a65 vhost: export queue free entries
The new API rte_vhost_avail_entries() is actually a rename of
rte_vring_available_entries(), with the "vring" to "vhost" name
change to keep the consistency of other vhost exported APIs.

This change could let us avoid the dependency of "virtio_net"
struct, to prepare for the ABI refactoring.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:02:58 +02:00
Yuanhan Liu
f6d1bd5365 vhost: export interface name
Introduce a new API rte_vhost_get_ifname() to export the ifname to
application.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:01:30 +02:00
Yuanhan Liu
4b4af666b9 vhost: export number of queues
Introduce a new API rte_vhost_get_queue_num() to export the number of
queues.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:01:30 +02:00
Yuanhan Liu
586e390013 vhost: export numa node
Introduce a new API rte_vhost_get_numa_node() to get the numa node
from which the virtio_net struct is allocated.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:01:30 +02:00
Yuanhan Liu
adf97191b3 vhost: move cuse only struct to cuse
vhost cuse is now the last reference of the vhost_device_ctx struct;
move it there, and do a rename to "vhost_cuse_device_ctx", to make
it clear that it's "cuse only".

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:01:30 +02:00
Yuanhan Liu
319d362e3b vhost: get device by device id only
get_device() just needs vid, so pass vid as the parameter only.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:01:30 +02:00
Yuanhan Liu
e2a1dd1275 vhost: rename device id variable
I failed to figure out what does "fh" mean here for a long while.
The only guess I could have had is "file handle". So, you get the
point that it's not well named.

I then figured it out that "fh" is derived from the fuse lib, and
my above guess is right. However, device_fh represents a virtio
net device ID. Therefore, here I rename it to vid (Virtio-net device
ID, or Vhost device ID; choose one you prefer) to make it easier for
understanding.

This name (vid) then will be considered to the only interface to
applications. That's another reason to do the rename: it's our
interface, make it more understandable.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 09:01:25 +02:00
Yuanhan Liu
c08a349006 vhost: declare device id as int
device_fh repsents the device id for a specific virtio net device.
Firstly, "int" would be big enough: we don't need 64 bit. Secondly,
this could let us avoid the ugly "%" PRIu64 ".." stuff.

And since ctx.fh is derived from device_fh, declare it as int, too.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 08:59:54 +02:00
Yuanhan Liu
550c9d27d1 vhost: set/reset device flags internally
It does not make sense to ask the application to set/unset the flag
VIRTIO_DEV_RUNNING (that used internal only) at new_device()/
destroy_device() callback.

Instead, it should be set after new_device() succeeds and reset before
destroy_device() is invoked inside vhost lib. This patch fixes it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 06:10:54 +02:00
Yuanhan Liu
092f1c2c77 vhost: declare backend with int type
It's an fd; so define it as "int", which could also save the unncessary
(int) case.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Rich Lane <rich.lane@bigswitch.com>
Acked-by: Rich Lane <rich.lane@bigswitch.com>
2016-06-22 06:10:54 +02:00
Daniel Mrzyglod
a5e20775a7 vhost: fix name not null terminated
Fix issue reported by Coverity.
Coverity ID 124556

If the buffer is treated as a null terminated string in later operations,
a buffer overflow or over-read may occur.

In vhost_set_ifname: The string buffer may not have a null terminator if
the source string's length is equal to the buffer size

Fixes: 54292e9520e0 ("vhost: support ifname for vhost-user")

Signed-off-by: Daniel Mrzyglod <danielx.t.mrzyglod@intel.com>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-05-10 20:25:30 +02:00
Yuanhan Liu
71dc571efd vhost: fix error handling in destroy
Fix following coverity defect:

    291     void
    292     vhost_destroy_device(struct vhost_device_ctx ctx)
    293     {
    294             struct virtio_net *dev = get_device(ctx);
    295
    >>>     CID 124565:  Null pointer dereferences  (NULL_RETURNS)
    >>>     Dereferencing a null pointer "dev".

Fixes: 45ca9c6f7bc6 ("vhost: get rid of linked list for devices")

Reported-by: John McNamara <john.mcnamara@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-04-06 12:27:57 +02:00
Ilya Maximets
e994bcda55 vhost: use SMP barriers instead of compiler ones
Since commit 4c02e453cc62 ("eal: introduce SMP memory barriers") virtio
uses architecture dependent SMP barriers. vHost should use them too.

Fixes: 4c02e453cc62 ("eal: introduce SMP memory barriers")

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
2016-03-31 17:09:23 +02:00
Yuanhan Liu
cce3ce3567 vhost: remove unnecessary return
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-25 19:53:00 +01:00
Yuanhan Liu
b3869ebebf vhost: remove unnecessary memset when enqueueing
We have to reset the virtio net hdr at virtio_enqueue_offload()
before, due to all mbufs share a single virtio_hdr structure:

	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, }, 0};

	foreach (mbuf) {
		virtio_enqueue_offload(mbuf, &virtio_hdr.hdr);

		copy net hdr and mbuf to desc buf
	}

However, after the vhost rxtx refactor, the code looks like:

	copy_mbuf_to_desc(mbuf)
	{
		struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, }, 0}

		virtio_enqueue_offload(mbuf, &virtio_hdr.hdr);

		copy net hdr and mbuf to desc buf
	}

	foreach (mbuf) {
		copy_mbuf_to_desc(mbuf);
	}

Therefore, the memset at virtio_enqueue_offload() is not necessary
any more; remove it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
2016-03-17 21:53:06 +01:00
Tetsuya Mukawa
fb871d0a4d vhost: fix default value of kickfd and callfd
Currently, default values of kickfd and callfd are -1.
If the values are -1, current code guesses kickfd and callfd haven't
been initialized yet. Then vhost library will guess the virtqueue isn't
ready for processing.

But callfd and kickfd will be set as -1 when "--enable-kvm"
isn't specified in QEMU command line. It means we cannot treat -1 as
uninitialized state.

The patch defines -1 and -2 as VIRTIO_INVALID_EVENTFD and
VIRTIO_UNINITIALIZED_EVENTFD, and uses VIRTIO_UNINITIALIZED_EVENTFD for
the default values of kickfd and callfd.

Signed-off-by: Tetsuya Mukawa <mukawa@igel.co.jp>
Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-15 00:20:29 +01:00
Yuanhan Liu
a436f53ebf vhost: avoid dead loop chain
If a malicious guest forges a dead loop chain, it could lead to a dead
loop of copying the desc buf to mbuf, which results to all mbuf being
exhausted.

Add a var nr_desc to avoid such case.

Suggested-by: Huawei Xie <huawei.xie@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-15 00:07:32 +01:00
Yuanhan Liu
c687b0b635 vhost: check for ring descriptors overflow
A malicious guest may easily forge some illegal vring desc buf.
To make our vhost robust, we need make sure desc->next will not
go beyond the vq->desc[] array.

Suggested-by: Rich Lane <rich.lane@bigswitch.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-15 00:05:59 +01:00
Yuanhan Liu
623bc47054 vhost: do sanity check for ring descriptor length
We need make sure that desc->len is bigger than the size of virtio net
header, otherwise, unexpected behaviour might happen due to "desc_avail"
would become a huge number with for following code:

	desc_avail  = desc->len - vq->vhost_hlen;

For dequeue code path, it will try to allocate enough mbuf to hold such
size of desc buf, which ends up with consuming all mbufs, leading to no
free mbuf is available. Therefore, you might see an error message:

	Failed to allocate memory for mbuf.

Also, for both dequeue/enqueue code path, while it copies data from/to
desc buf, the big "desc_avail" would result to access memory not belong
the desc buf, which could lead to some potential memory access errors.

A malicious guest could easily forge such malformed vring desc buf. Every
time we restart an interrupted DPDK application inside guest would also
trigger this issue, as all huge pages are reset to 0 during DPDK re-init,
leading to desc->len being 0.

Therefore, this patch does a sanity check for desc->len, to make vhost
robust.

Reported-by: Rich Lane <rich.lane@bigswitch.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-15 00:03:46 +01:00
Yuanhan Liu
c252bcf9ec vhost: remove wrong unlikely prediction in Rx
VIRTIO_NET_F_MRG_RXBUF is a default feature supported by vhost.
Adding unlikely for VIRTIO_NET_F_MRG_RXBUF detection doesn't
make sense to me at all.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-14 23:59:47 +01:00
Yuanhan Liu
a98240621d vhost: remove rte_memcpy from header copy
First of all, rte_memcpy() is mostly useful for copying big packets
by leveraging hardware advanced instructions like AVX. But for virtio
net hdr, which is 12 bytes at most, invoking rte_memcpy() will not
introduce any performance boost.

And, to my suprise, rte_memcpy() is VERY huge. Since rte_memcpy()
is inlined, it increases the binary code size linearly every time
we call it at a different place. Replacing the two rte_memcpy()
with directly copy saves nearly 12K bytes of code size!

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-14 23:58:11 +01:00
Yuanhan Liu
932a00b85a vhost: refactor mergeable Rx
Current virtio_dev_merge_rx() implementation just looks like the
old rte_vhost_dequeue_burst(), full of twisted logic, that you
can see same code block in quite many different places.

However, the logic of virtio_dev_merge_rx() is quite similar to
virtio_dev_rx().  The big difference is that the mergeable one
could allocate more than one available entries to hold the data.
Fetching all available entries to vec_buf at once makes the
difference a bit bigger then.

The refactored code looks like below:

	while (mbuf_has_not_drained_totally || mbuf_has_next) {
		if (this_desc_has_no_room) {
			this_desc = fetch_next_from_vec_buf();

			if (it is the last of a desc chain)
				update_used_ring();
		}

		if (this_mbuf_has_drained_totally)
			mbuf = fetch_next_mbuf();

		COPY(this_desc, this_mbuf);
	}

This patch reduces quite many lines of code, therefore, make it much
more readable.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-14 23:56:41 +01:00
Yuanhan Liu
282a94ba99 vhost: refactor Rx
This is a simple refactor, as there isn't any twisted logic in old
code. Here I just broke the code and introduced two helper functions,
reserve_avail_buf() and copy_mbuf_to_desc() to make the code more
readable.

Also, it saves nearly 1K bytes of binary code size.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-14 23:55:06 +01:00
Yuanhan Liu
bc7f87a2c1 vhost: refactor dequeueing
The current rte_vhost_dequeue_burst() implementation is a bit messy
and logic twisted. And you could see repeat code here and there.

However, rte_vhost_dequeue_burst() acutally does a simple job: copy
the packet data from vring desc to mbuf. What's tricky here is:

- desc buff could be chained (by desc->next field), so that you need
  fetch next one if current is wholly drained.

- One mbuf could not be big enough to hold all desc buff, hence you
  need to chain the mbuf as well, by the mbuf->next field.

The simplified code looks like following:

	while (this_desc_is_not_drained_totally || has_next_desc) {
		if (this_desc_has_drained_totally) {
			this_desc = next_desc();
		}

		if (mbuf_has_no_room) {
			mbuf = allocate_a_new_mbuf();
		}

		COPY(mbuf, desc);
	}

Note that the old patch does a special handling for skipping virtio
header. However, that could be simply done by adjusting desc_avail
and desc_offset var:

	desc_avail  = desc->len - vq->vhost_hlen;
	desc_offset = vq->vhost_hlen;

This refactor makes the code much more readable (IMO), yet it reduces
binary code size.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-14 23:49:36 +01:00
Panu Matilainen
0d822b8047 mk: fix vhost shared library dependencies
Add DT_NEEDED entries for external library dependencies which
are the most critical ones for sane operation.
Clean up vhost_cuse CFLAGS/LDFLAGS confusion while at it.

Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
2016-03-13 20:27:26 +01:00
Yuanhan Liu
fd2fca6f6b vhost: fix queue pair reallocation
vq is allocated on pairs, hence we should do pair reallocation
at numa_realloc() as well, otherwise an error like following
occurs while do numa reallocation:

    VHOST_CONFIG: reallocate vq from 0 to 1 node
    PANIC in rte_free():
    Fatal error: Invalid memory

The reason we don't catch it is because numa_realloc() will
not take effect when RTE_LIBRTE_VHOST_NUMA is not enabled,
which is the default case.

Fixes: e049ca6d10e0 ("vhost-user: prepare multiple queue setup")

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
Tested-by: Ciara Loftus <ciara.loftus@intel.com>
2016-03-11 16:49:20 +01:00
Yuanhan Liu
78ffdaff06 vhost: simplify numa reallocation
We could first check if we need realloc vq or not, if so,
reallocate it. We then do similar to vhost dev realloc.

This could get rid of the tons of repeated "if (realloc_dev)"
and "if (realloc_vq)" statements, therefore, makes code
a bit more readable.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
2016-03-11 16:49:17 +01:00
Yuanhan Liu
45ca9c6f7b vhost: get rid of linked list for devices
While we use a single linked list to maintain all devices, we could
use a static array to achieve the same goal, just like what we did
to maintain the eth devices with rte_eth_devices array. This could
simplifies the code a bit.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Huawei Xie <huawei.xie@intel.com>
2016-03-11 16:46:18 +01:00
Yuanhan Liu
6fe390eed0 vhost: fix build with kernel < 3.5
VIRTIO_NET_F_GUEST_ANNOUNCE is a new feature introduced since kernel
v3.5. For older kernels (or more precisely, old distributions), we
could simply define it manually, to fix the "macro not defined" error.

Fixes: d293dac8f30e ("vhost: claim support of guest announce")

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-03-11 16:46:18 +01:00
Yuanhan Liu
bb66588304 vhost: broadcast RARP by injecting in receiving mbuf array
Broadcast RARP packet by injecting it to receiving mbuf array at
rte_vhost_dequeue_burst().

Commit 33226236a35e ("vhost: handle request to send RARP") iterates
all host interfaces and then broadcast it by all of them.  It did
notify the switches about the new location of the migrated VM, however,
the mac learning table in the target host is wrong (at least in my
test with OVS):

    $ ovs-appctl fdb/show ovsbr0
     port  VLAN  MAC                Age
        1     0  b6:3c:72:71:cd:4d   10
    LOCAL     0  b6:3c:72:71:cd:4e   10
    LOCAL     0  52:54:00:12:34:68    9
        1     0  56:f6:64:2c:bc:c0    1

Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
above, the port learned is "LOCAL", which is the "ovsbr0" port. That
is reasonable, since we indeed send the pkt by the "ovsbr0" interface.

The wrong mac table lead all the packets to the VM go to the "ovsbr0"
in the end, which ends up with all packets being lost, until the guest
send a ARP quest (or reply) to refresh the mac learning table.

Jianfeng then came up with a solution I have thought of firstly but NAKed
by myself, concerning it has potential issues [0]. The solution is as title
stated: broadcast the RARP packet by injecting it to the receiving mbuf
arrays at rte_vhost_dequeue_burst(). The re-bring of that idea made me
think it twice; it looked like a false concern to me then. And I had done
a rough verification: it worked as expected.

[0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html

Another note is that while preparing this version, I found that DPDK has
some ARP related structures and macros defined. So, use them instead of
the one from standard header files here.

Cc: Thibaut Collet <thibaut.collet@6wind.com>
Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
2016-02-29 16:55:30 +01:00