numam-dpdk

Author	SHA1	Message	Date
Maxime Coquelin	fb3815cc61	vhost: handle virtually non-contiguous buffers in Rx-mrg This patch enables the handling of buffers that are non-contiguous in the process virtual address space in the enqueue path when mergeable buffers are used. When the virtio-net header doesn't fit in a single chunk, it is computed in a local variable and copied to the buffer chunks afterwards. For packet content, the copy length is limited to the chunk size, and the next chunks' VAs are fetched afterwards. This issue has been assigned CVE-2018-1059. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-23 17:12:13 +02:00
Maxime Coquelin	6727f5a739	vhost: handle virtually non-contiguous buffers in Rx This patch enables the handling of buffers that are non-contiguous in the process virtual address space in the enqueue path when mergeable buffers aren't used. When the virtio-net header doesn't fit in a single chunk, it is computed in a local variable and copied to the buffer chunks afterwards. For packet content, the copy length is limited to the chunk size, and the next chunks' VAs are fetched afterwards. This issue has been assigned CVE-2018-1059. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-23 17:12:13 +02:00
Maxime Coquelin	91b7b40806	vhost: handle virtually non-contiguous buffers in Tx This patch enables the handling of buffers that are non-contiguous in the process virtual address space in the dequeue path. When the virtio-net header doesn't fit in a single chunk, it is copied into a local variable before being processed. For packet content, the copy length is limited to the chunk size, and the next chunks' VAs are fetched afterwards. This issue has been assigned CVE-2018-1059. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-23 17:12:13 +02:00
Maxime Coquelin	d0c24508e1	vhost: add support for non-contiguous indirect descs tables This patch adds support for non-contiguous indirect descriptor tables in VA space. When it happens, which is unlikely, a table is allocated and the non-contiguous content is copied into it. This issue has been assigned CVE-2018-1059. Reported-by: Yongji Xie <xieyongji@baidu.com> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-23 16:04:30 +02:00
Maxime Coquelin	070aceda33	vhost: check all range is mapped when translating GPAs There is currently no check done on the length when translating guest addresses into host virtual addresses. Also, there is no guarantee that the guest address range is contiguous in the host virtual address space. This patch prepares vhost_iova_to_vva() and its callers to return and check the mapped size. If the mapped size is smaller than the requested size, the caller handles it as an error. This issue has been assigned CVE-2018-1059. Reported-by: Yongji Xie <xieyongji@baidu.com> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-23 16:04:30 +02:00
Maxime Coquelin	c6ae7de0de	vhost: fix indirect descriptors table translation size This patch fixes the size passed at indirect descriptor table translation time, which must be the len field of the descriptor rather than the size of a single descriptor. This issue has been assigned CVE-2018-1059. Fixes: `62fdb8255a` ("vhost: use the guest IOVA to host VA helper") Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-04-23 16:04:30 +02:00
Tomasz Kulasek	06fc115977	vhost: fix log macro name conflict LOG_DEBUG is a symbol defined by POSIX, so if sys/log.h is included the symbols conflict. This patch changes LOG_DEBUG to VHOST_LOG_DEBUG. Fixes: `1c01d52392` ("vhost: add debug print") Cc: stable@dpdk.org Signed-off-by: Ben Walker <benjamin.walker@intel.com> Signed-off-by: Tomasz Kulasek <tomaszx.kulasek@intel.com> Reviewed-by: Jianfeng Tan <jianfeng.tan@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-03-30 14:08:42 +02:00
Stefan Hajnoczi	1c717af4c6	vhost: add flag for built-in virtio driver The librte_vhost API is used in two ways: 1. As a vhost net device backend via rte_vhost_enqueue/dequeue_burst(). 2. As a library for implementing vhost device backends. There is no distinction between the two at the API level or in the librte_vhost implementation. For example, device state is kept in "struct virtio_net" regardless of whether this is actually a net device backend or whether the built-in virtio_net.c driver is in use. The virtio_net.c driver should be a librte_vhost API client just like the vhost-scsi code and have no special access to vhost.h internals. Unfortunately, fixing this requires significant librte_vhost API changes. This patch takes a different approach: keep the librte_vhost API unchanged but track whether the built-in virtio_net.c driver is in use. See the next patch for a bug fix that requires knowledge of whether virtio_net.c is in use. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-02-05 15:11:07 +01:00
Victor Kaplansky	a368804699	vhost: protect active rings from async ring changes When performing live migration or memory hot-plugging, the changes made by the message handler to the device and vrings are done independently of vring usage by PMD threads. This causes, for example, segfaults during live migration with MQ enabled, but in general virtually any request sent by qemu changing the state of the device can cause problems. These patches fix all the above issues by adding a spinlock to every vring and requiring the message handler to start an operation only after ensuring that all PMD threads related to the device are out of the critical section accessing the vring data. Each vring has its own lock in order to not create contention between PMD threads of different vrings and to prevent performance degradation when scaling the queue pair number. See https://bugzilla.redhat.com/show_bug.cgi?id=1450680 Cc: stable@dpdk.org Signed-off-by: Victor Kaplansky <victork@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-01-21 15:51:52 +01:00
Junjie Chen	3ebd930588	vhost: fix mbuf free dequeue zero copy Changing buf_addr and buf_iova of an mbuf and returning it to the mbuf pool without restoring them breaks VM memory if others allocate mbufs from the same pool, since mbuf reset doesn't reset buf_addr and buf_iova. Fixes: `b0a985d1f3` ("vhost: add dequeue zero copy") Cc: stable@dpdk.org Signed-off-by: Junjie Chen <junjie.j.chen@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-01-21 15:51:52 +01:00
Xiao Wang	c09141e56f	net: fix RARP generation Due to a mistaken operation on my part, an older version (v10) was merged to the master branch; v11 is the one that should have been applied. However, the master branch is not rebase-able. Thus, this patch was made from the diff between v10 and v11. The diffs are: - Add check for parameter and tailroom in rte_net_make_rarp_packet - Allocate mbuf in rte_net_make_rarp_packet Besides that, a link error is fixed when the shared lib is enabled. Fixes: `45ae05df82` ("net: add a helper for making RARP packet") Fixes: `c3ffdba0e8` ("vhost: use API to make RARP packet") Signed-off-by: Xiao Wang <xiao.w.wang@intel.com> Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-01-21 15:51:52 +01:00
Xiao Wang	c3ffdba0e8	vhost: use API to make RARP packet Signed-off-by: Xiao Wang <xiao.w.wang@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2018-01-16 18:47:49 +01:00
Junjie Chen	e37ff95440	vhost: support virtqueue interrupt/notification suppression The driver can suppress interrupts when the VIRTIO_F_EVENT_IDX feature bit is negotiated. The driver sets the vring flags to 0, and MAY use used_event in the available ring to advise the device not to interrupt until an index specified by used_event is reached. The device ignores the lower bit of the vring flags, and sends an interrupt when the index reaches used_event. The device can suppress notifications in a manner analogous to the way the driver suppresses interrupts. The device manipulates flags or avail_event in the used ring in the same way the driver manipulates flags or used_event in the available ring. Signed-off-by: Junjie Chen <junjie.j.chen@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Lei Yao <lei.a.yao@intel.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-01-16 18:47:49 +01:00
Jiayu Hu	6d18505efa	vhost: support UDP Fragmentation Offload In virtio, UDP Fragmentation Offload (UFO) includes two parts: host UFO and guest UFO. Guest UFO means the frontend can receive large UDP packets, and host UFO means the backend can receive large UDP packets. This patch supports host UFO and guest UFO for vhost-user. Signed-off-by: Jiayu Hu <jiayu.hu@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Lei Yao <lei.a.yao@intel.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-01-16 18:47:49 +01:00
Stefan Hajnoczi	413a8fee30	vhost: add vring call helper Extract the callfd eventfd signal operation so virtio_net.c does not have to repeat it multiple times. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-01-16 18:47:49 +01:00
Junjie Chen	803aeecef1	vhost: fix dequeue zero copy with virtio1 This fixes dequeue zero copy, which does not work with Qemu version >= 2.7. Since Qemu 2.7, the virtio device uses the virtio-1 protocol, and the zero copy code path forgot to add the offset to the buffer address. Fixes: `b0a985d1f3` ("vhost: add dequeue zero copy") Cc: stable@dpdk.org Signed-off-by: Junjie Chen <junjie.j.chen@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2018-01-16 18:47:49 +01:00
Bruce Richardson	369991d997	lib: use SPDX tag for Intel copyright files Replace the BSD license header with the SPDX tag for files with only an Intel copyright on them. Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>	2018-01-04 22:41:39 +01:00
Santosh Shukla	455da54539	mbuf: rename physical address to IOVA Rename buf_physaddr to buf_iova. Keep the deprecated name in an anonymous union to avoid breaking the API. Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com> Signed-off-by: Thomas Monjalon <thomas@monjalon.net> Acked-by: Olivier Matz <olivier.matz@6wind.com>	2017-11-06 22:44:26 +01:00
Tiwei Bie	1d8161ba02	vhost: fix dequeue offload support When offload is enabled, vhost needs to access the first mbuf to get the packet info, e.g. TCP header. So we couldn't delay the data copy in this case. Fixes: `e5c494a7a2` ("vhost: batch small guest memory copies") Reported-by: Lei Yao <lei.a.yao@intel.com> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2017-10-24 21:31:23 +02:00
Maxime Coquelin	eefac9536a	vhost: postpone device creation until rings are mapped Translating the start addresses of the rings is not enough; we need to be sure the whole ring is made available by the guest. It depends on the size of the rings, which is not known on SET_VRING_ADDR reception. Furthermore, we need to be safe against vring page invalidations. This patch introduces a new access_ok flag per virtqueue, which is set when all the rings are mapped, and cleared as soon as a page used by a ring is invalidated. The invalidation part is implemented in a following patch. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2017-10-10 15:52:27 +02:00
Maxime Coquelin	62fdb8255a	vhost: use the guest IOVA to host VA helper Replace rte_vhost_gpa_to_vva() calls with vhost_iova_to_vva(), which also requires passing the mapped len and the access permissions needed. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2017-10-10 15:52:27 +02:00
Maxime Coquelin	25bf7a0b09	vhost: make error handling consistent in Rx path In the non-mergeable receive case, when the copy_mbuf_to_desc() call fails, the packet is skipped, the corresponding used element len field is set to the vnet header size, and it continues with the next packet/desc. This could be a problem because it does not know why it failed, and assumes the desc buffer is large enough. In the mergeable receive case, when copy_mbuf_to_desc_mergeable() fails, the packet burst is simply stopped. This patch makes the non-mergeable error path behave like the mergeable one, as it seems the safest way. Doing it this way will also simplify the handling of pending IOTLB miss requests. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2017-10-10 15:52:27 +02:00
Tiwei Bie	e5c494a7a2	vhost: batch small guest memory copies This patch adaptively batches the small guest memory copies. By batching the small copies, the efficiency of executing the memory LOAD instructions can be improved greatly, because the memory LOAD latency can be effectively hidden by the pipeline. We saw great performance boosts in the small-packet PVP test. This patch improves the performance for small packets, and distinguishes the packets by size. So although the performance for big packets doesn't change, it makes it relatively easy to do some special optimizations for the big packets too. Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com> Acked-by: Yuanhan Liu <yliu@fridaylinux.org>	2017-10-10 15:48:53 +02:00
Ivan Dyukov	c665d9a231	vhost: fix checking of device features To compare enabled features in the current device, we must use a bit mask instead of a bit position. Fixes: `c843af3aa1` ("vhost: access header only if offloading is supported") Cc: stable@dpdk.org Signed-off-by: Ivan Dyukov <i.dyukov@samsung.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2017-07-02 01:38:39 +02:00
Jianfeng Tan	b08b8cfeb2	vhost: fix IP checksum There is no way to bypass IP checksum verification in the Linux kernel, no matter whether skb->ip_summed is assigned CHECKSUM_UNNECESSARY or CHECKSUM_PARTIAL. So any packets with a bad IP checksum will be dropped at the VM IP layer. To correct this, we check the PKT_TX_IP_CKSUM flag and calculate the IP checksum. Fixes: `859b480d5a` ("vhost: add guest offload setting") Cc: stable@dpdk.org Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2017-07-02 01:28:34 +02:00
Jianfeng Tan	46b7a8372d	vhost: fix TCP checksum As the PKT_TX_TCP_SEG flag in mbuf->ol_flags implies PKT_TX_TCP_CKSUM, applications, e.g., testpmd, don't set PKT_TX_TCP_CKSUM when TSO is set. This leads to packets being dropped in the VM TCP stack layer because of a bad TCP csum. To fix this, we make sure the TCP NEEDS_CSUM info is set in the virtio net header when PKT_TX_TCP_SEG is set, so that the VM TCP stack will not check the TCP csum. Fixes: `859b480d5a` ("vhost: add guest offload setting") Cc: stable@dpdk.org Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2017-07-02 01:28:22 +02:00
Jerin Jacob	98a7ea332b	fix typos using codespell utility Fixing typos across dpdk source code using codespell utility. Skipped the ethdev driver's base code fixes to keep the base code intact. Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Acked-by: John McNamara <john.mcnamara@intel.com>	2017-06-14 23:54:13 +02:00
Jerin Jacob	c0583d98a9	eal: introduce macro for always inline Different drivers use internal macros like force_inline for compiler always inline feature. Standardizing it through __rte_always_inline macro. Verified the change by comparing the output binary file. No difference found in the output binary file with this change. Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com>	2017-06-06 17:21:55 +02:00
Yuanhan Liu	84ad6e4491	vhost: fix dequeue zero copy For zero copy mode, we need to pin the mbuf so that the underlying PMD driver (or the app) cannot free it. Currently, only the head mbuf is pinned. However, the mbuf free function would try to free all mbufs in the mbuf chain (-1 to the refcnt). This may leave the head mbuf still pinned, while the other subsequent mbufs are actually freed, which is wrong. It becomes more fatal after the mbuf refactor, more specifically, after the commit `8f094a9ac5` ("mbuf: set mbuf fields while in pool"). The refcnt resets to 1 after the last real reference. OTOH, it leads to a situation where we never know whether an mbuf is actually freed or not. This would result in the mbuf __just__ after the head mbuf being freed twice: it is first freed (and put back to the mempool) when the underlying PMD finishes the DMA. Later, it will then be freed again when vhost unpins it. Meaning, one mbuf may be returned to the mempool twice and, in turn, be allocated twice later. Something uncertain may happen then. For example, the VM2VM case becomes broken. Fixes: `b0a985d1f3` ("vhost: add dequeue zero copy") Cc: stable@dpdk.org Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2017-04-19 10:49:06 +02:00
Yuanhan Liu	cca5c0c008	vhost: avoid memory write on net header when necessary As we did for the virtio PMD driver [0][1], we can apply the same trick to vhost, to avoid the memory write on the net header when possible. [0]: `c9ea670c1d` ("net/virtio: fix performance regression due to TSO") [1]: `16994abee2` ("net/virtio: optimize header reset on any layout") With this, the cache issue of the mergeable path is again greatly reduced: even the write of "num_buffers" could be avoided. A quick PVP test shows the gap between the mergeable Rx and non-mergeable Rx is pretty small now: they are basically the same in my test. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2017-04-19 10:49:06 +02:00
Stephen Hemminger	c5ba278876	lib: remove unnecessary void cast Remove unnecessary casts of void * pointers to a specific type. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>	2017-04-11 18:05:10 +02:00
Yuanhan Liu	a798beb47c	vhost: rename header file Rename "rte_virtio_net.h" to "rte_vhost.h", to not let it be virtio net specific. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2017-04-01 10:42:44 +02:00
Yuanhan Liu	52f8091f05	vhost: export APIs for live migration support Export a few APIs for the vhost-user driver to log the guest memory writes, which is a must for live migration support. This patch basically moves vhost_log_write() and vhost_log_used_vring() into vhost.h and then adds a wrapper (the public API) to them. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2017-04-01 10:42:44 +02:00
Yuanhan Liu	ab4d7b9f1a	vhost: turn queue pair to vring The queue pair is very virtio-net specific; other devices don't have such a concept. To make it generic, we should log the number of vrings instead of the number of queue pairs. This patch just does a simple conversion; a later patch will export the number of vrings to applications. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2017-04-01 10:42:44 +02:00
Yuanhan Liu	dba9bf127b	vhost: export API to translate gpa to vva Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2017-04-01 10:42:44 +02:00
Kevin Traynor	4e9474141e	vhost: fix false sharing The broadcast_rarp field in the virtio_net struct is checked in the dequeue datapath regardless of whether descriptors are available or not. As it is checked with cmpset leading to a write, false sharing on the virtio_net struct can happen between enqueue and dequeue datapaths regardless of whether a RARP is requested. In OVS, the issue can cause a uni-directional performance drop of up to 15%. Fix that by only performing the cmpset if a read of broadcast_rarp indicates that the cmpset is likely to succeed. Fixes: `a66bcad322` ("vhost: arrange struct fields for better cache sharing") Cc: stable@dpdk.org Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2017-04-01 10:36:17 +02:00
Emmanuel Roullit	5c1f70daaf	vhost: do not GSO when no header is present Found with clang static analysis: lib/librte_vhost/virtio_net.c:723:17: warning: Access to field 'data_off' results in a dereference of a null pointer (loaded from variable 'tcp_hdr') m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2; ^~~~~~~~~~~~~~~~~ Fixes: `d0cf91303d` ("vhost: add Tx offload capabilities") Cc: stable@dpdk.org Signed-off-by: Emmanuel Roullit <emmanuel.roullit@gmail.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2017-01-30 13:46:57 +01:00
Yuanhan Liu	cc7301908c	vhost: fix dead loop in enqueue path If a malicious guest forges a dead loop desc chain (letting desc->next point to itself) and desc->len is zero, this could lead to a dead loop in copy_mbuf_to_desc (the following simplified code shows the issue clearly): while (mbuf_is_not_totally_consumed) { if (desc_avail == 0) { desc = &descs[desc->next]; desc_avail = desc->len; } COPY(desc, mbuf, desc_avail); } I have actually fixed the same issue before: commit `a436f53ebf` ("vhost: avoid dead loop chain"); it fixes the dequeue path though, leaving the enqueue path still vulnerable. The fix is the same: add a var nr_desc to avoid the dead loop. Fixes: `f1a519ad98` ("vhost: fix enqueue/dequeue to handle chained vring descriptors") Cc: stable@dpdk.org Reported-by: Xieming Katty <katty.xieming@huawei.com> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2017-01-28 14:25:23 +01:00
Maxime Coquelin	2a51b1091c	vhost: support indirect descriptor in non-mergeable Rx Linux virtio-net kernel driver uses indirect descriptors when mergeable buffers are not used. This patch adds its support, fixing the use of indirect descriptors with these guests. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2016-10-26 13:39:09 +02:00
Maxime Coquelin	1be4ebb1c4	vhost: support indirect descriptor in mergeable Rx Windows virtio-net driver uses indirect descriptors with mergeable buffers. This patch adds its support, fixing the use of indirect descriptors with these guests. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2016-10-26 13:39:09 +02:00
Yuanhan Liu	f2f5bc8f30	vhost: retrieve available head once There is no need to retrieve the latest avail head every time we enqueue a packet in the mergeable Rx path by `avail_idx = *((volatile uint16_t *)&vq->avail->idx);` Instead, we could just retrieve it once at the beginning of the enqueue path. This could diminish the cache penalty slightly, because the virtio driver could be updating it while vhost is reading it (for each packet). Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2016-10-26 13:39:09 +02:00
Yuanhan Liu	45847f015d	vhost: prefetch available ring Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2016-10-26 13:39:09 +02:00
Zhihong Wang	f689586bc0	vhost: shadow used ring update The basic idea is to shadow the used ring update: write the updates into a local buffer first, and then flush them all to the virtio used vring at once at the end. Since we do avail ring reservation before enqueuing data, we know which and how many descs will be used, which means we can update the shadow used ring at reservation time. It also introduces another slight advantage: we no longer need to access desc->flag inside copy_mbuf_to_desc_mergeable(). Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2016-10-26 13:39:09 +02:00
Yuanhan Liu	fcdbe1fe1a	vhost: use last available index for ring reservation shadow_used_ring will be introduced later. Once it is, the last avail idx will no longer be updated together with the last used idx. So, here we use last_avail_idx for avail ring reservation. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2016-10-26 13:39:09 +02:00
Yuanhan Liu	3f9e48f7da	vhost: simplify mergeable Rx vring reservation Let it return the "num_buffers" we reserved, so that we can reuse it with copy_mbuf_to_desc_mergeable() directly instead of calculating it again there. Meanwhile, the return type of copy_mbuf_to_desc_mergeable is changed to "int"; -1 is returned on error. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2016-10-26 13:39:09 +02:00
Zhihong Wang	e7ef688562	vhost: optimize cache access This patch reorders the code to delay the virtio header write to improve cache access efficiency for cases where the mrg_rxbuf feature is turned on. CPU pipeline stall cycles can be significantly reduced. The virtio header write and the mbuf data copy are both remote store operations, which take a long time to finish. It's a good idea to put them together to remove bubbles in between, to let as many remote store instructions as possible go into the store buffer at the same time to hide latency, and to let the H/W prefetcher go to work as early as possible. On a Haswell machine, about 100 cycles can be saved per packet by this patch alone. Taking 64B packet traffic as an example, this means about 60% efficiency improvement for the enqueue operation. Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>	2016-10-26 13:39:09 +02:00
Maxime Coquelin	c843af3aa1	vhost: access header only if offloading is supported If offloading features are not negotiated, parsing the virtio header is not needed. Micro-benchmark with testpmd shows that the gain is +4% with indirect descriptors, +1% when using direct descriptors. Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2016-10-26 13:39:09 +02:00
Zhihong Wang	f46f655143	vhost: fix Windows VM hang This patch fixes a Windows VM compatibility issue in the DPDK 16.07 vhost code, which causes the guest to hang once any packets are enqueued while mrg_rxbuf is turned on, by setting the right id and len in the used ring. As defined in virtio spec 0.95 and 1.0, in each used ring element, id means the index of the start of the used descriptor chain, and len means the total length of the descriptor chain that was written to. In the 16.07 code, however, the index of the last descriptor is assigned to id, and the length of the last descriptor is assigned to len. How to test? 1. Start testpmd in the host with a vhost port. 2. Start a Windows VM image with qemu and connect to the vhost port. 3. Start io forwarding with tx_first in host testpmd. With the 16.07 code, the Windows VM will hang once any packets are enqueued. Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> Acked-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>	2016-10-13 10:29:31 +02:00
Yuanhan Liu	b0a985d1f3	vhost: add dequeue zero copy The basic idea of dequeue zero copy is that, instead of copying data from the desc buf, we let the mbuf reference the desc buf addr directly. Doing so, however, has one major issue: we can't update the used ring at the end of rte_vhost_dequeue_burst. Because we don't do the copy here, an update of the used ring would let the driver reclaim the desc buf. As a result, DPDK might reference a stale memory region. To update the used ring properly, this patch does several tricks: - When an mbuf references a desc buf, its refcnt is incremented by 1. This pins the mbuf, so that an mbuf free from DPDK won't actually free it; instead, refcnt is decremented by 1. - We chain all those mbufs together (in a tailq) and check them on every rte_vhost_dequeue_burst entry, to see whether the mbuf is freed (when refcnt equals 1). If that happens, it means we are the last user of this mbuf and it is safe to update the used ring. - "struct zcopy_mbuf" is introduced, to associate an mbuf with the right desc idx. Dequeue zero copy is introduced for performance reasons, and some rough tests show about a 50% performance boost for 1500B packets. For small packets (e.g. 64B), it actually slows things down a bit (by up to 15%). That is expected because this patch introduces some extra work, which outweighs the benefit of saving a few bytes of copying. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Qian Xu <qian.q.xu@intel.com>	2016-10-12 09:45:14 +02:00
Yuanhan Liu	f6be82d725	vhost: introduce last available index for dequeue So far, we retrieve both the used ring and avail ring idx through the var last_used_idx; this isn't a problem because the used ring is updated immediately after those avail entries are consumed. That is not true when dequeue zero copy is enabled, where the used ring is updated only when the mbuf is consumed. Thus, we need another var to track the last avail ring idx we have consumed. Therefore, last_avail_idx is introduced. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Qian Xu <qian.q.xu@intel.com>	2016-10-12 09:45:12 +02:00


52 Commits
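The CVE-2018-1059 series (070aceda33, 6727f5a739, fb3815cc61, 91b7b40806) rests on two ideas. First, a guest-address translation must report how many bytes are contiguously mapped. Second, a copy must then proceed one contiguous chunk at a time. Below is a minimal C sketch of both ideas under simplifying assumptions. `struct mem_region`, `translate()` and `copy_to_guest()` are hypothetical names, not the librte_vhost API, and the region lookup is a plain linear scan.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest memory region: guest addresses [gpa, gpa + size)
 * are backed by host memory at host_base. Guest-adjacent regions need
 * not be adjacent in the host virtual address space. */
struct mem_region {
    uint64_t gpa;
    uint64_t size;
    uint8_t *host_base;
};

/* Translate gpa to a host pointer. On input *len is the requested length;
 * on output it is clamped to the number of bytes contiguously mapped from
 * gpa. Returns NULL (and *len = 0) if gpa is not mapped at all. */
static uint8_t *translate(const struct mem_region *regs, size_t nregs,
                          uint64_t gpa, uint64_t *len)
{
    for (size_t i = 0; i < nregs; i++) {
        const struct mem_region *r = &regs[i];
        if (gpa >= r->gpa && gpa - r->gpa < r->size) {
            uint64_t off = gpa - r->gpa;
            if (*len > r->size - off)
                *len = r->size - off;
            return r->host_base + off;
        }
    }
    *len = 0;
    return NULL;
}

/* Copy len bytes from src to guest address gpa, one contiguous chunk at
 * a time. Returns 0 on success, -1 if part of the range is unmapped (bytes
 * before the hole may already have been written). */
static int copy_to_guest(const struct mem_region *regs, size_t nregs,
                         uint64_t gpa, const uint8_t *src, uint64_t len)
{
    while (len > 0) {
        uint64_t chunk = len;
        uint8_t *dst = translate(regs, nregs, gpa, &chunk);
        if (dst == NULL || chunk == 0)
            return -1;
        memcpy(dst, src, (size_t)chunk);
        gpa += chunk;
        src += chunk;
        len -= chunk;
    }
    return 0;
}
```

A caller that needs a whole structure in one piece, such as a ring, would instead treat a clamped `*len` as an error. That is the check 070aceda33 describes.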
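a368804699 adds one spinlock per vring. Datapath threads hold it while touching the vring, and the vhost-user message handler acquires it before changing vring state. The sketch below models that split with a small C11 spinlock; all names are hypothetical. Having the datapath back off with a trylock instead of spinning is a choice made by this sketch, not something the commit message specifies.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal spinlock on a C11 atomic_int: 0 = free, 1 = held. */
typedef struct { atomic_int locked; } vq_lock;

static int vq_trylock(vq_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(&l->locked, &expected, 1,
            memory_order_acquire, memory_order_relaxed);
}

static void vq_lock_acquire(vq_lock *l)
{
    while (!vq_trylock(l))
        ;   /* spin */
}

static void vq_unlock(vq_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

struct vring_state {
    vq_lock access_lock;   /* one lock per vring: no cross-queue contention */
    unsigned processed;
};

static void vring_state_init(struct vring_state *vq)
{
    atomic_init(&vq->access_lock.locked, 0);
    vq->processed = 0;
}

/* Datapath: a polling thread only ever takes the lock of its own vring. */
static unsigned vring_burst(struct vring_state *vq, unsigned n)
{
    if (!vq_trylock(&vq->access_lock))
        return 0;          /* ring is being reconfigured: skip this round */
    vq->processed += n;    /* stand-in for real descriptor processing */
    vq_unlock(&vq->access_lock);
    return n;
}

/* Control path: hold every vring lock while changing device state, so no
 * datapath thread is inside a vring critical section meanwhile. */
static void lock_all_vrings(struct vring_state *vqs, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        vq_lock_acquire(&vqs[i].access_lock);
}

static void unlock_all_vrings(struct vring_state *vqs, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        vq_unlock(&vqs[i].access_lock);
}
```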
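With event-index suppression negotiated (e37ff95440), deciding whether to notify reduces to one modular comparison on 16-bit ring indices. The virtio 1.0 specification gives this as `vring_need_event()`. The standalone version below follows that formula with the spec's parameter order (event, new, old). It is shown for illustration and is not copied from DPDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the other side asked to be notified when the ring index
 * moved from old_idx to new_idx, given the event index it published
 * (used_event or avail_event). The casts make the arithmetic modulo 2^16,
 * so the test stays correct when the indices wrap around. */
static inline bool vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                    uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) <
           (uint16_t)(new_idx - old_idx);
}
```

Put differently, a notification is sent exactly when `event_idx` lies in the half-open window `[old_idx, new_idx)`, counted modulo 2^16.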
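c665d9a231 fixes a check that used a feature's bit position where a bit mask was needed. The hypothetical helpers below (not DPDK functions) show how the two checks differ. The two defines use the standard virtio-net feature bit positions.

```c
#include <stdbool.h>
#include <stdint.h>

/* virtio-net feature bit positions, per the virtio specification. */
#define VIRTIO_NET_F_CSUM       0
#define VIRTIO_NET_F_HOST_TSO4 11

/* Wrong: ANDs the feature word with the bit position itself. */
static bool has_feature_buggy(uint64_t features, unsigned bit)
{
    return (features & bit) != 0;
}

/* Right: convert the position into a mask first. */
static bool has_feature(uint64_t features, unsigned bit)
{
    return (features & (1ULL << bit)) != 0;
}
```

With position 0 the buggy check can never succeed, because `features & 0` is always 0. With position 11 (binary 1011) it actually tests bits 0, 1 and 3.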
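b08b8cfeb2 makes vhost compute the IPv4 header checksum itself when PKT_TX_IP_CKSUM is set, because the guest kernel always verifies it. DPDK already provides `rte_ipv4_cksum()` for this. The portable sketch below instead spells out the RFC 1071 arithmetic on a raw header byte array, so it assumes nothing about DPDK structures; both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 one's-complement checksum over an IPv4 header of hdr_len
 * bytes (IHL * 4, so always even). Returns the value in host order. */
static uint16_t ipv4_hdr_cksum(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < hdr_len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];   /* big-endian words */
    while (sum >> 16)                                   /* fold the carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Zero the checksum field (bytes 10-11), compute the checksum, and store
 * it back in network byte order. */
static void ipv4_fill_cksum(uint8_t *hdr, size_t hdr_len)
{
    hdr[10] = 0;
    hdr[11] = 0;
    uint16_t c = ipv4_hdr_cksum(hdr, hdr_len);
    hdr[10] = (uint8_t)(c >> 8);
    hdr[11] = (uint8_t)(c & 0xff);
}
```

A convenient self-check: running `ipv4_hdr_cksum()` over a header whose checksum is already correct yields 0.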
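46b7a8372d sets the NEEDS_CSUM flag in the virtio-net header whenever TSO is requested, because applications such as testpmd do not also set PKT_TX_TCP_CKSUM. A minimal sketch of that rule follows. The header field layout, the NEEDS_CSUM value (1) and the TCP checksum offset (16) come from the virtio and TCP specifications. The offload flag bits and `fill_csum_hint()` are hypothetical stand-ins for mbuf `ol_flags` and the DPDK code.

```c
#include <stdint.h>

/* Legacy virtio-net header layout, per the virtio specification. */
struct virtio_net_hdr_sketch {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
};
#define NET_HDR_F_NEEDS_CSUM 1   /* VIRTIO_NET_HDR_F_NEEDS_CSUM */

/* Hypothetical offload request bits standing in for mbuf ol_flags. */
#define TX_TCP_CKSUM (1u << 0)
#define TX_TCP_SEG   (1u << 1)

#define TCP_CKSUM_OFFSET 16      /* checksum field offset within a TCP header */

static void fill_csum_hint(struct virtio_net_hdr_sketch *h, uint32_t ol_flags,
                           uint16_t l2_len, uint16_t l3_len)
{
    /* TSO implies TCP checksum offload, so either flag means "the checksum
     * is partial"; without NEEDS_CSUM the guest would verify the unfinished
     * TCP checksum and drop the packet. */
    if (ol_flags & (TX_TCP_CKSUM | TX_TCP_SEG)) {
        h->flags = NET_HDR_F_NEEDS_CSUM;
        h->csum_start = (uint16_t)(l2_len + l3_len);
        h->csum_offset = TCP_CKSUM_OFFSET;
    }
}
```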
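4e9474141e avoids false sharing by reading broadcast_rarp before attempting the compare-and-set. The sketch below shows that "test, then test-and-set" pattern on a hypothetical one-shot flag. It uses standard `<stdatomic.h>` rather than DPDK's own atomic helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Consume a one-shot request flag (1 = pending) without dirtying its
 * cache line when nothing is pending. */
static bool consume_request(atomic_int *flag)
{
    int expected = 1;

    /* A plain load leaves the cache line shared with other cores... */
    if (atomic_load_explicit(flag, memory_order_relaxed) == 0)
        return false;
    /* ...whereas a locked compare-and-exchange requests the line for
     * writing (on x86 even when the comparison fails), so only attempt it
     * when it is likely to succeed. */
    return atomic_compare_exchange_strong(flag, &expected, 0);
}
```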
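cc7301908c bounds the descriptor walk in the enqueue path with a counter, `nr_desc`. The sketch below applies the same guard to a simplified chain walk. `chain_total_len()` is hypothetical; only the descriptor layout and the NEXT flag value come from the virtio split-ring format.

```c
#include <stdint.h>

#define DESC_F_NEXT 1   /* VRING_DESC_F_NEXT in the virtio spec */

/* Split-ring descriptor layout from the virtio specification. */
struct vdesc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Total length of the chain starting at head, or -1 if it is malformed.
 * A valid chain visits each of the `size` descriptors at most once, so
 * counting visits in nr_desc bounds the walk even when a guest points
 * desc->next back at itself with a zero len. */
static int64_t chain_total_len(const struct vdesc *descs, uint16_t size,
                               uint16_t head)
{
    int64_t total = 0;
    uint32_t nr_desc = 0;
    uint16_t idx = head;

    for (;;) {
        if (idx >= size || ++nr_desc > size)
            return -1;
        total += descs[idx].len;
        if (!(descs[idx].flags & DESC_F_NEXT))
            return total;
        idx = descs[idx].next;
    }
}
```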
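f689586bc0 batches used-ring updates in a local shadow array. f46f655143 fixes what each used element must contain: `id` is the index of the head of the descriptor chain, and `len` is the total length written to the whole chain. The sketch below combines the two ideas. The struct names and the fixed ring size are hypothetical. It omits live-migration logging and guest notification, and it uses the GCC/Clang `__atomic_store_n` builtin for the release store.

```c
#include <stdint.h>

#define RING_SIZE 8   /* hypothetical; virtio ring sizes are powers of two */

/* Used-ring element and ring, laid out as in the virtio split ring. */
struct used_elem {
    uint32_t id;    /* index of the FIRST descriptor of the chain */
    uint32_t len;   /* total bytes written to the WHOLE chain */
};

struct used_ring {
    uint16_t flags;
    uint16_t idx;
    struct used_elem ring[RING_SIZE];
};

/* Local batch of completions, filled while processing a burst. */
struct shadow_used {
    struct used_elem elems[RING_SIZE];
    uint16_t count;
};

static int shadow_add(struct shadow_used *s, uint16_t head_id, uint32_t len)
{
    if (s->count == RING_SIZE)
        return -1;
    s->elems[s->count].id = head_id;
    s->elems[s->count].len = len;
    s->count++;
    return 0;
}

/* Copy the batch into the used ring, then publish it with a single index
 * update; the release store orders the entry writes before the new idx. */
static void shadow_flush(struct shadow_used *s, struct used_ring *used)
{
    uint16_t idx = used->idx;

    for (uint16_t i = 0; i < s->count; i++)
        used->ring[(uint16_t)(idx + i) & (RING_SIZE - 1)] = s->elems[i];
    __atomic_store_n(&used->idx, (uint16_t)(idx + s->count), __ATOMIC_RELEASE);
    s->count = 0;
}
```

Publishing the index once per burst, instead of once per packet, is where the batching saves work: the guest-visible index is written a single time.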