This patch caches all dirty pages logging until the used ring index
is updated.
The goal of this optimization is to fix a performance regression
introduced when the vhost library started to use atomic operations
to set bits in the shared dirty log map. While the fix was valid
as previous implementation wasn't safe against concurrent accesses,
contention was induced.
With this patch, during migration, we have:
1. Less atomic operations as only a single atomic OR operation
per 32 or 64 (depending on CPU) pages.
2. Less atomic operations as during a burst, the same page will
be marked dirty only once.
3. Less write memory barriers.
Fixes: 897f13a1f7 ("vhost: make page logging atomic")
Cc: stable@dpdk.org
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>
This patch enables the handling of buffers non-contiguous in
virtual address space in the vhost_crypto. Instead of using
rte_vhost_va_from_guest_pa(), the host virtual address is
converted by vhost_iova_to_vva() for wider use cases.
For copy mode, the copy length is limited to the chunk size,
next chunks VAs being fetched afterward.
Signed-off-by: Fan Zhang <roy.fan.zhang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
This patch fixes the redundant descriptor move in the copy mode
of vhost crypto. Originally the redundant descriptor move will
cause the message parsing error.
Fixes: 3bb595ecd6 ("vhost/crypto: add request handler")
Signed-off-by: Fan Zhang <roy.fan.zhang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Currently, populate_virt will check if mempool is already populated.
This will cause inability to reserve multi-chunk mempools if
contiguous memory is not a hard requirement, because if allocating
all-contiguous memory fails, mempool will retry with virtual addresses
and will call populate_virt. It seems that the original code never
anticipated more than one non-physically contiguous area.
Fix it by removing the check in populate virt. populate_anon() function
calls populate_virt() also, and it can be reasonably inferred that it is
expecting that virtual area is not already populated. Even though a
similar check is already in place there, also add the check that was
part of populate_virt() just in case.
Fixes: aab4f62d6c ("mempool: support no hugepage mode")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
The intention of the original code was to create runtime data
directory as early as possible, however it was moved too early,
before the arguments were parsed, resulting in --file-prefix
option essentially not working.
Fix this by moving eal_create_runtime_dir() to after command
line arguments parsing.
Fixes: 56236363b4 ("eal: add directory for runtime data")
Reported-by: Andrew Rybchenko <arybchenko@solarflare.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
Fix all calls to functions in eal_filesystem to produce paths
residing inside dedicated DPDK runtime directory. Leaving DPDK
runtime config in place as 3rd-party applications within the
DPDK ecosystem might rely on this path to determine whether
DPDK is running, so moving that will be postponed to the next
release cycle.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, during runtime, DPDK will store a bunch of files here
and there (in /var/run, /tmp or in $HOME). Fix it by creating a
DPDK-specific runtime directory, under which all runtime data
will be placed. The template for creating this runtime directory
is the following:
<base path>/dpdk/<DPDK prefix>/
Where <base path> is set to either "/var/run" if run as root, or
$XDG_RUNTIME_DIR if run as non-root, with a fallback to /tmp if
$XDG_RUNTIME_DIR is not defined. So, for example, if run as root,
by default all runtime data will be stored at /var/run/dpdk/rte/.
There is no equivalent of "mkdir -p", so we will be creating the
path step by step.
Nothing uses this new path yet, changes for that will come in
next commit.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
The original name for this path was not too descriptive and
confusing. Rename it to a more appropriate and descriptive name:
it stores data about hugepages, so name it eal_hugepage_data_path().
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
The define was a leftover from IVSHMEM library.
Fixes: c711ccb309 ("ivshmem: remove library and its EAL integration")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: David Marchand <david.marchand@6wind.com>
Description of rte_eth_dev_get_name_by_port() calls
port ID argument a pointer, which is misleading.
Also, output buffer minimal size is not mentioned.
These points need to be improved.
Fixes: bde516d5a8 ("ethdev: get port by name")
Cc: stable@dpdk.org
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Relax the check for queue setup, since some device
may not update queue states during dev_stop.
Fixes: cac923cfea ("ethdev: support runtime queue setup")
Signed-off-by: Yanglong Wu <yanglong.wu@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
When an ethdev port is released, a destroy event is triggered to notify
the users about the released port.
A bit before the destroy event is triggered, the port becomes invalid
by changing its state to UNUSED and cleaning its data. Therefore, the
port is invalid for the destroy event callback process and the users
may get a wrong information of the port.
Move the destroy event emitting to be called before the port
invalidation.
Fixes: 133b54779a ("ethdev: fix port data reset timing")
Fixes: 29aa41e36d ("ethdev: add notifications for probing and removal")
Cc: stable@dpdk.org
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
The new device was notified as soon as it was allocated.
It leads to use a device which is not yet initialized.
The notification must be published after the initialization is done
by the PMD, but before the state is changed, in order to let
notified entities taking ownership before general availability.
Fixes: 29aa41e36d ("ethdev: add notifications for probing and removal")
Cc: stable@dpdk.org
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
The port was set to the state ATTACHED during allocation.
The consequence was to iterate over ports which are not initialized.
The state ATTACHED is now set as the last step of probing.
The uniqueness of port name is now checked before the availability
of a port id for allocation (order reversed).
As the state is not set on allocation anymore, it is also not checked
in the function telling whether a port is allocated or not.
The name of the port is set on allocation, so it is enough as a check.
Fixes: 5588909af2 ("ethdev: add device iterator")
Cc: stable@dpdk.org
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Signed-off-by: Matan Azrad <matan@mellanox.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
When comparing the port name, there can be a race condition with
a thread allocating a new port and writing the name at the same time.
It can lead to match with a partial name by error.
The check of the port is now considered as a critical section
protected with locks.
This fix will be even more required for multi-process when the
port availability will rely only on the name, in a following patch.
Fixes: 84934303a1 ("ethdev: synchronize port allocation")
Cc: stable@dpdk.org
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
When the state will be updated later than in allocation,
we may need to update the ownership of a port which is
still in state unused.
It will be used to take ownership of a port before it is
declared as available for other entities.
Cc: stable@dpdk.org
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
A new hook function is added and called inside the PMDs at the end
of the device probing:
- in primary process, after allocating, init and config
- in secondary process, after attaching and local init
This new function is almost empty for now.
It will be used later to add some post-initialization processing.
For the PMDs calling the helpers rte_eth_dev_create() or
rte_eth_dev_pci_generic_probe(), the hook rte_eth_dev_probing_finish()
is called from here, and not in the PMD itself.
Note that the helper rte_eth_dev_create() could be used more,
especially for vdevs, avoiding some code duplication in PMDs.
Cc: stable@dpdk.org
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
The enum rte_eth_dev_state was not properly documented.
Its values did not appear in the doxygen output,
and may be misunderstood.
The state RTE_ETH_DEV_DEFERRED has no interest anymore
since the ownership mechanism brings a more flexible categorization.
This state could be removed later.
Fixes: d52268a8b2 ("ethdev: expose device states")
Fixes: cb894d99ec ("ethdev: add deferred intermediate device state")
Fixes: 5b7ba31148 ("ethdev: add port ownership")
Fixes: 7106edc123 ("ethdev: add devop to check removal status")
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Matan Azrad <matan@mellanox.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
The owner id is 64-bit.
On 32-bit environment, it must be printed with PRIX64.
Fixes: 5b7ba31148 ("ethdev: add port ownership")
Cc: stable@dpdk.org
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
This patch check if a input requested offloading is valid or not.
Any reuqested offloading must be supported in the device capabilities.
Any offloading is disabled by default if it is not set in the parameter
dev_conf->[rt]xmode.offloads to rte_eth_dev_configure() and
[rt]x_conf->offloads to rte_eth_[rt]x_queue_setup().
If any offloading is enabled in rte_eth_dev_configure() by application,
it is enabled on all queues no matter whether it is per-queue or
per-port type and no matter whether it is set or cleared in
[rt]x_conf->offloads to rte_eth_[rt]x_queue_setup().
If a per-queue offloading hasn't be enabled in rte_eth_dev_configure(),
it can be enabled or disabled for individual queue in
ret_eth_[rt]x_queue_setup().
A new added offloading is the one which hasn't been enabled in
rte_eth_dev_configure() and is reuqested to be enabled in
rte_eth_[rt]x_queue_setup(), it must be per-queue type,
otherwise trigger an error log.
The underlying PMD must be aware that the requested offloadings
to PMD specific queue_setup() function only carries those
new added offloadings of per-queue type.
This patch can make above such checking in a common way in rte_ethdev
layer to avoid same checking in underlying PMD.
This patch assumes that all PMDs in 18.05-rc2 have already
converted to offload API defined in 17.11 . It also assumes
that all PMDs can return correct offloading capabilities
in rte_eth_dev_infos_get().
In the beginning of [rt]x_queue_setup() of underlying PMD,
add offloads = [rt]xconf->offloads |
dev->data->dev_conf.[rt]xmode.offloads; to keep same as offload API
defined in 17.11 to avoid upper application broken due to offload
API change.
PMD can use the info that input [rt]xconf->offloads only carry
the new added per-queue offloads to do some optimization or some
code change on base of this patch.
Signed-off-by: Wei Dai <wei.dai@intel.com>
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Calling dev_infos_get() devops directly in rte_eth_dev_configure cause
random values in uninitialized fields because devops doesn't reset the
dev_info structure.
Call rte_eth_dev_info_get() API instead which memset the struct.
Also remove duplicated dev_infos_get existence check.
Fixes: 3be82f5cc5 ("ethdev: support PMD-tuned Tx/Rx parameters")
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
When the vhost-user master sends memory updates using
VHOST_USER_SET_MEM request, the user backends unmap and then
mmap again the memory regions in its address space.
If the ring addresses have already been translated, it needs to
be translated again as they point to unmapped memory.
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
A bracket was misplaced in a condition check, this patch
fixes it.
Coverity issue: 277232, 277237
Fixes: 3bb595ecd6 ("vhost/crypto: add request handler")
Signed-off-by: Fan Zhang <roy.fan.zhang@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
In the loop to copy virtio-net header to the descriptor buffer,
destination pointer was incremented instead of the source
pointer.
Fixes: fb3815cc61 ("vhost: handle virtually non-contiguous buffers in Rx-mrg")
Fixes: 6727f5a739 ("vhost: handle virtually non-contiguous buffers in Rx")
Cc: stable@dpdk.org
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
when rte_vhost_driver_unregister detstroy the vsocket, we
should set it to NULL after freeing it, because in client mode,
the conn may be added to reconnect thread while vsocket is
destroyed. In one case, if qemu create vhostuser port as a
server with the same unix path, the reconnect thread will
reconnect to it while vsocket is destroyed.
To fix this:
1. set vsocket to NULL after free it.
2. remove the reconnection from reconnection thread in suitable
position.
Cc: stable@dpdk.org
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
When qemu close the unix socket fd of the vhostuser as a
server, and then immediately delete the vhostuser port on
openvswitch. There will be a deadlock.
A thread (fdset event thread): B thread:
1. fdset_event_dispatch rte_vhost_driver_unregister
2. set the fd busy to 1. lock vsocket->conn_mutex
3. vhost_user_read_cb fdset_del waits busy changed to 0.
4. vhost peer closed, remove the
conn from vsocket->conn_list:
lock vsocket->conn_mutex
5. set the fd busy to 0
Fixes: 65388b43f5 ("vhost: fix fd leaks for vhost-user server mode")
Cc: stable@dpdk.org
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tx offload will be converted to txq_flags automatically during
rte_eth_dev_info_get and rte_eth_tx_queue_info_get. So PMD can
clean the code to get rid of txq_flags at all while keep old APP
not be impacted.
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
In ip_frag_process, some IP_FRAG_LOG content is wrong.
Fixes: 4f1a8f6338 ("ip_frag: add IPv6 reassembly")
Cc: stable@dpdk.org
Signed-off-by: Li Han <han.li1@zte.com.cn>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
- add EXPERIMENTAL tag for the section in MAINTAINERS.
- add EXPERIMENTAL tag to BPF public API files.
- add attribute __rte_experimental to BPF public API declarations.
Fixes: 94972f35a0 ("bpf: add BPF loading and execution framework")
Fixes: 5dba93ae5f ("bpf: add ability to load eBPF program from ELF object file")
Fixes: a93ff62a89 ("bpf: introduce basic Rx/Tx filters")
Reported-by: Thomas Monjalon <thomas@monjalon.net>
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Remove version tag from experimental block in linker version scripts
(.map files).
That label is not used by linker and information only. It is useful
for version blocks but not useful for experimental block but confusing.
Removing those labels.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Currently, page deallocation might fail if allocator cannot get page
fd, which will leave VA space still mapped, and will also not mark
page as free.
Fix page deallocation function to always unmap space before trying
to get rid of the page itself, and always mark page as free even if
page deallocation failed.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Return value should be zero for success, but if unlock and unlink
have succeeded, return value was 1, which triggered failure message
in calling code.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Segment index was calculated incorrectly, causing free_seg to
attempt to free segments that do not exist.
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Yong Liu <yong.liu@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
If total memory is already bigger than max memory, an underflow
will occur on subtraction. Fix it by simply stopping whenever
we already have amount of memory that is bigger than maximum.
Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Currently, reserving a memzone with length set to 0 will not trigger
any memory allocations, and memzone will instead be looking through
already allocated memory only. Document this limitation.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Size of malloc heap elements include overhead, which should not
be counted as part of memzone.
Fixes: fafcc11985 ("mem: rework memzone to be allocated by malloc")
Cc: stable@dpdk.org
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Deallocation used the wrong function, which could have resulted in
race conditions because the function does not use locks internally.
Fixes: 1403f87d4f ("malloc: enable memory hotplug support")
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
When we ask to reserve virtual areas, we usually include
alignment in the mapping size, and that memory ends up
being wasted. Wasting a gigabyte of VA space while trying to
reserve one gigabyte is pretty expensive on 32-bit, so after
we're done mapping, unmap unneeded space.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Mapping size is a 64-bit integer, but mmap() will accept size_t for
size mappings. A user could request a mapping with an alignment, which
would have overflown size_t, so check if (size + alignment) will
overflow size_t.
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
The code aimed to pick and remember the value of
mempool ops name from EAL command line arguments does not
copy the string and remembers the pointer provided
by getopt_long() directly. The latter could be clobbered
later and result in reading wrong mbuf pool ops name
by rte_mempool library.
Typically, this flaw could be avoided by using strdup()
to remember the string value of the option.
Fixes: a103a97e71 ("eal: allow user to override default mempool driver")
Cc: stable@dpdk.org
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
In function 'crc32c_sse42_u64_mimic':
rte_hash_crc.h:402:40:
warning: conversion from 'uint64_t' {aka 'long unsigned int'}
to 'uint32_t' {aka 'unsigned int'} may change value [-Wconversion]
init_val = crc32c_sse42_u32(d.u32[0], init_val);
Fixes: 00bf774bab ("hash: add assembly implementation of CRC32 intrinsics")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
In function 'crc32c_2words':
rte_hash_crc.h:347:2:
warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
uint32_t crc, term1, term2;
Fixes: d983cf4169 ("hash: add software CRC32 implementation")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
In function 'rte_eth_tx_buffer_flush':
rte_ethdev.h:4248:55:
warning: conversion from 'int' to 'uint16_t'
{aka 'short unsigned int'} may change value [-Wconversion]
buffer->error_callback(&buffer->pkts[sent], to_send - sent,
Fixes: d6c99e62c8 ("ethdev: add buffered Tx")
Cc: stable@dpdk.org
Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>