In old rdma-core versions, a flow counter allocated by the batch API
could not be assigned to a flow in the root table (group 0).
Hence, root table flow counters required a PMD mechanism to manage
counters that were allocated singly.
Batch counters are now supported in the root table when the rdma-core
version includes the MLX5_FLOW_ACTION_COUNTER_OFFSET enum and the
kernel driver includes the
MLX5_IB_ATTR_CREATE_FLOW_ARR_COUNTERS_DEVX_OFFSET enum.
When the PMD uses the rdma-core API to assign a batch counter to a root
table flow with an invalid counter offset, it should get an error only
if batch counter assignment for the root table is supported.
Running this trial at initialization time helps to detect the
support.
If the trial shows that the support is present, remove the management
of the single counter container from the fast counter mechanism.
Otherwise, move the counter mechanism to fallback mode.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Commit [1] introduced a separate container for the aging counter
pools. To save container memory, the aging counter pools
can be located in the general pool container.
This patch locates the aging counter pools in the general pool
container and removes the aging container management.
[1] commit fd143711a6 ("net/mlx5: separate aging counter pool range")
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When an Rx/Tx queue is created by DevX, its CQ configuration should
include the EQ number of the interrupts.
The EQ is managed by the kernel, and there is a glue API to query the
EQ number from the kernel.
The EQ query API takes a vector number that specifies the kernel
vector handling the interrupt.
The vector number was wrongly derived from the CPU of the configuring
thread instead of from the device attributes of the supported vectors.
The CPU was obtained from the rte_lcore_to_cpu_id API without any
check, and in a non-EAL thread context the value was 0xFFFFFFFF,
which caused a failure in the EQ number query API.
Use vector 0, which must be supported by the kernel, for each EQ
number query.
Fixes: 08d1838f64 ("net/mlx5: implement CQ for Rx using DevX API")
Fixes: d133f4cdb7 ("net/mlx5: create clock queue for packet pacing")
Cc: stable@dpdk.org
Signed-off-by: Matan Azrad <matan@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The API function rte_eth_dev_close() was returning void.
The return type is changed to int for notifying of errors.
If an error happens during a close operation,
the status of the port is undefined,
a maximum of resources having been freed.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Liron Himi <lironh@marvell.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Translate the attributes of the sample action, which include the
sample ratio and the sub-action list.
The PMD checks the number of destination actions in the current flow.
If multiple destination actions are found, it creates a new
destination array rdma action that groups the actions for each
destination.
Currently only port or queue is supported as a destination action,
and only an encap action can be attached to a port destination.
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The flow with a sample action is split into two sub flows:
the prefix sub flow with all the actions preceding the sample
action and the sample action itself, and the suffix sub flow with
the actions following the sample action.
The original items remain in the prefix sub flow. An implicit tag
action with a unique id is added to set the metadata register, and
the suffix sub flow uses a tag item to match that unique id.
The flow is split as below:
Original flow: items / actions pre / sample / actions sfx ->
    prefix sub flow -
        items / actions pre / set_tag action / sample
    suffix sub flow -
        tag_item / actions sfx
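The action side of this split can be sketched as follows. This is only
an illustration: the enum values and the split_sample_flow() helper are
hypothetical names, not the PMD's real types, and the tag item added to
the suffix sub flow's pattern is not modelled.

```c
#include <stddef.h>

/* Hypothetical action kinds, for illustration only. */
enum act_type {
	ACT_END = 0,
	ACT_MARK,
	ACT_SAMPLE,
	ACT_SET_TAG,
	ACT_QUEUE,
};

/*
 * Split the ACT_END-terminated list `in` around its first ACT_SAMPLE.
 * prefix = actions pre / set_tag / sample, suffix = actions sfx.
 * Both output arrays have `cap` slots.
 * Returns 0 on success, -1 if there is no sample action or the
 * output arrays are too small.
 */
int
split_sample_flow(const enum act_type *in, enum act_type *prefix,
		  enum act_type *suffix, size_t cap)
{
	size_t n, i, p = 0, s = 0, sample = 0;
	int found = 0;

	for (n = 0; in[n] != ACT_END; n++) {
		if (in[n] == ACT_SAMPLE && !found) {
			sample = n;
			found = 1;
		}
	}
	/* prefix: pre + set_tag + sample + end; suffix: sfx + end. */
	if (!found || sample + 3 > cap || n - sample > cap)
		return -1;
	for (i = 0; i < sample; i++)
		prefix[p++] = in[i];
	prefix[p++] = ACT_SET_TAG;	/* implicit tag with the unique id */
	prefix[p++] = ACT_SAMPLE;
	prefix[p] = ACT_END;
	for (i = sample + 1; i < n; i++)
		suffix[s++] = in[i];
	suffix[s] = ACT_END;
	return 0;
}
```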
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The eqn field has become a field of sh directly since it is also
relevant for Tx and Rx.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The device operation .dev_close was returning void.
This driver interface is changed to return an int.
Note that the API rte_eth_dev_close() is still returning void,
although a deprecation notice is pending to change it as well.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Rosen Xu <rosen.xu@intel.com>
Reviewed-by: Sachin Saxena <sachin.saxena@oss.nxp.com>
Reviewed-by: Liron Himi <lironh@marvell.com>
Reviewed-by: Haiyue Wang <haiyue.wang@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Several DV-based structs of type 'struct mlx5dv_devx_XXX' are replaced
with 'void *' to enable compilation under non-Linux operating systems.
New getter functions were added to retrieve the specific fields that
were previously accessed directly.
Replaced structs:
'struct mlx5dv_pp *'
'struct mlx5dv_devx_event_channel *'
'struct mlx5dv_devx_umem *'
'struct mlx5dv_devx_uar *'
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This commit adds Linux implementation of routine mlx5_os_mac_addr_flush
as wrapper to Netlink API to avoid direct calls under non-Linux
operating systems.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
mlx5_get_ifname() prototype includes 'IF_NAMESIZE' definition from Linux
file net/if.h. Since this API is only used under Linux and to enable
compilation under non-Linux OS - move this prototype from shared file
mlx5.h to file linux/mlx5_os.h.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The User Access Region (UAR) is a special mechanism that provides
direct access to the hardware registers. It is the part of the PCI
address space that is mapped to CPU virtual addresses. The mapping
can be of type "Write-Combining" or "Non-Cached", and either type
might or might not be supported on a given setup.
To prevent device probing failure, a UAR allocation attempt with
the alternative mapping type is performed. The datapath
takes the actual UAR mapping into account on queue creation.
There was another issue with a NULL UAR base address.
OFED 5.0.x and upstream rdma-core before v29 returned NULL as the
UAR base address if the UAR was not the first object in the UAR page.
This caused PMD failure, so the PMD should keep trying to get another
UAR until one with a non-NULL base address is returned.
Fixes: fc4d4f732b ("net/mlx5: introduce shared UAR resource")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Ori Kam <orika@mellanox.com>
When the PMD releases a shared IB device context, it holds the
mlx5_ibv_list_mutex lock throughout the function so that no other
process can insert a device into the list while a device is being
removed from it.
On the other hand, once the device has been removed from the list,
even if it has not yet released all of its resources, it no longer
needs to care about other processes and can release the lock.
However, the PMD does not release the lock even though it can, and
performs a number of operations, some of which include sleep and may
take a long time.
To improve this, shorten the lock time to the minimum necessary.
Signed-off-by: Michael Baum <michaelba@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Migrate mlx5 net, vdpa and regex PMD to start using mlx5 common class
driver.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
mlx5_common is a shared library used by the mlx5 net, VDPA and regex
PMDs. It is better to use a common initialization helper instead of
using the RTE_PRIORITY_CLASS priority.
Suggested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
This patch continues the work to use DevX API for different objects
creation and management.
On Rx control path, the RQ, RQT, and TIR objects can already be
created using DevX API.
This patch adds the support to create CQ for RxQ using DevX API.
The corresponding event channel is also created and utilized using
DevX API.
Signed-off-by: Dekel Peled <dekelp@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Several source files include Verbs header files as in (1). These source
files will not compile under non-Linux operating systems. This commit
removes this inclusion in two cases:
Case 1: There is no usage of ibv_* or mlx5dv_* symbols in the source
file so the inclusion in (1) can be safely removed.
Case 2: Verbs symbols are used. Please note the inclusion in (1) already
appears in file linux/mlx5_glue.h (which represents the interface
to the rdma-core library). Therefore, replace (1) in the source file
with (2). Under non-Linux operating systems - file mlx5_glue.h will not
include (1).
(1)
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>
(2)
#include <mlx5_glue.h>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
1. The shared data communication between the primary and the secondary
processes is implemented using Linux API. Move the Linux API code under
linux directory (file linux/mlx5_os.c).
2. File net/mlx5/mlx5_mp.c handles requests to the primary and secondary
processes (e.g. start_rxtx, stop_rxtx). It is Linux based so it is moved
under linux (new file linux/mlx5_mp_os.c).
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Some NICs (at least ConnectX-6 Dx and BlueField 2) have limitations
in supporting FCS (frame check sequence) scattering for tunnel
decapsulated packets.
In this case only one of the two features can be supported at a time,
and the new devarg "decap_en" is introduced to let the user choose.
If the application intends to use the FCS scattering feature, this
devarg should be specified as "decap_en=0", which forces the FCS
feature to be enabled and rejects tunnel decap actions in the
rte_flow engine.
If FCS scattering is not needed and the application intends to use
tunnel decapsulation in rte_flow, the devarg can be omitted or set to
a non-zero value (this is the default setting).
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit allocates the miscellaneous configuration objects from the
unified malloc function.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Currently, for the MLX5 PMD, once millions of flows are created, the
memory consumption of the flows is very large. On a system with
limited memory, this means most of the memory must be reserved in
advance as hugepage memory to serve the flows, and other applications
can no longer use that reserved memory. Since most of the time the
system does not hold many flows, the reserved hugepage memory is
wasted most of the time.
When the new sys_mem_en devarg is set to true, the PMD allocates
memory from the system by default, using the newly added mlx5 memory
management functions. Memory is allocated from rte only when the
MLX5_MEM_RTE flag is set; otherwise it is allocated from the system.
In this case, a system with limited memory does not need to reserve
most of its memory as hugepages. Only the memory needed for datapath
objects is allocated from hugepages, by setting the flag explicitly.
All other memory is allocated from the system. On a system with
enough memory, the devarg can be left unset and the memory always
comes from rte hugepages.
One restriction applies to a DPDK application with multiple PCI
devices: if the sys_mem_en devargs differ between the devices,
sys_mem_en takes the value from the first device's devargs and a
warning message is printed.
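The selection rule described above can be sketched as a small decision
function. The names below (MEM_RTE, struct mem_cfg, mem_pick_source())
are hypothetical stand-ins chosen for this illustration and are not the
PMD's actual memory management API.

```c
#include <stdint.h>

/* Hypothetical flag mirroring the role of MLX5_MEM_RTE. */
#define MEM_RTE (1u << 0)

enum mem_src {
	MEM_SRC_SYS,	/* plain system memory */
	MEM_SRC_RTE,	/* rte hugepage memory */
};

/* Hypothetical per-process configuration parsed from devargs. */
struct mem_cfg {
	int sys_mem_en;
};

/* Decide where one allocation request is served from. */
enum mem_src
mem_pick_source(const struct mem_cfg *cfg, uint32_t flags)
{
	/* An explicit request for rte memory always wins. */
	if (flags & MEM_RTE)
		return MEM_SRC_RTE;
	/* Otherwise the devarg selects the default source. */
	return cfg->sys_mem_en ? MEM_SRC_SYS : MEM_SRC_RTE;
}
```

With sys_mem_en set, only callers that pass the flag (datapath objects)
consume hugepage memory; without it every request goes to rte memory.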
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
If the NIC or the FW does not support the dynamic flex parser,
an error is returned when trying to create the parser for eCPRI,
and it is then hard to determine the detailed reason for the failure.
Before creating the parser node and using the parser, the capability
bit saved in the HCA_CAP can be used to confirm whether the dynamic
flex parser is supported.
If it is not, an error is returned directly with ENOTSUP to prevent
the following steps from being executed.
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
eCPRI protocol has unified format layout for the variants, over
ETH layer (including .1Q) and UDP layer.
The common header of the message has 4 bytes fixed length, and the
message payload layers are different based on the type field. Now
only type #0, #2 and #5 will be supported, and 2 bytes are needed.
When creating the flex parser, the header will be extended to 8
bytes and 2 DW samples are needed. The 1st DW starts from offset 0
and will be used for the type field of the common header. The 2nd
DW starts from offset 4 and will be used for the physical channel
ID, real-time control ID or measurement ID fields.
The parser will be created once a flow with eCPRI item is observed
for the first time. After creating, it will remain in the system
and HW until the device is stopped. Right now, there is no need to
destroy the eCPRI flex parser after the last flow with eCPRI item
is destroyed. This is to get rid of the alternate states of creating
and destroying eCPRI flex parser with a single eCPRI flow.
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
In the translation stage, the eCPRI item should be translated into
the format that the lower layer driver can use. All the fields that
need to match, as well as the mask, must be in network byte order
after translation. Since the header in the item belongs to the
network layers stack, the input parameter of the header is
considered to be in big-endian format already.
Based on the definition in the PRM, the DW samples are used for
matching in the FTE/STE. For now, only the type field and the PC ID,
RTC ID, and DLY MSR ID of the payload are supported. The masks should
be 00 ff 00 00 ff ff(00) 00 00 in network order. Two DWs are
needed to support such matching. The mask fields can be zeros to
support some wildcard rules, but it makes no sense to support a
rule that matches only on the payload without matching the type field.
The DW samples should be stored after the flex parser creation for
eCPRI, so there is no need to query the sample IDs each time a
flow rule with an eCPRI item is created. This way the insertion rate
is not degraded significantly.
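A minimal sketch of building the two DW masks in network byte order is
shown below. It assumes that the "(00)" in the mask string refers to the
delay measurement ID being one byte wide while the PC ID and RTC ID are
two bytes wide. The type names, struct ecpri_dw_mask and
ecpri_build_mask() are hypothetical and only serve this illustration.

```c
#include <stdint.h>

/* The three supported message types (#0, #2 and #5). */
enum ecpri_type {
	ECPRI_TYPE_IQ_DATA = 0,
	ECPRI_TYPE_RTC_CTRL = 2,
	ECPRI_TYPE_DLY_MSR = 5,
};

/* Two DW samples, stored as bytes in network order. */
struct ecpri_dw_mask {
	uint8_t dw0[4];	/* common header, offset 0 */
	uint8_t dw1[4];	/* first payload DW, offset 4 */
};

/* Fill the full (non-wildcard) masks for one message type. */
int
ecpri_build_mask(uint8_t type, struct ecpri_dw_mask *m)
{
	/* Only the type byte of the common header is matched. */
	static const uint8_t type_mask[4] = { 0x00, 0xff, 0x00, 0x00 };
	uint8_t id_bytes;
	int i;

	switch (type) {
	case ECPRI_TYPE_IQ_DATA:	/* PC ID */
	case ECPRI_TYPE_RTC_CTRL:	/* RTC ID */
		id_bytes = 2;
		break;
	case ECPRI_TYPE_DLY_MSR:	/* measurement ID (assumed 1 byte) */
		id_bytes = 1;
		break;
	default:
		return -1;		/* unsupported type */
	}
	for (i = 0; i < 4; i++) {
		m->dw0[i] = type_mask[i];
		m->dw1[i] = (i < id_bytes) ? 0xff : 0x00;
	}
	return 0;
}
```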
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This patch creates the special completion queue providing
reference completions to schedule packet send from
other transmitting queues.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This is a preparation step before moving the Tx queue creation
to the DevX approach. Some features require a shared UAR
for Tx queues and scheduling completion queues, so the patch
manages the shared UAR.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The master and representors might be created over multiport
Infiniband devices, and the UAR resources allocated for sibling
ports might belong to the same underlying Infiniband device.
The hardware requires write access to the UAR to be performed
as an atomic 64-bit write. On 32-bit systems this is done as two
sequential writes protected by a lock. Because the same UAR can be
shared between sibling devices, the locks must be moved to the
shared context.
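The 32-bit path can be sketched as below. This is an illustration
under assumptions: struct uar_lock and the function names are
hypothetical, a C11 atomic_flag spinlock stands in for the lock kept in
the shared context, a release fence stands in for an I/O write barrier,
and writing the low word first is an assumption of this sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock living in the shared device context. */
struct uar_lock {
	atomic_flag busy;
};

void
uar_lock_acquire(struct uar_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->busy,
						 memory_order_acquire))
		; /* spin until the flag is released */
}

void
uar_lock_release(struct uar_lock *l)
{
	atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/*
 * Emulate one 64-bit register write with two sequential 32-bit
 * writes. Every port sharing the UAR must pass the same lock,
 * otherwise the halves of two writes could interleave.
 */
void
uar_write64_split(uint64_t val, volatile uint32_t *addr,
		  struct uar_lock *lock)
{
	uar_lock_acquire(lock);
	addr[0] = (uint32_t)val;		/* low word */
	atomic_thread_fence(memory_order_release);
	addr[1] = (uint32_t)(val >> 32);	/* high word */
	uar_lock_release(lock);
}
```

On 64-bit systems a single 64-bit store is enough and the lock is not
needed.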
Fixes: f048f3d479 ("net/mlx5: switch to the shared IB device context")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This patch introduces the new devargs:
tx_pp - enables accurate packet send scheduling on mbuf timestamps
in the PMD. On the device start if "rte_dynflag_timestamp"
dynamic flag is registered and this devarg non-zero value is
specified, the driver initializes all necessary internal
infrastructure to provide packet scheduling. The parameter
value specifies scheduling granularity in nanoseconds.
tx_skew - the parameter adjusts the send packet scheduling on
timestamps and represents the average delay between beginning
of the transmitting descriptor processing by the hardware and
appearance of actual packet data on the wire. The value should
be provided in nanoseconds and is valid only if tx_pp parameter
is specified. The default value is zero.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This adds the ConnectX-6 Lx device id to the list of supported
Mellanox devices that run the MLX5 PMD.
The device is still in development stage.
Signed-off-by: Ali Alnubani <alialnu@mellanox.com>
Acked-by: Raslan Darawsheh <rasland@mellanox.com>
Introduce the RTE_LOG_REGISTER macro to avoid the code duplication
in the logtype registration process.
It is a wrapper macro for declaring the logtype, registering it and
setting its level in the constructor context.
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Adam Dybkowski <adamx.dybkowski@intel.com>
Acked-by: Sachin Saxena <sachin.saxena@nxp.com>
Acked-by: Akhil Goyal <akhil.goyal@nxp.com>
The new devarg controls the steering of the LACP traffic.
When dv_lacp_by_user = 0, the LACP traffic is steered to the
kernel and managed there.
When dv_lacp_by_user = 1, the LACP traffic is not steered to the
kernel and the user needs to manage it.
Signed-off-by: Shiri Kuzin <shirik@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The creation of DBR can be used by a number of different
Mellanox PMDs, for example RegEx / Net / VDPA.
This commit moves the DBR creation and release functions to the
common folder.
Signed-off-by: Ori Kam <orika@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Getter functions such as: 'mlx5_os_get_ctx_device_name',
'mlx5_os_get_ctx_device_path', 'mlx5_os_get_dev_device_name',
'mlx5_os_get_umem_id' are implemented under net directory. To enable
additional devices (e.g. regex, vdpa) to access these getter functions
they are moved under common directory.
As part of this commit string sizes DEV_SYSFS_NAME_MAX and
DEV_SYSFS_PATH_MAX are increased by 1 to make sure that the destination
string size in strncpy() function is bigger than the source string size.
This update will avoid GCC version 8 error -Werror=stringop-truncation.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Currently, allocating a new counter requires looping over the whole
container pool list to find a free counter.
When millions of counters are allocated and no pool has a free
counter left, allocating a new counter still loops over the
whole container pool list first and only then allocates a new pool
to get a free counter. The pool list traversal wastes cycles.
Adding a global free counter list to the container helps to get free
counters more efficiently.
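A minimal sketch of such a container-wide free list follows. The
struct and function names are hypothetical and chosen only for this
illustration. Freed counters are pushed onto one list, so allocation
pops the head instead of walking the pools.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter with an intrusive free-list link. */
struct counter {
	struct counter *next_free;
	uint32_t id;
};

/* Hypothetical container holding the global free list. */
struct cnt_container {
	struct counter *free_head;
};

/* Return a counter to the global free list: O(1). */
void
cnt_release(struct cnt_container *c, struct counter *cnt)
{
	cnt->next_free = c->free_head;
	c->free_head = cnt;
}

/*
 * Take a free counter: O(1), no pool list traversal.
 * NULL means that no counter is free and the caller has to
 * create a new pool.
 */
struct counter *
cnt_acquire(struct cnt_container *c)
{
	struct counter *cnt = c->free_head;

	if (cnt != NULL) {
		c->free_head = cnt->next_free;
		cnt->next_free = NULL;
	}
	return cnt;
}
```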
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
For a single counter, allocating a new counter requires finding the
pool it belongs to so that the counters can be queried together.
Once millions of counters are allocated, the pool array in the
counter container becomes very large, and searching the pool array
becomes extremely slow.
Save the minimum and maximum counter ID to allow a quick check of the
current counter ID range. Also, start searching from the last pool in
the container, which will usually be the needed pool since counter
IDs increase sequentially.
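The lookup can be sketched as follows, assuming each pool covers a
contiguous range of 512 counter IDs starting at its base ID. The pool
size, the structs and find_pool() are assumptions of this illustration,
not the PMD's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define CNT_PER_POOL 512u	/* assumed pool size */

struct cnt_pool {
	uint32_t base_id;	/* first counter ID in this pool */
};

struct cnt_container {
	struct cnt_pool *pools;
	size_t n_pools;
	uint32_t min_id;	/* smallest counter ID seen so far */
	uint32_t max_id;	/* largest counter ID seen so far */
};

struct cnt_pool *
find_pool(struct cnt_container *c, uint32_t id)
{
	size_t i;

	/* Quick reject: ID outside the known range. */
	if (c->n_pools == 0 || id < c->min_id || id > c->max_id)
		return NULL;
	/* Newest pools first, since IDs grow sequentially. */
	for (i = c->n_pools; i-- > 0; ) {
		uint32_t base = c->pools[i].base_id;

		if (id >= base && id - base < CNT_PER_POOL)
			return &c->pools[i];
	}
	return NULL;
}
```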
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Currently, checking whether a shared counter with the same ID already
exists requires looping over the counter pools to search for the
counter. Even adding the counters to a list does not help much when
there are thousands of shared counters in the list.
Using a three-level table to look up the counter index saved in the
relevant table entry is more efficient.
This patch introduces the three-level table to save the counter index
for each ID. The next time the same ID comes, checking the table
entry of this ID gives the counter index directly.
No search is needed.
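The idea can be sketched with a generic three-level table keyed by a
32-bit ID. The 10/10/12 bit split, the struct names and the use of 0 as
"no entry" are assumptions of this sketch and may differ from the PMD's
table.

```c
#include <stdint.h>
#include <stdlib.h>

#define L3T_TOP_BITS 10
#define L3T_MID_BITS 10
#define L3T_LOW_BITS 12

#define L3T_TOP(id) ((id) >> (L3T_MID_BITS + L3T_LOW_BITS))
#define L3T_MID(id) (((id) >> L3T_LOW_BITS) & ((1u << L3T_MID_BITS) - 1))
#define L3T_LOW(id) ((id) & ((1u << L3T_LOW_BITS) - 1))

struct l3t_entry_tbl {
	uint32_t cnt_idx[1u << L3T_LOW_BITS];	/* 0 means "not set" */
};

struct l3t_mid_tbl {
	struct l3t_entry_tbl *tbl[1u << L3T_MID_BITS];
};

struct l3t {
	struct l3t_mid_tbl *tbl[1u << L3T_TOP_BITS];
};

/* Return the counter index saved for `id`, or 0 if none. */
uint32_t
l3t_get(const struct l3t *t, uint32_t id)
{
	const struct l3t_mid_tbl *m = t->tbl[L3T_TOP(id)];
	const struct l3t_entry_tbl *e;

	if (m == NULL)
		return 0;
	e = m->tbl[L3T_MID(id)];
	return e != NULL ? e->cnt_idx[L3T_LOW(id)] : 0;
}

/* Save `cnt_idx` for `id`; sub-tables are created on demand. */
int
l3t_set(struct l3t *t, uint32_t id, uint32_t cnt_idx)
{
	struct l3t_mid_tbl **m = &t->tbl[L3T_TOP(id)];
	struct l3t_entry_tbl **e;

	if (*m == NULL && (*m = calloc(1, sizeof(**m))) == NULL)
		return -1;
	e = &(*m)->tbl[L3T_MID(id)];
	if (*e == NULL && (*e = calloc(1, sizeof(**e))) == NULL)
		return -1;
	(*e)->cnt_idx[L3T_LOW(id)] = cnt_idx;
	return 0;
}
```

A lookup costs three array accesses regardless of how many shared
counters exist, and memory is only spent on ID ranges actually in use.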
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Prior to this commit MR operations were verbs based and hard coded under
common/mlx5/linux directory. This commit enables upper layers (e.g.
net/mlx5) to determine which MR operations to use. For example the net
layer could set devx based MR operations in non-Linux environments. The
reg_mr and dereg_mr callbacks are added to the global per-device MR
cache 'struct mlx5_mr_share_cache'.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
There are three types of eth_dev_ops: primary, secondary and isolate.
Their function calls assignments are moved from common file
mlx5.c to the Linux specific file linux/mlx5_os.c.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
In a DV enabled MLX5 PMD build,
mlx5_ipool_cfg[MLX5_IPOOL_MLX5_FLOW].size was initialized for the DV
structure. If RTE initialization encountered an MLX5 PCI function
with DV support disabled, mlx5_ipool_cfg[MLX5_IPOOL_MLX5_FLOW].size
was reduced to match the legacy verbs flow size. Since
mlx5_ipool_cfg[MLX5_IPOOL_MLX5_FLOW] is a global variable, that
change was reflected on DV enabled MLX5 PCI functions too.
Running a flow with an invalid ipool size crashes the PMD.
The patch adjusts the ipool flow size for each active PCI function.
Fixes: b88341ca35 ("net/mlx5: convert flow dev handle to indexed")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
1. Replace 'struct ibv_device *' with 'void *' in 'struct
mlx5_dev_spawn_data'. Define a getter function to retrieve the
device name.
2. Rename ibv_dev and ibv_port as phys_dev and phys_port
respectively.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
File drivers/net/linux/mlx5_os.h is added. It includes specific
Linux definitions such as PCI driver flags, link state changes
interrupts, link removal interrupts, etc.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Refactor PCI probing related code. Move Linux specific functions (as
well as verbs and dv related code) from mlx5.c file to linux/mlx5_os.c
file.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
umem field is used in several structs. Its type 'struct mlx5dv_devx_umem
*' is changed to 'void *'. This change will allow non-Linux OS
compilations.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Define 'struct mlx5_dev_attr' which is ibv and dv independent. It
contains attributes that were originally contained in 'struct
ibv_device_attr_ex' and 'struct mlx5dv_context dv_attr'. Add a new API
mlx5_os_get_dev_attr() which fills in the newly defined struct.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
'ctx' type (field in 'struct mlx5_ctx_shared') is changed from 'struct
ibv_context *' to 'void *'. 'ctx' members which are verbs dependent
(e.g. device_name) will be accessed through getter functions which are
added to a new file under Linux directory: linux/mlx5_os.c.
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Currently, when a flow is destroyed, some memory resources may still
be kept cached to make the next flow creation more efficient.
Some systems need the resources to be more flexible across flow
creation and destruction. After peak time, with millions of flows
destroyed, such a system would prefer the resources to be reclaimed
completely, with no cache, so that they can be allocated and used by
other components. Such a system is less sensitive to the flow
insertion rate and cares more about the resources.
Both the DPDK mlx5 PMD and the low level component rdma-core allow
the flow resources to be configured as cached or not, but no API or
parameter is exposed to the user to configure the flow resources
cache mode. Introducing a new PMD devarg lets the user configure it.
This commit adds a new "reclaim_mem_mode" devarg to let the user
configure whether the cached resources of destroyed flows are kept.
There are three modes to choose from:
1. 0(none). The flow resources are cached as usual, which helps the
flow insertion rate.
2. 1(light). Only the DPDK PMD level resource reclaim is enabled.
3. 2(aggressive). Both the DPDK PMD level and the rdma-core low level
are configured in reclaim mode.
With these three modes, the user can configure the resources cache
mode at different levels.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Currently, the DevX counter query works asynchronously, with the DevX
interrupt handler returning the query result. When the port closes,
the interrupt handler is uninstalled and the DevX comp obj is
also destroyed, while the query is still not cancelled.
In this case, the counter query may use the invalid DevX comp obj
which has been destroyed, and a query failure with an invalid FD is
reported.
Adjust the shared interrupt install and uninstall timing so that
the asynchronous counter query stops before the interrupt uninstall.
Fixes: f15db67df0 ("net/mlx5: accelerate DV flow counter query")
Cc: stable@dpdk.org
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
When a secondary process starts, it allocates its own process private
data and also remaps the UAR register of the Tx queue. Once the
secondary process exits, these resources should be released, and the
shared resources owned by the primary should not be touched.
Currently, once the spawn of one port fails in the secondary process,
all the other spawned ports are also released when the process exits.
However, the mlx5_dev_close() function does not handle the secondary
process case, so calling mlx5_dev_close() directly in a secondary
process releases resources it should not touch.
Add the secondary process case to mlx5_dev_close() so that it
releases only its own resources and quits gracefully.
Fixes: 942d13e6e7 ("net/mlx5: fix sharing context destroy order")
Fixes: 3a8207423a ("net/mlx5: close all ports on remove")
Cc: stable@dpdk.org
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Removed the typing error in doc/guides/eventdevs/index.rst,
drivers/net/mlx5/mlx5.c and in lib/librte_vhost/rte_vhost.h
Bugzilla ID: 477
Fixes: 0857b94211 ("doc: add event device and software eventdev")
Fixes: 039253166a ("vhost: add device op when notification to guest is sent")
Fixes: ad74bc6195 ("net/mlx5: support multiport IB device during probing")
Cc: stable@dpdk.org
Signed-off-by: Muhammad Bilal <m.bilal@emumba.com>
The doorbell record is organized with pages and bitmaps. When a new
doorbell needs to be associated with a queue, a bit is set
in the bitmap to mark the corresponding doorbell as occupied. A
counter records the number of occupied doorbells to speed
up the search.
If the counter reaches the pre-defined maximum number of doorbells
in a page, a new page is allocated. If not, the bitmap is
checked to find a free one.
The LSHIFT and OR (AND NOT) operations are used to update the bitmap
of a page. But the constant 1 is treated as a signed integer by the
compiler. When the shift count is 31, the shifted value is considered
negative, and a wrong sign extension is done when assigning it to a
64-bit variable: all the upper 32 bits are set to 1 by this
extension.
A wrong offset value is then calculated because of this. The
next 64 bits are also treated as part of the bitmap and get corrupted
by the bit set operation.
The immediate value 1 needs to be given 64-bit width explicitly.
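The effect and the fix can be sketched as follows. The helper names
and the 64-bit word layout are assumptions of this illustration. Since
shifting a signed 1 into the sign bit is not well defined in C, the
sketch reproduces the sign extension by widening INT32_MIN, which is
the value typical compilers produce for such a shift.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_WORD 64u

/* What a negative 32-bit value becomes in a 64-bit variable. */
uint64_t
widen_signed(int32_t v)
{
	return (uint64_t)(int64_t)v;	/* upper 32 bits become ones */
}

/* Mark doorbell `i` as occupied; the constant 1 is 64 bits wide. */
void
dbr_set_bit(uint64_t *bitmap, unsigned int nbits, unsigned int i)
{
	assert(i < nbits);	/* guard before touching the bitmap */
	bitmap[i / BITS_PER_WORD] |= UINT64_C(1) << (i % BITS_PER_WORD);
}

/* Mark doorbell `i` as free again. */
void
dbr_clear_bit(uint64_t *bitmap, unsigned int nbits, unsigned int i)
{
	assert(i < nbits);
	bitmap[i / BITS_PER_WORD] &= ~(UINT64_C(1) << (i % BITS_PER_WORD));
}
```

With the 64-bit constant, setting bit 31 changes only that bit, while
the sign-extended value would also set bits 32 to 63.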
Fixes: 21cae8580f ("net/mlx5: allocate door-bells via DevX")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The design of the counter container resize used a double buffer
algorithm to synchronize between the query thread and the control
thread.
When the control thread detected the need for a resize, it created a
new, bigger buffer for the counter pools in a new container and
changed the container index atomically.
If the query thread had not yet detected the previous resize when the
control thread detected the need for a new one, the control thread
returned EAGAIN to the flow creation API that used a COUNT action.
The rte_flow API does not allow non-blocking commands and does not
expect to get an EAGAIN error.
So, when a lot of flows were created between 2 periodic queries,
2 different resizes might be attempted, causing an EAGAIN error.
This behavior may cause flow creation to fail.
Change the synchronization to use a lock instead of the double buffer
algorithm.
The critical section of this lock is very small, so the flow insertion
rate should not decrease.
Fixes: ebbac312e4 ("net/mlx5: resize a full counter container")
Cc: stable@dpdk.org
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Currently, the mlx5 driver has no flow aging check or age-out event
callback mechanism. This patch implements them. It includes:
- Splitting the current counter container into aged and non-aged
containers to reduce memory consumption. The aged container
allocates extra memory to save the aging parameters from the user
configuration.
- An aging check and age-out event callback mechanism based on the
current counters. When a flow is found to be aged out, the
RTE_ETH_EVENT_FLOW_AGED event is triggered to applications.
- Implementation of the new API rte_flow_get_aged_flows, which
applications can use to get the aged flows.
Signed-off-by: Dong Zhou <dongz@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Currently, the counter pool needs memory for 512 ext-counters for
non-batch counters. It is allocated separately, in one piece, behind
the memory of the 512 basic-counters. This makes it hard to get the
ext-counter pointer from the corresponding basic-counter pointer, and
hard to add other potential types of counter memory.
So, allocate the ext-counter and the basic-counter of each counter
together, as a single piece of memory. The same applies to any
further additional type of counter memory. In this case, one piece of
memory contains all the memory types for one counter, and each type
is easy to reach by offset.
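The combined layout can be sketched as below. The struct fields and
names are hypothetical and only illustrate reaching the ext part from
the basic part with a fixed offset.

```c
#include <stddef.h>
#include <stdint.h>

#define CNT_PER_POOL 512

/* Hypothetical basic and extended parts of one counter. */
struct cnt_basic {
	uint64_t hits;
	uint64_t bytes;
};

struct cnt_ext {
	uint32_t ref_cnt;
	uint32_t shared_id;
};

/* One slot holds every memory type of a single counter. */
struct cnt_slot {
	struct cnt_basic basic;	/* must stay the first member */
	struct cnt_ext ext;
};

struct cnt_pool {
	struct cnt_slot slots[CNT_PER_POOL];
};

/* Reach the ext part from the basic part with a constant offset. */
struct cnt_ext *
cnt_basic_to_ext(struct cnt_basic *b)
{
	return (struct cnt_ext *)((uint8_t *)b +
				  offsetof(struct cnt_slot, ext));
}
```

Adding another per-counter memory type only means adding a member to
the slot and one more offset helper.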
Signed-off-by: Dong Zhou <dongz@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The assert makes sure that 'i' doesn't exceed the expected value.
This prevents an out-of-bounds access to dbr_bitmap.
The current location of the assert protects the assignment of
dbr_bitmap, but not the access to it.
Move the assert to the correct place, to protect both cases.
Also, use an existing define for the assert.
Fixes: 21cae8580f ("net/mlx5: allocate door-bells via DevX")
Cc: stable@dpdk.org
Signed-off-by: Asaf Penso <asafp@mellanox.com>
Reviewed-by: Dekel Peled <dekelp@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Currently, the rte flow structure is not fully aligned and has some
bits wasted. The members can be optimized and reorganized to save
memory.
1. The drv_type uses only a limited range of values, so change its type
to the 2 bits it needs.
2. Align the hairpin_flow_id, drv_type, fdir and copy_applied members
to 32 bits, as hairpin never uses the full 32 bits.
3. __rte_packed helps tighten the structure memory layout.
In total, the optimization saves 14 bytes for the structure.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit allocates rte flow from indexed memory pool.
Allocating the rte flow memory from the indexed memory pool saves more
than MALLOC_ELEM_OVERHEAD bytes of memory compared to rte_malloc().
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Allocating the mark copy resource from the indexed pool lets the rte
flow save a 4-byte index instead of an 8-byte pointer. For the mark copy
resource itself, it saves the MALLOC_ELEM_OVERHEAD bytes of
rte_malloc().
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This patch allocates the meter object memory from the indexed memory
pool, which helps to save the MALLOC_ELEM_OVERHEAD memory taken by
rte_malloc().
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit converts flow dev handle to indexed.
Changing the mlx5 flow handle from a pointer to a uint32_t index saves
memory for each flow. With a million flows, it saves several MBytes of
memory.
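A minimal sketch of the idea behind an indexed pool, assuming a fixed
size array and a 1-based index where 0 plays the role of a NULL
pointer. All names here (idx_pool, handle and the helpers) are
hypothetical and are not the PMD's indexed pool API. On a 64-bit host
the stored reference shrinks from an 8-byte pointer to a 4-byte index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 1024u

/* Hypothetical entry type standing in for a flow handle. */
struct handle {
	uint32_t next_free; /* 1-based index of the next free entry */
	uint32_t data;
};

struct idx_pool {
	struct handle entries[POOL_SIZE];
	uint32_t free_head; /* 1-based index of first free entry, 0 = none */
};

static void
idx_pool_init(struct idx_pool *p)
{
	uint32_t i;

	/* Chain all entries into a free list; index 0 is reserved. */
	for (i = 0; i < POOL_SIZE; i++)
		p->entries[i].next_free = (i + 1 < POOL_SIZE) ? i + 2 : 0;
	p->free_head = 1;
}

/* Return a 1-based index, or 0 when the pool is exhausted. */
static uint32_t
idx_pool_alloc(struct idx_pool *p)
{
	uint32_t idx = p->free_head;

	if (idx != 0)
		p->free_head = p->entries[idx - 1].next_free;
	return idx;
}

/* Translate an index back to a pointer; index 0 maps to NULL. */
static struct handle *
idx_pool_get(struct idx_pool *p, uint32_t idx)
{
	assert(idx <= POOL_SIZE);
	return idx ? &p->entries[idx - 1] : NULL;
}

static void
idx_pool_free(struct idx_pool *p, uint32_t idx)
{
	assert(idx != 0 && idx <= POOL_SIZE);
	p->entries[idx - 1].next_free = p->free_head;
	p->free_head = idx;
}
```

The owner stores only the uint32_t returned by idx_pool_alloc() and
calls idx_pool_get() when it needs the pointer.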
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit converts hrxq to indexed.
Using a uint32_t index instead of a pointer saves 4 bytes of memory in
the flow handle. For millions of flows, it saves several MBytes of
memory.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit converts the jump resource to indexed.
The table data struct is allocated from indexed memory. As it is added
to the hash list, the pointer is still used for the hash list search.
The index is added to the table struct, and the pointer in the flow
handle is reduced to a uint32_t index. For flows without jump actions,
it saves 4 bytes of memory.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit converts port id action to indexed.
Using a uint32_t index instead of a pointer saves 4 bytes of memory in
the flow handle. For millions of flows, it saves several MBytes of
memory.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit converts the tag resource to indexed.
As tag resources are added to the hash list, to avoid introducing a
performance issue and to keep the hash list, only the tag resource
memory is allocated from indexed memory. The resources are still added
to the hash list. Adding a four-byte index to the tag resource struct
and changing the tag resource in the flow handle from a pointer to
uint32_t brings no benefit for the tag resource itself, but it saves
memory for flows without a tag action, and also for sub flows that
share one tag action resource.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit converts the push VLAN resource to indexed.
Using a uint32_t index instead of a pointer saves 4 bytes of memory in
the flow handle. For millions of flows, it saves several MBytes of
memory.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit converts the flow encap/decap resource to indexed.
Using a uint32_t index instead of a pointer saves 4 bytes of memory in
the flow handle. For millions of flows, it saves several MBytes of
memory.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Refactor common memory btree and cache management to common driver.
Replace some input parameters of the MR APIs with more common data
structures like PD, port_id, share_cache,... so that multiple PMD
drivers can use those MR APIs.
Modify the mlx5 net PMD to use the MR management APIs from the common
driver.
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Refactor the common multi-process handling code from the net PMD to the
common driver. Use the tuple mp_id{name, port_id} as the standard input
parameter for all multi-process IPC APIs instead of rte_eth_dev.
Modify the net PMD to use the multi-process APIs from the common driver.
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Define a device parameter to configure log 2 of a stride size for MPRQ
- mprq_log_stride_size. User is able to specify a stride size in a range
allowed by an underlying hardware. The default stride size is defined as
2048 bytes to encompass most commonly used packet sizes in the Internet
(MTU 1518 and less) and will be used in case a maximum configured packet
size cannot fit into the largest possible stride size. Otherwise a
stride size is set to a large enough value to encompass a whole packet.
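The selection rule described above can be sketched as below. This is
an illustration under assumptions: the function names, the
DEFAULT_LOG_STRIDE_SIZE macro and the way the hardware limits are
passed in are hypothetical, and clamping small packets up to the
hardware minimum is an assumption of this sketch. Only the default of
2048 bytes (log 2 value 11) comes from the text.

```c
#include <assert.h>
#include <stdint.h>

/* 2^11 = 2048 bytes, the default stride size named in the text. */
#define DEFAULT_LOG_STRIDE_SIZE 11u

/* Smallest n such that 2^n >= v (returns 0 for v <= 1). */
static unsigned int
log2_ceil(uint32_t v)
{
	unsigned int n = 0;

	while ((UINT64_C(1) << n) < v)
		n++;
	return n;
}

/*
 * Pick the log 2 of the stride size for a given maximum packet size.
 * hw_min_log and hw_max_log are the hardware limits (hypothetical
 * parameters of this sketch).
 */
static unsigned int
pick_log_stride_size(uint32_t max_pkt_size, unsigned int hw_min_log,
		     unsigned int hw_max_log)
{
	unsigned int need = log2_ceil(max_pkt_size);

	assert(hw_min_log <= hw_max_log);
	/* The packet cannot fit into the largest stride: use the default. */
	if (need > hw_max_log)
		return DEFAULT_LOG_STRIDE_SIZE;
	/* Otherwise use a stride large enough for a whole packet. */
	return need < hw_min_log ? hw_min_log : need;
}
```

With hardware limits of 6 and 13, for example, a 1518-byte packet maps
to a 2048-byte stride, while a 9000-byte packet does not fit into the
largest stride and falls back to the default.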
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Currently, the counter struct holds both the members used by batch
counters and those used by non-batch counters. The members which are
only used by non-batch counters cost 16 bytes of extra memory for each
batch counter. As there are normally only a limited number of non-batch
counters, mixing the non-batch and batch counter members becomes quite
expensive for batch counters. If 1 million batch counters are created,
16 MB of memory that the batch counters never use is allocated.
Splitting the mlx5_flow_counter struct for batch and non-batch counters
saves this memory.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
When creating a hairpin queue, the total data size and the maximal
number of packets are interrelated. They differ by the stride size.
A larger buffer size means big packets like jumbo can be supported,
but at the same time it introduces more cache misses and has a side
effect on the performance.
Now a new device parameter "hp_buf_log_sz" is introduced for
applications to set the total data buffer size (the logarithm value).
Then the maximal number of packets will also be calculated
automatically by this value.
Applications could also change this value to a larger one in order
to support larger packets in hairpin case. A smaller value will be
beneficial for memory consumption.
If it is not set, the default value will be used.
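As a rough sketch of the relation between the two values, assuming
both the total buffer size and the stride size are powers of two given
as logarithms: the maximal number of packets is the total size divided
by the stride size. The function name and the example values are
hypothetical and are not taken from the PMD.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Maximal number of packets for a hairpin data buffer of
 * 2^log_buf_sz bytes split into strides of 2^log_stride_sz bytes.
 */
static uint32_t
hairpin_max_packets(unsigned int log_buf_sz, unsigned int log_stride_sz)
{
	assert(log_buf_sz >= log_stride_sz);
	assert(log_buf_sz - log_stride_sz < 32);
	return UINT32_C(1) << (log_buf_sz - log_stride_sz);
}
```

For instance, a 64 KB buffer (log value 16) with 64-byte strides (log
value 6) gives 1024 packets in this model.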
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Only the members of flow handle structure will be used when trying
to destroy a flow. Other members of mlx5 device flow resource will
only be used for flow creating, and they could be reused for different
flows.
So only the device flow handle structure needs to be saved for further
usage. This could be separated from the whole mlx5 device flow and
stored with a list for each rte flow.
Other members will be pre-allocated with an array, and an index will
be used to help to apply each device flow to the hardware.
The flow handle sizes of Verbs and DV mode will be different, and
some calculation could be done before allocating a Verbs handle.
Then the total memory consumption will be less for Verbs when there is
no inbox driver being used.
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
When stopping an mlx5 device, all the inserted flows will be flushed
since they are in non-cached mode. No more action will be done for
these flows in the device closing stage.
If the device restarts after being stopped, no flow in non-cached mode
will be re-inserted.
The flush operation through rte interface will remain the same, and
all the flows will be flushed actively.
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The devices of the family ConnectX may have two letters as suffix.
Such suffix is preceded with a space and the second x is lowercase:
- ConnectX-4 Lx
- ConnectX-5 Ex
- ConnectX-6 Dx
Uppercase of the device family name BlueField is also fixed.
The lists of supported devices are fixed.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
This adds a new device ID to the list of Mellanox devices
that run the mlx5 PMD.
- BlueField-2 integrated ConnectX-6 Dx network controller
This device is not ready yet, it is in development stage.
Signed-off-by: Raslan Darawsheh <rasland@mellanox.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The header file rte_config.h is always included by make or meson.
If required in an exported API header file, it must be included
in the public header file for external applications.
In the internal files, explicit include of rte_config.h is useless,
and can be removed.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Matan Azrad <matan@mellanox.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
The mlx5 PMD maintains the list of devices for which the memory
operation callback routines must be invoked to keep the device MRs (MR
is the entity backing the hardware DMA transactions) consistent with the
mapped memory.
Each device context in the list is protected with a dedicated lock on a
per device basis, which might be taken inside the callback routine.
When a device is closing, the PMD frees all MRs by calling
mlx5_mr_release(), which might call rte_free() under the taken device
lock. If this rte_free() call triggers freeing of the entire memory
segment, it in turn invokes the callback routine, and the attempt to
take the lock inside the callback causes a deadlock.
The patch removes the device from the callback list first, then calls
mlx5_mr_release() and frees the remaining device MRs explicitly.
Fixes: 0e3d0525b2 ("net/mlx5: fix memory event callback list")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Get a burst mode information for Rx/Tx queues in mlx5.
Provide callback functions to show this information in
a "show rxq info" and "show txq info" output.
Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Use the MLX5_ASSERT macro instead of the standard assert clause.
Its definition depends on the RTE_LIBRTE_MLX5_DEBUG configuration option.
If RTE_LIBRTE_MLX5_DEBUG is enabled MLX5_ASSERT is equal to RTE_VERIFY
to bypass the global CONFIG_RTE_ENABLE_ASSERT option.
If RTE_LIBRTE_MLX5_DEBUG is disabled, the global CONFIG_RTE_ENABLE_ASSERT
can still make this assert active by calling RTE_VERIFY inside RTE_ASSERT.
Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Use the RTE_LIBRTE_MLX5_DEBUG configuration flag to get rid of dependency
on the NDEBUG definition. This is a preparation step to switch
from standard assert clauses to DPDK RTE_ASSERT ones in MLX5 driver.
Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The inline feature is designed to save PCI bandwidth by copying some
of the data to the WQE. This feature, if enabled, works for all packets.
In some cases when using external memory, the PCI bandwidth is not
relevant since the memory can be accessed by other means.
This commit introduces the ability to control the inline with mbuf
granularity.
In order to use this feature the application should register the field
name, and restart the port.
Signed-off-by: Ori Kam <orika@mellanox.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
As a preparation for moving the Netlink commands to the common library,
reduce the net/mlx5 dependencies.
Replace ethdev class command parameters.
Improve Netlink sequence number mechanism to be controlled by the
mlx5 Netlink mechanism.
Move mlx5_nl_check_switch_info to mlx5_nl.c since it is the only one
which uses it.
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
There might be a case that one Mellanox device can be probed by
multiple mlx5 drivers.
One case is that any mlx5 vDPA device can be probed by both net/mlx5
and vdpa/mlx5.
Add a new mlx5 common API to get the requested driver by devargs:
class=[net/vdpa].
Skip net/mlx5 PMD probing while the device is selected to be probed by
the vDPA driver.
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Move PCI detection by IB device from mlx5 PMD to the common code.
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
A new Mellanox vdpa PMD will be added to support vdpa operations by
Mellanox adapters.
This vdpa PMD design includes mlx5_glue and mlx5_devx operations and
large parts of them are shared with the net/mlx5 PMD.
Create a new common library in drivers/common for mlx5 PMDs.
Move mlx5_glue, mlx5_devx_cmds and their dependencies to the new mlx5
common library in drivers/common.
The files mlx5_devx_cmds.c, mlx5_devx_cmds.h, mlx5_glue.c,
mlx5_glue.h and mlx5_prm.h are moved as is from drivers/net/mlx5 to
drivers/common/mlx5.
Share the log mechanism macros.
Separate also the log mechanism to allow different log level control to
the common library.
Build files and version files are adjusted accordingly.
Include lines are adjusted accordingly.
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The DevX commands interface is included in the mlx5.h file with a lot
of other PMD interfaces.
As an arrangement to make the DevX commands shared with different PMDs,
this patch moves the DevX interface to a new file called mlx5_devx_cmds.h.
Also remove shared device structure dependency on DevX commands.
Replace the DevX commands log mechanism from the mlx5 driver log
mechanism to the EAL log mechanism.
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
A flow with a meter is split into three subflows: the prefix subflow
with the meter action does the coloring, the meter subflow filters the
packets, and the suffix subflow does all the remaining actions for
packets that pass the filter.
Both the color and the subflow match between prefix and suffix use a
register to store the tag.
Some NICs have the meter color register share capability: only the
8 LSB of the register are used for the color, and the remaining 24 MSB
can be used for the flow id match between the meter prefix subflow and
the suffix subflow.
Currently, one entire register is allocated for flow matching, which
leaves NICs with limited registers without enough registers for other
matching.
Add the meter color share capability check to fix the lack of
registers issue.
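The register sharing can be illustrated with a small sketch, assuming
the layout stated above: the color in the 8 least significant bits and
the flow id in the remaining 24 bits. The macro and function names are
hypothetical and are not the PMD's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define COLOR_BITS 8u
#define COLOR_MASK ((UINT32_C(1) << COLOR_BITS) - 1)
/* Largest flow id that fits in the 24 bits left beside the color. */
#define FLOW_ID_MAX ((UINT32_C(1) << (32 - COLOR_BITS)) - 1)

/* Combine a flow id and a meter color into one 32-bit register value. */
static uint32_t
reg_pack(uint32_t flow_id, uint8_t color)
{
	assert(flow_id <= FLOW_ID_MAX);
	return (flow_id << COLOR_BITS) | color;
}

/* Extract the meter color from the 8 least significant bits. */
static uint8_t
reg_color(uint32_t reg)
{
	return (uint8_t)(reg & COLOR_MASK);
}

/* Extract the flow id from the 24 most significant bits. */
static uint32_t
reg_flow_id(uint32_t reg)
{
	return reg >> COLOR_BITS;
}
```

Because the flow id must fit in 24 bits here, any id allocator feeding
such a register has to respect FLOW_ID_MAX.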
Fixes: 9ea9b049a9 ("net/mlx5: split meter flow")
Cc: stable@dpdk.org
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The allocated id is used for the register unique id match. Some
registers may not use the full 32 bits. Add a maximum id to avoid
allocating an id beyond the register restriction.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Add a PMD Unix socket server to enable external tool applications to
trigger a flow dump.
Socket path:
/var/tmp/dpdk_mlx5_<pid>
Socket format:
io_raw: port_id of uint16
file: file descriptor of int
Signed-off-by: Xueming Li <xuemingl@mellanox.com>
Signed-off-by: Xiaoyu Min <jackmin@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The eth devices which share one ibv device only need one hash list of
flow table.
Currently, a flow table hash list is created for each eth device
regardless of whether they share one ibv device or not.
If the devices share one ibv device, the previously created hash list
is left dangling because the pointer to it (sh->flow_tbls) is
overwritten by the later created hash list.
To fix this, don't create the hash list if it is already created.
Fixes: 54534725d2 ("net/mlx5: fix flow table hash list conversion")
Cc: stable@dpdk.org
Reported-by: Zhike Wang <wangzhike@jd.com>
Signed-off-by: Xiaoyu Min <jackmin@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
GENEVE is available in tunnel offloads. Add it as the default support
option.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Ori Kam <orika@mellanox.com>
ConnectX-4 Lx supports multiple packets within a single Tx
descriptor. This feature is named "Legacy Multi-Packet Write"
and imposes a lot of limitations:
- no ACLs, it means no NIC Tx Flows are supported and Tx metadata
become meaningless
- the required minimal inline data must be zero
- no SR-IOV, it means no support in E-Switch configurations,
- no priority and dscp forcing
- no VLAN insertion
- no TSO
- all packets within MPW session must have the same size
This legacy MPW feature is mainly intended for test purposes.
To explicitly engage the feature on ConnectX-4 Lx the devargs
should be specified:
- txq_mpw_en=1
This feature was dropped in 19.08; this patch restores it.
Fixes: 18a1c20044 ("net/mlx5: implement Tx burst template")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Commit in fixes line sets the DV (Direct Verbs) flow engine as default.
Newer versions of DV flow engine use the DR (Direct Rules) features.
DR is supported from RDMA Core library version rdma-core-24.0.
This causes a failure to start the port when using an older rdma-core
version without DR support.
This patch selects DV flow engine if rdma-core version is v24.0 or
higher. Verbs flow engine is selected otherwise.
Fixes: cd4569d2bf ("net/mlx5: change default flow engine to DV")
Signed-off-by: Dekel Peled <dekelp@mellanox.com>
Acked-by: Ori Kam <orika@mellanox.com>
When DR is not supported and DV is supported, the tag action can still
be used by the metadata feature.
The tag hash list was wrongly not created, which caused a failure in
the metadata action creation.
Create the tag hash list in every DV case.
Fixes: 860897d289 ("net/mlx5: reorganize flow tables with hash list")
Signed-off-by: Matan Azrad <matan@mellanox.com>
Testing showed that some hosts suffer a performance penalty imposed
by the write memory barrier required after doorbell writing. Before
the 19.08 release there was a heuristic to decide whether the write
memory barrier should be performed. For bursts of the recommended
size (or a multiple of it) it was assumed that more packets were
coming in the next burst, so the write memory barrier could be
skipped (it was supposed to be performed in the next burst, at least
after descriptor writing).
This patch restores that behaviour; the devarg tx_db_nc=2
must be specified to engage this performance tuning feature.
Fixes: 8409a28573 ("net/mlx5: control transmit doorbell register mapping")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The default flow engine is Verbs flow engine, for legacy reasons.
This patch changes the default to DV flow engine (dv_flow_en = 1).
Documentation is updated accordingly.
Signed-off-by: Dekel Peled <dekelp@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The rdma_core routine mlx5dv_dr_create_flow_action_dest_vport()
requires the vport id parameter to create port action.
The register c[0] value was used to deduce the port id value
and it fails in bonding configuration. The correct way is
to apply vport_num value queried from the rdma_core library.
Fixes: f07341e7ae ("net/mlx5: update source and destination vport translations")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
When DR is not supported and DV is supported, the multi-tables
feature is off.
In this case, only table 0 is supported.
The table 0 structure was wrongly not created, which prevented any
matcher object from being created and even caused crashes.
Create the table hash list in DV case too.
Create table zero empty structure for each domain when DR is not
supported.
Allow NULL DR internal table object to be used.
Fixes: 860897d289 ("net/mlx5: reorganize flow tables with hash list")
Signed-off-by: Matan Azrad <matan@mellanox.com>
The state of the environment variable MLX5_SHUT_UP_BF was not
restored correctly if there was no tx_db_nc devarg specified.
Fixes: 8409a28573 ("net/mlx5: control transmit doorbell register mapping")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Tag action for flow mark/flag could be reused by different flows.
When creating a new flow with mark, the existing tag resources are
traversed in order to check whether the action is already created.
If only one linked list is used, the search rate drops significantly
as the number of tag actions increases.
Using a hash list table speeds up the search, and at the same time
the memory consumption stays small if only a small number of tag
action resources are created (compared to other hash table
implementations). The size of the list heads array could be
optimized with some extendable hash table in the future.
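A minimal sketch of a hash list table: a fixed array of list heads
where each bucket is a singly linked list. The names, the bucket count
and the use of calloc() are assumptions of this illustration, not the
PMD's hash list implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define HLIST_HEADS 64u /* hypothetical table size, a power of two */

struct tag_entry {
	struct tag_entry *next;
	uint32_t tag;    /* lookup key */
	uint32_t refcnt; /* number of flows sharing this tag resource */
};

struct tag_hlist {
	struct tag_entry *heads[HLIST_HEADS];
};

static unsigned int
tag_bucket(uint32_t tag)
{
	return tag & (HLIST_HEADS - 1);
}

/* Walk only the bucket selected by the key, not every resource. */
static struct tag_entry *
tag_lookup(struct tag_hlist *h, uint32_t tag)
{
	struct tag_entry *e;

	assert(h != NULL);
	e = h->heads[tag_bucket(tag)];
	while (e != NULL && e->tag != tag)
		e = e->next;
	return e;
}

/* Reuse an existing tag resource or create and insert a new one. */
static struct tag_entry *
tag_get(struct tag_hlist *h, uint32_t tag)
{
	struct tag_entry *e = tag_lookup(h, tag);

	if (e != NULL) {
		e->refcnt++;
		return e;
	}
	e = calloc(1, sizeof(*e));
	if (e == NULL)
		return NULL;
	e->tag = tag;
	e->refcnt = 1;
	e->next = h->heads[tag_bucket(tag)];
	h->heads[tag_bucket(tag)] = e;
	return e;
}
```

An empty table costs only the array of heads, which is why the memory
stays small when few tag resources exist.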
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
In the current flow tables organization, arrays are used. This is
fast for searching and creating the related objects that will be used
in flow creation, but it introduces some limitation to the table index.
The flow tables information can be reorganized with a hash list instead.
When using a hash list, there is no need to maintain three arrays for
the NIC TX, RX and FDB tables object information.
The table attribute (NIC TX, NIC RX or FDB) could be used together with
the table ID to generate a 64-bit key that is unique for the hash list
insertion, lookup and deletion.
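One possible way to build such a key is sketched below. The bit layout
(table type in the upper 32 bits, table ID in the lower 32 bits) and
all names are assumptions made for illustration; the message does not
state the actual layout.

```c
#include <stdint.h>

/* Hypothetical table types replacing the three separate arrays. */
enum tbl_domain {
	TBL_NIC_RX = 0,
	TBL_NIC_TX = 1,
	TBL_FDB = 2,
};

/* Build a 64-bit key that is unique per (table type, table ID) pair. */
static uint64_t
tbl_key(enum tbl_domain domain, uint32_t table_id)
{
	return ((uint64_t)domain << 32) | table_id;
}
```

Two tables with the same ID in different domains then get different
keys, so one hash list can hold all of them.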
Signed-off-by: Bing Zhao <bingz@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The rdma core library can map doorbell register in two ways,
depending on the environment variable "MLX5_SHUT_UP_BF":
- as regular cached memory, the variable is either missing or
set to zero. This type of mapping may cause the significant
doorbell register writing latency and requires explicit
memory write barrier to mitigate this issue and prevent
write combining.
- as non-cached memory, the variable is present and set to
a value other than "0". This type of mapping may cause a
performance impact under heavy loading conditions, but the
explicit write memory barrier is not required and it may
improve core performance.
A new devarg "tx_db_nc" is introduced. If this parameter is
set to zero, the doorbell register is forced to be mapped to
cached memory and requires an explicit memory barrier after
writing to it. If "tx_db_nc" is set to a non-zero value, the
doorbell is mapped as non-cached memory, not requiring the
memory barrier. If "tx_db_nc" is missing, the behaviour is
defined by the presence of "MLX5_SHUT_UP_BF" in the environment.
If the variable is missing too, the default value is zero for
ARM64 hosts and one for others.
In run time the code checks the mapping type and provides the
memory barrier after writing to tx doorbell register if it is
needed. The mapping type is extracted directly from the
uar_mmap_offset field in the queue properties.
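The precedence between the devarg, the environment variable and the
host default can be sketched as follows. The function and macro names
are hypothetical, and the sketch models only the decision described in
this message, not how the PMD reads its devargs or performs the
mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TX_DB_NC_UNSET (-1) /* hypothetical marker: devarg not given */

/*
 * devarg:   value of tx_db_nc, or TX_DB_NC_UNSET when it is missing.
 * env:      value of MLX5_SHUT_UP_BF, or NULL when it is missing.
 * is_arm64: true on ARM64 hosts.
 * Returns true when the doorbell is mapped as non-cached memory.
 */
static bool
doorbell_non_cached(int devarg, const char *env, bool is_arm64)
{
	assert(devarg >= TX_DB_NC_UNSET);
	/* 1. The devarg, when present, overrides everything else. */
	if (devarg != TX_DB_NC_UNSET)
		return devarg != 0;
	/* 2. Otherwise the environment variable decides. */
	if (env != NULL)
		return strcmp(env, "0") != 0;
	/* 3. Otherwise the default: zero on ARM64, one elsewhere. */
	return !is_arm64;
}
```

A caller could pass getenv("MLX5_SHUT_UP_BF") as the env argument; a
false result means an explicit write memory barrier is needed after
the doorbell write.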
Fixes: 18a1c20044 ("net/mlx5: implement Tx burst template")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
When the port is closed or the program exits ungracefully, the meter
rules should be flushed after the flows are destroyed.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This commit adds the basic meter operations for meter create and destroy.
New internal functions in rte_mtr_ops callback:
1. create()
2. destroy()
The create() callback will create the corresponding flow rules on the
meter table.
The destroy() callback destroys the flow rules on the meter table.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This commit adds support for the meter profile add and delete operations.
New internal functions in rte_mtr_ops callback:
1. meter_profile_add()
2. meter_profile_delete()
Only the RTE_MTR_SRTCM_RFC2697 algorithm is supported and can be added.
Adding any other algorithm reports an error.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The meter needs a metadata REG_C register to match the color between
the prefix flow and the meter flow.
As the user-defined or metadata feature will both use the REG_C in the
suffix flow, the color match register the meter uses will not impact the
register use in the later sub flow.
Another case is that a tag is added before the meter flow. In this case,
the meter should not touch the register the tag action is using. To
avoid that, the meter should reserve the REG_Cs used by the user-defined
MLX5_APP_TAG.
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This commit adds support for filling and getting the meter capabilities
from DevX.
Supported items:
1. The srTCM color bind mode.
2. Meter share with multiple flows.
3. Action drop.
The color aware mode and multiple meter chaining in a flow are not
supported.
New internal function in rte_mtr_ops callback:
1. capabilities_get()
Signed-off-by: Suanming Mou <suanmingm@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
While reg_c[meta] can be copied to reg_b simply by modify-header
action (it is supported by hardware), it is not possible to copy
reg_c[mark] to the STE flow_tag as flow_tag is not a metadata
register and this is not supported by hardware. Instead, it
should be manually set by a flow per each unique MARK ID. For
this purpose, there should be a dedicated flow table -
RX_CP_TBL and all the Rx flow should pass by the table
to properly copy values from the register to flow tag field.
And for each MARK action, a copy flow should be added
to RX_CP_TBL according to the MARK ID like:
(if reg_c[mark] == mark_id),
flow_tag := mark_id / reg_b := reg_c[meta] / jump to RX_ACT_TBL
For SET_META action, there can be only one default flow like:
reg_b := reg_c[meta] / jump to RX_ACT_TBL
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Values set by MARK and SET_META actions should be carried over
to the VF representor in case of flow miss on Tx path. However,
as not all metadata registers are preserved across the different
domains (NIC Rx/Tx and E-Switch FDB), as a workaround, those
values should be carried by reg_c's which are preserved across
domains and copied to STE flow_tag (MARK) and reg_b (META) fields
in the last stage of flow steering, in order to scatter those
values to flow_tag and flow_table_metadata of CQE.
While reg_c[meta] can be copied to reg_b simply by modify-header
action (it is supported by hardware), it is not possible to copy
reg_c[mark] to the STE flow_tag as flow_tag is not a metadata
register and this is not supported by hardware. Instead, it should
be manually set by a flow per MARK ID. For this purpose, there
should be a dedicated flow table - RX_CP_TBL and all the Rx flow
should pass by the table to properly copy values.
As the last action of Rx flow steering must be a terminal action
such as QUEUE, RSS or DROP, if a user flow has Q/RSS action, the
flow must be split in order to pass by the RX_CP_TBL. And the
remained Q/RSS action will be performed by another dedicated
action table - RX_ACT_TBL.
For example, for an ingress flow:
pattern,
actions_having_QRSS
it must be split into two flows. The first one is,
pattern,
actions_except_QRSS / copy (reg_c[2] := flow_id) / jump to RX_CP_TBL
and the second one in RX_ACT_TBL.
(if reg_c[2] == flow_id),
action_QRSS
where flow_id is a uniquely allocated and managed identifier.
This patch implements the Rx flow splitting and builds the RX_ACT_TBL.
Also, per each egress flow on NIC Tx, a copy action (reg_c[]= reg_a)
should be added in order to transfer metadata from WQE.
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The metadata register reg_c[0] might be used by kernel or
firmware for their internal purposes. The actual used mask
can be queried from the kernel. The remaining bits can be
used by the PMD to provide the META or MARK feature. The code queries
the mask of reg_c[0] and adjusts the resource usage dynamically.
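How the remaining bits might be derived from a queried mask is
sketched below. This is an illustration only: the structure, the
function name and the choice of taking the lowest contiguous run of
free bits are assumptions, and the query of the mask itself is not
shown.

```c
#include <stdint.h>

/* Position and width of a bit field inside a 32-bit register. */
struct reg_field {
	unsigned int shift;
	unsigned int width;
};

/*
 * Given the mask of reg_c[0] bits already used by the kernel or the
 * firmware, return the lowest contiguous run of free bits.
 */
static struct reg_field
reg_c0_free_field(uint32_t used_mask)
{
	uint32_t avail = ~used_mask;
	struct reg_field f = { 0, 0 };

	if (avail == 0)
		return f; /* nothing left for the PMD */
	/* Skip the used bits below the first free one. */
	while (!(avail & 1)) {
		avail >>= 1;
		f.shift++;
	}
	/* Count the contiguous free bits. */
	while (avail & 1) {
		avail >>= 1;
		f.width++;
	}
	return f;
}
```

A width of 16, for example, would mean only a 16-bit value can be
carried in this register.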
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The PMD parameter dv_xmeta_en is added to control extensive
metadata support. A nonzero value enables extensive flow
metadata support if device is capable and driver supports it.
This can enable extensive support of MARK and META item of
rte_flow. The newly introduced SET_TAG and SET_META actions
do not depend on dv_xmeta_en parameter, because there is
no compatibility issue for new entities. The dv_xmeta_en is
disabled by default.
There are some possible configurations, depending on parameter
value:
- 0, this is default value, defines the legacy mode, the MARK
and META related actions and items operate only within NIC Tx
and NIC Rx steering domains, no MARK and META information
crosses the domain boundaries. The MARK item is 24 bits wide,
the META item is 32 bits wide.
- 1, this engages extensive metadata mode, the MARK and META
related actions and items operate within all supported steering
domains, including FDB, MARK and META information may cross
the domain boundaries. The ``MARK`` item is 24 bits wide, the
META item width depends on kernel and firmware configurations
and might be 0, 16 or 32 bits. Within the NIC Tx domain the META data
width is 32 bits for compatibility, the actual width of data
transferred to the FDB domain depends on the kernel configuration
and may vary. The actual supported width can be retrieved
at runtime by a series of rte_flow_validate() trials.
- 2, this engages extensive metadata mode, the MARK and META
related actions and items operate within all supported steering
domains, including FDB, MARK and META information may cross
the domain boundaries. The META item is 32 bits wide, the MARK
item width depends on kernel and firmware configurations and
might be 0, 16 or 24 bits. The actual supported width can be
retrieved at runtime by a series of rte_flow_validate() trials.
If there is no E-Switch configuration the ``dv_xmeta_en`` parameter is
ignored and the device is configured to operate in legacy mode (0).
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The metadata registers reg_c provide support for the TAG and
SET_TAG features. Although 8 registers are available
on the current mlx5 devices, some of them can be reserved.
The availability should be queried by the iterative trial-and-error
implemented by the mlx5_flow_discover_mreg_c() routine.
If reg_c is available, it can be inferred that extensive
metadata support is possible, e.g. the metadata
register copy action, support for 16 modify header actions
(instead of 8 by default), preserving registers across
different domains (FDB and NIC) and so on.
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This adds new device IDs to the list of Mellanox devices
that run the mlx5 PMD.
- ConnectX-6DX device ID
- ConnectX-6DX SRIOV device ID
Signed-off-by: Raslan Darawsheh <rasland@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
DRV_LOG macro is used to print log messages, one per line.
In several locations this macro is used with redundant '\n' character
at the end of the log message, causing blank lines between log lines.
This patch removes the '\n' character where it is redundant.
Signed-off-by: Dekel Peled <dekelp@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Since the encap action is not supported in RX, we need to split the
hairpin flow into RX and TX.
Signed-off-by: Ori Kam <orika@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
When splitting flows, for example in hairpin / metering, there is a need
to combine the flows. This is done using an ID.
This commit introduces a simple way to generate such IDs.
A bitmap was not used because its release and
allocation are O(n), while in the chosen approach the allocation and
release are O(1).
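One way to get O(1) allocation and release is a stack of released IDs
combined with a counter of never-used IDs. The sketch below illustrates
that idea only; the names (id_pool, id_pool_get and so on) are
hypothetical, it is meant for small ID ranges, and it is not claimed to
match the code added by this commit.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative ID pool: IDs are handed out from the range 1..max_id. */
struct id_pool {
	uint32_t *free_ids; /* Stack of released IDs. */
	uint32_t nb_free;   /* Number of IDs currently on the stack. */
	uint32_t next_id;   /* Next ID that was never handed out. */
	uint32_t max_id;    /* Largest ID that may be handed out. */
};

static int
id_pool_init(struct id_pool *p, uint32_t max_id)
{
	/* At most max_id IDs can be outstanding, so the stack never overflows
	 * as long as every ID is released at most once. */
	p->free_ids = malloc(sizeof(*p->free_ids) * (max_id ? max_id : 1));
	if (p->free_ids == NULL)
		return -1;
	p->nb_free = 0;
	p->next_id = 1;
	p->max_id = max_id;
	return 0;
}

/* O(1): pop a released ID if there is one, else take a fresh one. */
static int
id_pool_get(struct id_pool *p, uint32_t *id)
{
	if (p->nb_free > 0) {
		*id = p->free_ids[--p->nb_free];
		return 0;
	}
	if (p->next_id > p->max_id)
		return -1; /* Pool exhausted. */
	*id = p->next_id++;
	return 0;
}

/* O(1): push the ID back. The caller must own the ID. */
static void
id_pool_release(struct id_pool *p, uint32_t id)
{
	p->free_ids[p->nb_free++] = id;
}

static void
id_pool_destroy(struct id_pool *p)
{
	free(p->free_ids);
	p->free_ids = NULL;
}
```

The trade-off against a bitmap is memory: the stack needs one 32-bit
slot per possible ID instead of one bit.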
Signed-off-by: Ori Kam <orika@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit adds the hairpin get capabilities function.
Signed-off-by: Ori Kam <orika@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit adds the support for creating Tx hairpin queues.
Hairpin queue is a queue that is created using DevX and only used
by the HW.
Signed-off-by: Ori Kam <orika@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Currently all Tx queues are created using Verbs.
This commit modifies the naming so it will not include verbs,
since in the next commit a new type will be introduced (hairpin).
Signed-off-by: Ori Kam <orika@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit adds the support for creating Rx hairpin queues.
Hairpin queue is a queue that is created using DevX and only used
by the HW. As a result, the data part of the RQ is not
used.
Signed-off-by: Ori Kam <orika@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Rx queue for LRO is created using DevX. Flows created on this queue
must use the DV flow engine.
This patch adds a check of dv_flow_en=1 when configuring LRO support
on device spawn.
Documentation is updated accordingly.
Fixes: 175f1c21d0 ("net/mlx5: check conditions to enable LRO")
Cc: stable@dpdk.org
Signed-off-by: Dekel Peled <dekelp@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The DevX counter management triggers an asynchronous event to get back
the new counter values from the HW.
The counter management doesn't trigger 2 parallel events for the same
pool, hence the pool cannot be updated again during the event waiting time.
When the port was stopped, the DevX event mechanism was wrongly
destroyed, which left all the waiting pools in the waiting state forever.
As a result, the counters of the stuck pools were never updated again.
Separate the DevX interrupt installation from the dev installation and
remove the DevX interrupt unregistration\registration from the
stop\start operations.
Now, the DevX interrupt is installed in probe and uninstalled in
close.
Cc: stable@dpdk.org
Fixes: f15db67df0 ("net/mlx5: accelerate DV flow counter query")
Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The routine mlx5dv_query_devx_port() was called directly
instead of using the mlx5 glue thunk.
Fixes: d5c06b1b10 ("net/mlx5: query vport index match mode and parameters")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
In LAG configuration the devices in the same switch domain
might be spawned on the base of different PCI devices, so
we should check all devices backed by the mlx5 PMD whether they
belong to the specified switch domain. When the new devices are
being created it is not possible to detect whether the
sibling devices created in the current probe() loop belong
to the driver, because the driver field is not filled yet (it
will be done on successful return of the current probe()). This
patch updates the device scanning, allowing an extra match on
the current backing PCI device, which is being used to create
siblings.
Fixes: f7e95215ac ("net/mlx5: extend switch domain searching range")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The devices backed by the mlx5 PMD might share the same multiport
Infiniband device context. This concerns representors and slaves
of a bonding device. These ports are spawned with devargs.
This patch checks whether the configuration deduced from these
devargs is compatible with the configurations of devices
sharing the same context. It prevents incorrect
whitelists, like:
-w 82:00.0,representor=0,dv_flow_en=1
-w 82:00.0,representor=1,dv_flow_en=0
The representors with indices [0-1] are supposed to be spawned
over the same PCI device, but there is a dv_flow_en parameter
mismatch.
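The compatibility check described above might look roughly like the
sketch below. The port_cfg structure and find_sibling_mismatch() are
hypothetical names used for illustration; only the dv_flow_en parameter
from the example is compared, and the real check may cover more
parameters.

```c
#include <stddef.h>

/* Illustrative subset of the configuration deduced from devargs. */
struct port_cfg {
	unsigned int dv_flow_en;
};

/*
 * Compare the configuration of a port being spawned against the ports
 * already spawned over the same shared context. Returns the index of
 * the first sibling with a different setting, or -1 when all siblings
 * are compatible.
 */
static int
find_sibling_mismatch(const struct port_cfg *siblings, size_t n,
		      const struct port_cfg *cfg)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (siblings[i].dv_flow_en != cfg->dv_flow_en)
			return (int)i;
	return -1;
}
```

With the whitelist above, representor 0 (dv_flow_en=1) is spawned first
and representor 1 (dv_flow_en=0) is then reported as a mismatch.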
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
With bonding configuration multiple PFs may represent the
single switching device with multiple ports as representors.
To distinguish representors belonging to different PFs we
should generate unique port IDs. It is proposed to use
the PF index in bonding configuration to generate these
unique port IDs.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
With bonding configurations the switch domain may be shared
between multiple PCI devices, so we should search the switch
sibling devices within the entire set of present ethernet
devices backed by the mlx5 PMD.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The new kernel/rdma-core [1] supports matching on a metadata
register instead of the vport field to provide operations over
VF LAG bonding configurations. The patch retrieves parameters
and information about the way that is engaged to match vport on E-Switch.
[1] http://patchwork.ozlabs.org/cover/1122170/
"Mellanox, mlx5 vport metadata matching"
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
If a bonding Infiniband device is found, the unified E-Switch
is assumed and extra rdma-core/kernel support is needed
to retrieve vport indices. The patch introduces the defines
for this feature and adds a bonding support check to the probe routine.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
If the device is a VF LAG bonding one, the port name includes
the bonding Infiniband device name and looks like:
82:00.0_mlx5_bond_0 - for master device port PF0
82:00.1_mlx5_bond_0_representor_5 - for representor
VF5 over PF1
where bonding Infiniband device mlx5_bond_0 controls
the 82:00.0 as PF0 and 82:00.1 as PF1 PCI functions.
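The naming scheme above can be reproduced with a small formatting
helper. This is a sketch only: bond_port_name() is a hypothetical
function, and a negative representor argument is an assumed convention
here for the master port.

```c
#include <stdio.h>

/*
 * Build a port name for a VF LAG bonding device.
 * pci_bdf     - PCI address of the PF, e.g. "82:00.1"
 * ibdev       - bonding Infiniband device name, e.g. "mlx5_bond_0"
 * representor - VF representor index, or negative for the master port
 * Returns the snprintf() result (length the full name requires).
 */
static int
bond_port_name(char *buf, size_t len, const char *pci_bdf,
	       const char *ibdev, int representor)
{
	if (representor < 0)
		return snprintf(buf, len, "%s_%s", pci_bdf, ibdev);
	return snprintf(buf, len, "%s_%s_representor_%d",
			pci_bdf, ibdev, representor);
}
```

Calling it with "82:00.1", "mlx5_bond_0" and 5 yields the representor
name shown in the example.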
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The Mellanox NICs starting from ConnectX-5 support LAG over
NIC ports internally, implemented by the NIC firmware and hardware.
The multiport NIC presents multiple physical PCI functions (PFs),
and with SR-IOV multiple virtual PCI functions (VFs) might be presented.
With switchdev mode the VF representors are engaged and PFs and their
VFs are connected by the internal E-Switch feature. Each PF and its
related VFs have a dedicated E-Switch and belong to a dedicated switch
domain.
If NIC ports are combined to support LAG, the kernel drivers introduce
a single unified Infiniband multiport device, and only one
unified E-Switch with a single switch domain combines the master PF
and all VFs. No extra DPDK bonding device is needed.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
At device probing the device list to spawn was allocated
as a dynamically sized local variable. It was not possible to have
one unified exit point from the routine due to compiler warnings.
This patch allocates the spawn device list directly with
rte_zmalloc(), so it is possible to jump to a unified exit
label from anywhere in the routine.
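The pattern can be sketched as below. In C a goto must not jump into
the scope of a variable length array, which is the kind of restriction
a heap allocation avoids. The sketch uses calloc() as a stand-in for
rte_zmalloc(); spawn_data and probe_sketch() are illustrative names,
and the fail_at argument only simulates an error.

```c
#include <stdlib.h>

/* Illustrative payload of one entry in the spawn list. */
struct spawn_data {
	int port_id;
};

/*
 * Prepare n spawn entries. Returns n on success or -1 on failure.
 * fail_at simulates an error while handling that entry (-1: no error).
 */
static int
probe_sketch(unsigned int n, int fail_at)
{
	struct spawn_data *list = NULL;
	unsigned int i;
	int ret = -1;

	/* Heap allocation instead of a variable length array. */
	list = calloc(n ? n : 1, sizeof(*list));
	if (list == NULL)
		goto exit;
	for (i = 0; i < n; i++) {
		if ((int)i == fail_at)
			goto exit; /* Every error path uses the same label. */
		list[i].port_id = (int)i;
	}
	ret = (int)n;
exit:
	free(list); /* free(NULL) is a no-op, so this is always safe. */
	return ret;
}
```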
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The routine mlx5_ibv_device_to_pci_addr() takes an Infiniband
device list object, takes the device sysfs path from there
and retrieves the PCI address. The routine may be implemented
in a more generic way by taking the sysfs path directly as a parameter,
so that it can also be used for getting the PCI address of netdevs.
The generic routine is renamed to mlx5_dev_to_pci_addr().
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Now all devices created over the same multiport IB device
have a shared context containing the backing PCI device field.
For the VF LAG configurations it becomes possible that
representors are connected to VFs created over different
PFs. In this case the representors have different backing
PCI devices and the mentioned field should be moved to the device
private area.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The PCI virtual function type was not recognized correctly
for ConnectX-6 VF.
Fixes: f0354d8423 ("net/mlx5: add ConnectX-6 device IDs")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The PCI virtual function type was not recognized correctly
for BlueField VF.
Fixes: f38c54571d ("net/mlx5: split PCI from generic probing")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
This patch implements ethdev operations get_module_info and
get_module_eeprom, to support ethtool commands ETHTOOL_GMODULEINFO
and ETHTOOL_GMODULEEEPROM.
New functions mlx5_get_module_info() and mlx5_get_module_eeprom()
are added in mlx5_ethdev.c.
Signed-off-by: Dekel Peled <dekelp@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
This commit adds support for RTE_FLOW_ACTION_TYPE_OF_POP_VLAN via
direct verbs flow rules.
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Restrict this header inclusion to its real users.
Fixes: 028669bc9f ("eal: hide shared memory config")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
This adds support for adding a new UDP tunnel port
for specific VXLAN types.
Currently we only support VXLAN and VXLAN-GPE, on ports
4789 and 4790 respectively, without having to configure
anything in the NIC.
Signed-off-by: Raslan Darawsheh <rasland@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The shared Infiniband device context should be included
into the memory event callback list only once on context creation,
and removed from the list only once on context destruction.
Multiple insertions of the same object caused an infinite
loop on the list processing.
Fixes: ccb3815346 ("net/mlx5: update memory event callback for shared context")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
The patch [Fixes] sets the default value of required minimal
inline data to 0 bytes. On some configurations (depending
on switchdev/legacy settings and FW version/settings)
the ConnectX-4LX NIC requires a minimum of 18 bytes of
Tx descriptor inline data to operate correctly.
A default value wrongly set to 0 may prevent the NIC from operating
with out-of-the-box settings, so this patch reverts the default
value for ConnectX-4LX back to 18 bytes (inline L2).
Fixes: 9f350504bb ("net/mlx5: fix ConnectX-4LX minimal inline data limit")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
The function mlx5_set_min_inline() includes a switch() that checks
various PCI device IDs in order to set the txq_inline_min value. No
value is set when the PCI device ID matches the ConnectX-5 adapters,
resulting in an assert() failure later in the function
mlx5_set_txlimit_params().
This error was encountered on an IBM Power 9 system running RHEL 7.6
w/o Mellanox OFED installed.
Fixes: 38b4b397a5 ("net/mlx5: add Tx configuration and setup")
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
On some virtual setups (particularly on ESXi) when we have SR-IOV and
E-Switch enabled, there is a problem receiving VLAN traffic on VF
interfaces. The NIC driver in the ESXi hypervisor does not set up the
E-Switch vport setting correctly and VLAN traffic targeted to the VF is
dropped.
The patch provides a temporary workaround: if a rule
containing the VLAN pattern is being installed for a VF, the VLAN
network interface over the VF is created, as the following command does:
ip link add link vf.if name mlx5.wa.1.100 type vlan id 100
The PMD in DPDK maintains the database of created VLAN interfaces
for each existing VF and requested VLAN tags. When all of the RTE
Flows using the given VLAN tag are removed the created VLAN interface
with this VLAN tag is deleted.
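The bookkeeping described above amounts to a reference count per VLAN
tag. The following sketch shows one possible shape of it; vf_vlan_db
and its helpers are hypothetical names, and the actual creation and
deletion of the interface is left to the caller.

```c
#include <stdint.h>

#define VLAN_ID_MAX 4095u /* VLAN IDs are 12 bits wide. */

/* Per-VF database: number of flows using each VLAN tag. */
struct vf_vlan_db {
	uint32_t refcnt[VLAN_ID_MAX + 1];
};

/*
 * Take a reference on a VLAN tag. Returns 1 when the caller must
 * create the VLAN interface (first user), 0 when it already exists,
 * -1 on an invalid tag.
 */
static int
vf_vlan_acquire(struct vf_vlan_db *db, uint16_t tag)
{
	if (tag > VLAN_ID_MAX)
		return -1;
	return db->refcnt[tag]++ == 0;
}

/*
 * Drop a reference on a VLAN tag. Returns 1 when the caller must
 * delete the VLAN interface (last user gone), 0 when other flows
 * still use it, -1 on an invalid tag or an unbalanced release.
 */
static int
vf_vlan_release(struct vf_vlan_db *db, uint16_t tag)
{
	if (tag > VLAN_ID_MAX || db->refcnt[tag] == 0)
		return -1;
	return --db->refcnt[tag] == 0;
}
```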
The name of created VLAN interface follows the format:
evmlx.d1.d2, where d1 is VF interface ifindex, d2 - VLAN ifindex
Implementation limitations:
- mask in rules is ignored, rule must specify VLAN tags exactly,
no wildcards (which are implemented by the masks) are allowed
- virtual environment is detected via rte_hypervisor() call,
and the type of hypervisor is checked. Currently we engage
the workaround for ESXi and unrecognized hypervisors (which
always happens on platforms other than x86, meaning the workaround
is applied for the Flow over PCI VF). There is no confirmed data
that the other hypervisors (HyperV, Qemu) need this workaround;
we are trying to reduce the list of configurations on which
the workaround should be applied.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Mellanox ConnectX-4LX NIC in configurations with disabled
E-Switch can operate without minimal required inline data
in the Tx descriptor. There was a hardcoded limit set to
18B in the PMD; it is fixed to be no limit (0B).
Fixes: 38b4b397a5 ("net/mlx5: add Tx configuration and setup")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Matan Azrad <matan@mellanox.com>