A PMD parameter (rxq_cqe_pad_en) is added to enable 128B padding of the
CQE on the Rx side. The CQE size is aligned with the cacheline size of
the core: if the cacheline size is 128B, the CQE size is configured to
be 128B even though the device writes only 64B of data on the
cacheline. This avoids unnecessary cache invalidation caused by the
device's two consecutive writes to one cacheline. However, on some
architectures it is more beneficial to update the entire cacheline by
padding the remaining 64B rather than striding, because read-modify-
write can hurt performance significantly. On the other hand, writing
the extra data consumes more PCIe bandwidth and can lower the maximum
throughput. It is recommended to set this parameter empirically.
Disabled by default.
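For illustration, a hedged sketch of enabling the devarg from an
application through EAL arguments (the PCI address and argv layout are
assumptions):

  #include <rte_eal.h>

  int main(void)
  {
      /* whitelist the port and enable 128B CQE padding on it */
      char *argv[] = { "app", "-w", "0000:03:00.0,rxq_cqe_pad_en=1" };

      if (rte_eal_init(3, argv) < 0)
          return -1;
      return 0;
  }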
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Removed the DEV_RX_OFFLOAD_CRC_STRIP offload flag.
Without any specific Rx offload flag, the default behavior of PMDs is
to strip the CRC.
PMDs that support keeping the CRC should advertise the
DEV_RX_OFFLOAD_KEEP_CRC Rx offload capability.
Applications that require keeping the CRC should check the PMD
capability first and, if it is supported, enable this feature by
setting DEV_RX_OFFLOAD_KEEP_CRC in the Rx offload flags passed to
rte_eth_dev_configure().
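A minimal application-side sketch of that check (port_id and the queue
counts are illustrative):

  #include <rte_ethdev.h>

  struct rte_eth_dev_info dev_info;
  struct rte_eth_conf conf = {0};

  rte_eth_dev_info_get(port_id, &dev_info);
  if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_KEEP_CRC)
      conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
  rte_eth_dev_configure(port_id, 1, 1, &conf);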
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Tomasz Duszynski <tdu@semihalf.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Jan Remes <remes@netcope.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Hyong Youb Kim <hyonkim@cisco.com>
The size of the Rx queue is determined by dividing the number of
descriptors by the number of strides. As the device can't support a
single-slot queue, MPRQ shouldn't be enabled if the number of
descriptors is the same as the number of strides; otherwise this will
cause a HW fault. For example, if rxd is set to 512 with testpmd on
ConnectX-4 Lx, the PMD can't receive more than 512 packets because the
minimum number of strides for ConnectX-4 Lx is 512. Users have to
configure a larger number of descriptors in this case.
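The constraint worked out with the numbers above (variable names are
illustrative):

  unsigned int desc = 512;            /* testpmd --rxd=512 */
  unsigned int strd_n = 512;          /* min. strides on ConnectX-4 Lx */
  unsigned int slots = desc / strd_n; /* 1: a single-slot queue */

  if (slots <= 1)
      mprq_en = 0; /* keep MPRQ disabled to avoid the HW fault */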
Fixes: 7d6bf6b866b8 ("net/mlx5: add Multi-Packet Rx support")
Cc: stable@dpdk.org
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
If MPRQ is enabled, a PMD-private mempool is allocated. For ConnectX-4
Lx, the minimum number of strides is 512, whereas ConnectX-5 supports
as few as 8. This results in quite a small number of elements for the
MPRQ mempool. For example, if the size of the Rx ring is configured as
512 and only one Rx queue is configured, a single MPRQ buffer can cover
the whole ring. In the following code in mlx5_mprq_alloc_mp(), desc is
1 and obj_num will be 36 as a result.
  desc *= 4;
  obj_num = desc + MLX5_MPRQ_MP_CACHE_SZ * priv->rxqs_n;
However, rte_mempool_create_empty() has a sanity check that refuses a
large per-lcore cache size compared to the number of elements: the
cache flush threshold must not exceed the number of elements of a
mempool. For the above example, the threshold is 32 * 1.5 = 48, which
is larger than 36, and mempool creation fails.
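Worked out, assuming MLX5_MPRQ_MP_CACHE_SZ is 32 and the mempool cache
flush-threshold multiplier is 1.5:

  unsigned int desc = 512 / 512;    /* MPRQ buffers to cover the ring */
  desc *= 4;                        /* -> 4 */
  unsigned int obj_num = desc + 32 * 1;  /* one Rx queue -> 36 */
  unsigned int thresh = 32 * 3 / 2;      /* cache flush threshold: 48 */
  /* 48 > 36: rte_mempool_create_empty() rejects the configuration */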
Fixes: 7d6bf6b866b8 ("net/mlx5: add Multi-Packet Rx support")
Cc: stable@dpdk.org
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
There are dedicated QP attributes, a tunnel offload flag and mask,
which must be configured in order to allow part of the HW tunnel
offloads.
So, if a QP is pointed to by a tunnel flow, the above QP attributes
should be configured.
The mask configuration was wrongly performed only if internal RSS was
configured by the user, while there is no reason to make the tunnel
offloads conditional on the RSS configuration.
Consequently, some of the tunnel offloads were not performed by the HW
when a tunnel flow was configured; for example, the packet tunnel types
were not reported to the user.
Replace the internal RSS condition with the tunnel flow condition.
Fixes: df6afd377ace ("net/mlx5: remove useless arguments in hrxq API")
Signed-off-by: Matan Azrad <matan@mellanox.com>
This patch adds support for building and running the mlx5 PMD on
32-bit systems such as i686.
The main issue to tackle was handling the 32-bit access to the UAR, as
quoted from the mlx5 PRM:
QP and CQ DoorBells require 64-bit writes. For best performance, it
is recommended to execute the QP/CQ DoorBell as a single 64-bit write
operation. For platforms that do not support 64-bit writes, it is
possible to issue the 64-bit DoorBells through two consecutive writes,
each writing 32 bits, as described below:
* The order of writing each of the Dwords is from lower to upper
addresses.
* No other DoorBell can be rung (or even start ringing) in the midst
of an on-going write of a DoorBell over a given UAR page.
The last rule implies that in a multi-threaded environment, the access
to a UAR page (which can be accessible by all threads in the process)
must be synchronized (for example, using a semaphore) unless an atomic
write of 64 bits in a single bus operation is guaranteed. Such
synchronization is not required when ringing DoorBells on different
UAR pages.
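A minimal sketch of such a guarded 32-bit DoorBell write (the lock
scope and helper name are assumptions, not the PMD's exact code):

  #include <stdint.h>
  #include <rte_io.h>
  #include <rte_spinlock.h>

  static rte_spinlock_t uar_lock = RTE_SPINLOCK_INITIALIZER;

  static inline void
  doorbell_write(uint64_t val, volatile void *addr)
  {
  #ifdef RTE_ARCH_64
      rte_write64(val, addr); /* single 64-bit bus write */
  #else
      volatile uint32_t *db = (volatile uint32_t *)addr;

      /* no other DoorBell may start on this UAR page meanwhile */
      rte_spinlock_lock(&uar_lock);
      rte_write32((uint32_t)val, db);             /* lower Dword first */
      rte_write32((uint32_t)(val >> 32), db + 1); /* then upper Dword */
      rte_spinlock_unlock(&uar_lock);
  #endif
  }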
Signed-off-by: Moti Haimovsky <motih@mellanox.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Supporting the RSS level only requires adding a bit in the hash_fields
already provided by this API. For tunnels, it is necessary to request
such a queue to compute the checksum on the innermost header; this
last offload should always be activated.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
ConnectX-4 and ConnectX-5 support only a 40-byte RSS key; using a
compile-time-sized hash key is not necessary.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Drop queues are essentially used in flows because of the Verbs API;
the information on whether the fate of a flow is to drop is already
present in the flow itself. Because of this, drop queues can be fully
mapped onto regular queues.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
DEV_RX_OFFLOAD_KEEP_CRC offload flag is added. PMDs that support
keeping CRC should advertise this offload capability.
The DEV_RX_OFFLOAD_CRC_STRIP flag will remain for one more release;
the default behavior in PMDs is to keep the CRC until this flag is
removed. Until the DEV_RX_OFFLOAD_CRC_STRIP flag is removed:
- Setting both KEEP_CRC & CRC_STRIP is INVALID
- Setting only CRC_STRIP: the PMD should strip the CRC
- Setting only KEEP_CRC: the PMD should keep the CRC
- Setting neither: the PMD should keep the CRC
A helper function, rte_eth_dev_is_keep_crc(), has been added to make
it possible to change the no-flag behavior with minimal changes in
PMDs.
PMDs that don't report the DEV_RX_OFFLOAD_KEEP_CRC offload can remove
the rte_eth_dev_is_keep_crc() checks in the next release; the related
code is commented to help with that maintenance task.
DEV_RX_OFFLOAD_CRC_STRIP has also been added to the virtual drivers:
since they don't use the CRC at all, virtual PMDs should not return an
error when an application requests this offload.
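The rule table above amounts to a small predicate; a hedged sketch
(the actual rte_eth_dev_is_keep_crc() helper may differ):

  #include <errno.h>
  #include <rte_ethdev.h>

  /* 1: keep the CRC, 0: strip it, negative: invalid request */
  static int
  keep_crc(uint64_t rx_offloads)
  {
      int strip = !!(rx_offloads & DEV_RX_OFFLOAD_CRC_STRIP);
      int keep = !!(rx_offloads & DEV_RX_OFFLOAD_KEEP_CRC);

      if (strip && keep)
          return -EINVAL; /* setting both is invalid */
      return !strip;      /* no flag at all means keep */
  }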
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Allain Legacy <allain.legacy@windriver.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Multi-Packet Receive Queue receives multiple packets on a single large
buffer. The number of consumed strides in the CQE is accumulated to
keep track of the current stride index. However, it is safer to
directly use the stride index in the CQE to avoid the out-of-order
situations which could possibly be caused by introducing LRO in the
future.
If Rx CQE compression is enabled, HW can be configured to store the
stride index in a mini-CQE, but this needs a newer version of the
library/driver. Therefore, since this change, MPRQ is only supported
with the newer library/driver, and the Rx hash result is not supported
if MPRQ is enabled along with Rx CQE compression.
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Split maintainer logs from user logs.
A lot of debug logs are present, providing users internal information
on how the PMD works. Such logs should not be available to them and
should thus remain available only when the PMD is compiled in debug
mode.
This commit removes some useless debug logs, moves the maintainer ones
under DEBUG, and also moves dumps into debug mode only.
Cc: stable@dpdk.org
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Multi-Packet Rx Queue (MPRQ a.k.a. Striding RQ) can further save PCIe
bandwidth by posting a single large buffer for multiple packets.
Instead of posting one buffer per packet, one large buffer is posted
in order to receive multiple packets on it. An MPRQ buffer consists of
multiple fixed-size strides and each stride receives one packet.
An Rx packet is mem-copied to a user-provided mbuf if the packet is
comparatively small; otherwise, the PMD attaches the Rx packet to the
mbuf by external buffer attachment - rte_pktmbuf_attach_extbuf(). A
mempool for the external buffers is allocated and managed by the PMD.
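A hedged sketch of the copy-vs-attach decision (the threshold, the
stride variables and the shinfo handling are assumptions):

  if (pkt_len <= max_memcpy_len) {
      /* small packet: copy the stride into the mbuf's own data room */
      rte_memcpy(rte_pktmbuf_mtod(pkt, void *), stride_addr, pkt_len);
  } else {
      /* large packet: zero-copy attach of the stride */
      rte_pktmbuf_attach_extbuf(pkt, stride_addr, stride_iova,
                                stride_len, shinfo);
  }
  pkt->pkt_len = pkt->data_len = pkt_len;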
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
This is the new design of Memory Region (MR) for mlx PMD, in order to:
- Accommodate the new memory hotplug model.
- Support non-contiguous Mempool.
There are multiple layers for MR search.
L0 looks up the last-hit entry, which is pointed to by mr_ctrl->mru
(Most Recently Used). If L0 misses, L1 looks up the address in a
fixed-size array by linear search. L0/L1 are in an inline function -
mlx5_mr_lookup_cache().
If L1 misses, the bottom-half function is called to look up the address
in the bigger local cache of the queue. This is L2 -
mlx5_mr_addr2mr_bh() and it is not an inline function. The data
structure for L2 is a binary tree.
If L2 misses, the search falls into the slowest path which takes locks in
order to access global device cache (priv->mr.cache) which is also a B-tree
and caches the original MR list (priv->mr.mr_list) of the device. Unless
the global cache is overflowed, it is all-inclusive of the MR list. This is
L3 - mlx5_mr_lookup_dev(). The size of the L3 cache table is limited and
can't be expanded on the fly due to deadlock. Refer to the comments in the
code for the details - mr_lookup_dev(). If L3 is overflowed, the list will
have to be searched directly bypassing the cache although it is slower.
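A simplified sketch of the L0/L1 fast path described above (types and
sizes are illustrative, not the PMD's exact definitions):

  #include <stdint.h>

  #define MR_CACHE_N 8

  struct mr_cache_entry { uintptr_t start, end; uint32_t lkey; };
  struct mr_ctrl {
      uint16_t mru;
      struct mr_cache_entry cache[MR_CACHE_N];
  };

  uint32_t mr_addr2mr_bh(struct mr_ctrl *ctrl, uintptr_t addr); /* L2/L3 */

  static inline uint32_t
  mr_lookup(struct mr_ctrl *ctrl, uintptr_t addr)
  {
      /* L0: most-recently-used entry */
      struct mr_cache_entry *e = &ctrl->cache[ctrl->mru];

      if (e->start <= addr && addr < e->end)
          return e->lkey;
      /* L1: linear search of the fixed-size array */
      for (uint16_t i = 0; i < MR_CACHE_N; ++i) {
          e = &ctrl->cache[i];
          if (e->start <= addr && addr < e->end) {
              ctrl->mru = i;
              return e->lkey;
          }
      }
      /* L2/L3: out-of-line bottom half (per-queue B-tree, then the
       * global device cache under locks) */
      return mr_addr2mr_bh(ctrl, addr);
  }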
If L3 misses, a new MR for the address should be created -
mlx5_mr_create(). When it creates a new MR, it tries to register as
many adjacent memsegs as possible that are virtually contiguous around
the address. This must take two locks - memory_hotplug_lock and
priv->mr.rwlock. Due to memory_hotplug_lock, no memory can be
allocated or freed inside.
In the free callback of the memory hotplug event, freed space is searched
from the MR list and corresponding bits are cleared from the bitmap of MRs.
This can fragment a MR and the MR will have multiple search entries in the
caches. Once there's a change caused by the event, the global cache
must be rebuilt and all the per-queue caches will be flushed as well.
If memory is frequently freed at run time, that may in the worst case
cause jitter in data-plane processing by incurring MR cache flushes
and rebuilds, but this is the least probable scenario.
To guarantee the best performance, it is highly recommended to use the
EAL option '--socket-mem'; the reserved memory will then be pinned and
won't be freed dynamically. It is also recommended to configure a
per-lcore cache for the Mempool. Even if there are many MRs for a
device or the MRs are highly fragmented, the Mempool cache will
greatly help to reduce misses on the per-queue caches.
'--legacy-mem' is also supported.
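For example, a pool with a per-lcore cache could be created as follows
(sizes are illustrative):

  #include <rte_mbuf.h>

  struct rte_mempool *mp = rte_pktmbuf_pool_create("rx_pool",
      8192,                        /* number of mbufs */
      256,                         /* per-lcore cache size */
      0, RTE_MBUF_DEFAULT_BUF_SIZE,
      SOCKET_ID_ANY);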
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
This patch removes current support of Memory Region (MR) in order to
accommodate the dynamic memory hotplug patch. This patch can be compiled
but traffic can't flow and HW will raise faults. Subsequent patches will
add new MR support.
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
This patch checks whether a requested offload is valid or not.
Any requested offload must be supported in the device capabilities.
Any offload is disabled by default if it is not set in the parameter
dev_conf->[rt]xmode.offloads to rte_eth_dev_configure() or
[rt]x_conf->offloads to rte_eth_[rt]x_queue_setup().
If any offload is enabled in rte_eth_dev_configure() by the
application, it is enabled on all queues, no matter whether it is
per-queue or per-port type and no matter whether it is set or cleared
in [rt]x_conf->offloads to rte_eth_[rt]x_queue_setup().
If a per-queue offload hasn't been enabled in rte_eth_dev_configure(),
it can be enabled or disabled for an individual queue in
rte_eth_[rt]x_queue_setup().
A newly added offload is one which hasn't been enabled in
rte_eth_dev_configure() and is requested to be enabled in
rte_eth_[rt]x_queue_setup(); it must be of per-queue type, otherwise
an error log is triggered.
The underlying PMD must be aware that the offloads passed to its
specific queue_setup() function only carry those newly added
per-queue offloads.
This patch performs the above checks in a common way in the rte_ethdev
layer to avoid duplicating the same checks in each underlying PMD.
It assumes that all PMDs in 18.05-rc2 have already been converted to
the offload API defined in 17.11. It also assumes that all PMDs return
correct offload capabilities in rte_eth_dev_infos_get().
At the beginning of [rt]x_queue_setup() in the underlying PMD, add
  offloads = [rt]xconf->offloads |
             dev->data->dev_conf.[rt]xmode.offloads;
to keep the same behavior as the offload API defined in 17.11 and
avoid breaking upper applications due to the offload API change.
A PMD can use the fact that the input [rt]xconf->offloads only carries
the newly added per-queue offloads to do some optimization or code
changes on top of this patch.
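In context, a hedged sketch of that merge at the top of a PMD's Rx
queue setup (the xxx_ prefix is a placeholder, the body is truncated):

  static int
  xxx_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
                     uint16_t nb_desc, unsigned int socket_id,
                     const struct rte_eth_rxconf *rxconf,
                     struct rte_mempool *mp)
  {
      /* per-queue request merged with the port-level configuration */
      uint64_t offloads = rxconf->offloads |
                          dev->data->dev_conf.rxmode.offloads;

      /* ... set up the queue honoring 'offloads' ... */
      return 0;
  }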
Signed-off-by: Wei Dai <wei.dai@intel.com>
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
rte_eth_devices[] is not shared between primary and secondary
processes, but is a static array in each process. The reverse pointer
to the device (priv->dev) is therefore invalid. Instead, priv has a
pointer to the shared data of the device:
  struct rte_eth_dev_data *dev_data;
Two macros are added:
  #define PORT_ID(priv) ((priv)->dev_data->port_id)
  #define ETH_DEV(priv) (&rte_eth_devices[PORT_ID(priv)])
Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Dump verbs flow details, including flow spec type and size, for
debugging purposes.
Signed-off-by: Xueming Li <xuemingl@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Tunnel RSS level of the flow RSS action offers the user a choice to do
the RSS hash calculation on inner or outer RSS fields. Testpmd flow
command examples:
GRE flow inner RSS:
  flow create 0 ingress pattern eth / ipv4 proto is 47 / gre / end
    actions rss queues 1 2 end level 1 / end
GRE tunnel flow outer RSS:
  flow create 0 ingress pattern eth / ipv4 proto is 47 / gre / end
    actions rss queues 1 2 end level 0 / end
Signed-off-by: Xueming Li <xuemingl@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Once a tunnel packet type (RTE_PTYPE_TUNNEL_xxx) is identified,
PKT_RX_IP_CKSUM_XXX and PKT_RX_L4_CKSUM_XXX represent the checksum
result of the inner headers; the outer L3 and L4 header checksums are
always valid as soon as a tunnel is identified. If no tunnel is
identified, PKT_RX_IP_CKSUM_XXX and PKT_RX_L4_CKSUM_XXX represent the
checksum result of the outer L3 and L4 headers.
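Interpreting the flags accordingly, as a hedged application-side
sketch:

  #include <rte_mbuf.h>

  int l4_ok = (m->ol_flags & PKT_RX_L4_CKSUM_MASK) ==
              PKT_RX_L4_CKSUM_GOOD;

  if (m->packet_type & RTE_PTYPE_TUNNEL_MASK) {
      /* tunnel identified: l4_ok refers to the inner L4 header and
       * the outer L3/L4 checksums were already validated by HW */
  } else {
      /* no tunnel: l4_ok refers to the outer L4 header */
  }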
Signed-off-by: Xueming Li <xuemingl@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
This patch introduces tunnel type identification based on flow rules.
If flows of multiple tunnel types are built on the same queue, no
tunnel type will be returned. User applications can use bits in the
flow mark as a tunnel type identifier.
Signed-off-by: Xueming Li <xuemingl@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Since its inception, the rte_flow RSS action has been relying in part on
external struct rte_eth_rss_conf for compatibility with the legacy RSS API.
This structure lacks parameters such as the hash algorithm to use, and more
recently, a method to tell which layer RSS should be performed on [1].
Given struct rte_eth_rss_conf will never be flexible enough to represent a
complete RSS configuration (e.g. RETA table), this patch supersedes it by
extending the rte_flow RSS action directly.
A subsequent patch will add a field to use a non-default RSS hash
algorithm. To that end, a field named "types" replaces the field formerly
known as "rss_hf" and standing for "RSS hash functions" as it was
confusing. Actual RSS hash function types are defined by enum
rte_eth_hash_function.
This patch updates all PMDs and example applications accordingly.
It breaks ABI compatibility for the following public functions:
- rte_flow_copy()
- rte_flow_create()
- rte_flow_query()
- rte_flow_validate()
[1] commit 676b605182a5 ("doc: announce ethdev API change for RSS
configuration")
Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Aligning Mellanox SPDX copyrights to a single format.
In addition, converting to SPDX the license files that were missed.
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
In some environments it is desirable to have the NIC perform RSS
normally on the packet regardless of the number of queues configured.
The RSS hash result that is stored in the mbuf can then be used by
the application to make decisions about how to distribute workloads
to threads, secondary processes, or even virtual machines if the
application is a virtual switch. This change to the mlx5 driver
aligns with how other drivers in the Intel family work.
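For instance, an application could distribute work using the stored
hash (a hedged sketch; nb_workers and enqueue_to_worker() are
hypothetical):

  #include <rte_mbuf.h>

  if (m->ol_flags & PKT_RX_RSS_HASH) {
      uint32_t worker = m->hash.rss % nb_workers;

      /* hand the packet to the chosen worker or secondary process */
      enqueue_to_worker(worker, m);
  }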
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Tested-by: Allain Legacy <allain.legacy@windriver.com>
These functions return int although they are not supposed to fail,
resulting in unnecessary checks in their callers.
Some return an error where the result should be a boolean.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
This change removes the need to distinguish unlocked priv_*()
functions, which are therefore renamed with an mlx5_*() prefix for
consistency. At the same time, all mlx5 functions now use a pointer to
the ETH device instead of a pointer to the PMD private data.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
In the priv struct, only the memory region needs to be protected
against concurrent access between the control plane and the data
plane.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Some empty lines have been added in the middle of the code without any
reason. This commit removes them.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Replaces all (void)foo; statements with the __rte_unused macro, except
when the variables are under #if statements.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
The query for the tunnel stateless offloads was wrongly implemented
because:
1. It was using the device id to query for the offloads.
2. It was using a compilation flag for Verbs which no longer exists.
The main reason was the lack of a proper API from Verbs.
Fix the query to use the rdma-core API. The capability returned from
rdma-core refers to both the Tx and Rx sides.
Even though there are separate caps for GRE and VXLAN, the
implementation merges them into a single flag in order to simplify the
checks on the data path.
Fixes: 43e9d9794cde ("net/mlx5: support upstream rdma-core")
Fixes: f5fde5205101 ("net/mlx5: add hardware checksum offload for tunnel packets")
Cc: stable@dpdk.org
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Xueming Li <xuemingl@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
This lays the groundwork for externalizing rdma-core as an optional
run-time dependency instead of a mandatory one.
No functional change.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
When no memory is available on the same NUMA node as the device, the
initialization of the device fails. However, the use case where the
cores and memory are on a different socket than the device is valid,
even if not optimal.
To fix this issue, this commit introduces an infrastructure to select
the socket on which to allocate the verbs objects based on the ethdev
configuration and the object type, rather than the PCI numa node.
Fixes: 1e3a39f72d5d ("net/mlx5: allocate verbs object into shared memory")
Cc: stable@dpdk.org
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Create an rte_ethdev_driver.h file and move PMD-specific APIs there.
Drivers are updated to include this new header file.
There is no change in the header content, and since ethdev.h is
included by ethdev_driver.h, nothing changes from the driver's point
of view; this is only a logical grouping of APIs. From the
application's point of view, driver-specific APIs can no longer be
accessed, and they shouldn't be.
Some PMD-specific data structures still remain in ethdev.h because
inline functions in the header use them. Those will be handled
separately.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Move the device configuration and feature capabilities to their own
structure. This structure is filled by mlx5_pci_probe(); outside of
this function it should be treated as *read only*.
This configuration struct will be used for the Tx/Rx queue setup to
select the Tx/Rx queue parameters based on the user configuration and
device capabilities.
In addition it will be used by the burst selection function to decide
on the best pkt burst to be used.
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Since the secondary process has its own devops, functions which cannot
be called by the secondary process no longer need to verify which
process is calling them.
Fixes: 87ec44ce1651 ("net/mlx5: add operations for secondary process")
Cc: stable@dpdk.org
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
mlx5_get_priv() is barely used across the driver. To avoid mixing
access methods, this function is removed entirely.
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
While the PMD avoids creating a hash Rx queue with no hash fields and
no array of queues after the port was already started, it lacks such
protection when re-creating the flows after the port restarts.
This may lead to inconsistent behavior for flows, depending on whether
they were created before or after the port start.
Fixes: 8086cf08b2f0 ("net/mlx5: handle RSS hash configuration in RSS flow")
Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
The indirection table size must be expressed in log2 to communicate
with Verbs. When the number of queues is not a power of two, the
maximum indirection table size is used but is not converted to log2,
causing memory corruption.
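Illustratively (log2above() as in mlx5_utils.h; rxqs_n and ind_tbl_max
are placeholders):

  /* non-power-of-two queue count: use the maximum table size,
   * and communicate its log2 to Verbs, not the raw size */
  unsigned int wq_n = (rxqs_n & (rxqs_n - 1)) ? ind_tbl_max : rxqs_n;
  unsigned int log_wq_n = log2above(wq_n); /* value passed to Verbs */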
Fixes: 4c7a0f5ff876 ("net/mlx5: make indirection tables shareable")
Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>