The default CPU socket ID was used while creating the Rx queue, and this
caused a creation failure if the hardware did not reside on the default
socket.
The patch sets the correct CPU socket ID for the mlx5_rxq_ctrl before
calling mlx5_rxq_create_devx_rq_resources(), which eventually calls
mlx5_devx_rq_create() with the correct CPU socket ID.
Fixes: bc5bee028e ("net/mlx5: create drop queue using DevX")
Cc: stable@dpdk.org
Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
If there is an explicit port match together with a sample action in the
same flow, the mlx5 PMD pushes the explicit port match into the prefix
subflow and uses a tag item match in the suffix subflow.
The explicit port match was translated into a source vport match, so
the sample suffix subflow lost this match after the flow split.
This patch copies the explicit port match to the sample suffix subflow,
so the latter gets the correct source vport value in the flow matcher.
Fixes: b4c0ddbfcc ("net/mlx5: split sample flow into two sub-flows")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
A flow rule with a sample action is split into two sub-flows: an
implicit tag action with a unique ID is added to the prefix sub-flow,
the suffix sub-flow uses a tag item to match on that unique ID, and
the implicit set tag action is inserted next to the sample action.
When either a PUSH VLAN or an ENCAP action preceded the sample action,
the implicit set tag action was added after the PUSH VLAN or ENCAP
actions, causing a flow creation failure because rdma-core does not
support this action order.
This patch ensures the implicit set tag action is inserted before any
PUSH VLAN or ENCAP action in the prefix sub-flow.
Fixes: 6a951567c1 ("net/mlx5: support E-Switch mirroring and jump in one flow")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
For now, only one ASO action is supported in a single flow rule, so a
flow rule with more than one ASO action should be rejected in the
validation stage.
A flow rule combining a non-shared AGE action with a COUNT action
should be treated as non-ASO, because AGE falls back to using a HW
counter rather than an ASO hit object.
Group 0 uses a HW counter for the AGE action even without a COUNT
action.
This commit rejects rules (in any group, if transfer) like:
1. group 1 pattern... / end actions age / meter / end
2. group 1 pattern... / end actions conntrack / meter / end
3. group 1 pattern... / end actions age / conntrack... / end
If AGE comes together with COUNT in the above rules, it is allowed.
Fixes: daed4b6e ("net/mlx5: use aging by counter when counter exists")
Cc: stable@dpdk.org
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Xiaoyu Min <jackmin@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
A flow rule with a sample action is split into two sub-flows, and a
tag action is implicitly added to the sample prefix sub-flow, using
the reserved metadata regC index for this tag action.
The reserved metadata regC is shared with the metering action. On a
ConnectX-5 trusted device (VF/SF), the reserved metadata regC was
invalid since the PF supported only legacy metering.
This patch adds a check of the tag index and falls back to using the
application tag if the reserved one cannot be used.
Fixes: a9b6ea45be ("net/mlx5: fix tag ID conflict with sample action")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Flow domain and direction were validated when the OF_PUSH_VLAN action
appeared in flow actions. A flow was rejected whenever this action:
- was used in the NIC domain, in ingress direction;
- was used in the FDB domain, in ingress direction, on ConnectX-5.
This validation logic rejected a valid case when the OF_PUSH_VLAN
action was used to direct traffic to a hairpin queue configured in Tx
implicit mode.
This patch moves code responsible for OF_PUSH_VLAN validation of
domain and direction from flow_dv_validate_push_vlan() to
flow_dv_validate(). Domain and direction are now validated when either
non-hairpin queue is used or hairpin queue is configured in Tx explicit
mode.
Fixes: 96f85ec489 ("net/mlx5: check VLAN push/pop support")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Disabling a meter means there is no packet drop in the meter: the
meter is always active but programmed with different CIR/CBS values.
If the user wants the meter disabled at creation, the PMD calls the
disable() API manually after the meter is initialized.
Fixes: 4443201863 ("net/mlx5: support meter creation with policy")
Cc: stable@dpdk.org
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
A DPDK application with no Rx queue configured should be supported.
In this mode, the NIC can be used to generate packets without
receiving any ingress traffic.
In the current implementation, when no Rx queue is specified, the
array storing the queue pointers is NULL after allocation, and the
check of this array allocation prevents the application from starting
up.
By adding another condition checking the Rx queue number, an
application with no Rx queue can start up successfully.
Fixes: 4cda06c3c3 ("net/mlx5: split Rx queue into shareable and private")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
E-Switch DV flow is supported only when DV flow is supported and
enabled.
The mlx5_shared_dev_ctx_args_config() function ensures that when the
environment does not support DV, the "dv_esw_en" flag is turned off.
However, when the environment supports DV but the user has requested
to disable it, the "dv_esw_en" flag remains on and causes the PMD to
try to create an E-Switch through the Verbs engine.
This patch adds a check to ensure that the "dv_esw_en" flag is turned
off when DV flow is disabled.
Fixes: a13ec19c19 ("net/mlx5: add shared device context config structure")
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When using the Verbs flow engine to create flows, the GRE Verbs spec
was put at the end of the specs list. This created problems for flows
matching MPLSoGRE packets: in the generated specs list, the MPLS spec
was put before the GRE spec, but the Verbs API requires that the MPLS
spec be put at its exact location in the protocol stack.
This patch fixes this behavior. Space for the GRE Verbs spec is
reserved at its exact location, the MPLS Verbs spec is inserted at its
exact location as well, and the GRE spec is filled after all flow
items are parsed.
Fixes: 985b479267 ("net/mlx5: fix GRE protocol type translation for Verbs")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Flex item availability is restricted to BlueField-2 and BlueField-3
PF ports.
The patch validates port type compliance before proceeding to
flex item creation.
Fixes: db25cadc08 ("net/mlx5: add flex item operations")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Meter policy creation does not belong to the flow rule creation
process, so the thread workspace was not initialized there, and using
it triggered an assertion failure.
This patch removes the incorrect use of the thread workspace in meter
policy creation and adds a flag in the policy instead. When a flow
rule is created, this flag is used to set the mark flag in the thread
workspace.
Fixes: 082becbf1f ("net/mlx5: fix mark enabling for Rx")
Cc: stable@dpdk.org
Signed-off-by: Shun Hao <shunh@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
In the previous implementation, a counter was used to record the
number of references to a table resource, including the creation of
the table, jumps to the table, and the matchers created on the table.
Before releasing the table resource via the driver, it had to be
ensured that there was no reference to this table.
After the optimization of the resource management, the reference count
is now kept in the hash list entry as a unified solution for all
resource management.
There is no need to keep the "refcnt" in the table resource structure,
so it is removed to avoid unnecessary memory overhead.
Fixes: afd7a62514 ("net/mlx5: make flow table cache thread safe")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Certain flow rules containing a modify header action for an L4 port
could be erroneously rejected as invalid, because this action
was counted as consuming two HW actions, while it only requires one.
Fixes: 72a944dba1 ("net/mlx5: fix header modify action validation")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When an indirection table object is modified, it updates the reference
counter for each Rx queue related to it.
The reference counters for regular queues are indeed updated. However,
the reference counters for external RxQs are not.
This patch adds the update for external RxQs too.
Fixes: 311b17e669 ("net/mlx5: support queue/RSS actions for external Rx queue")
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When an indirection table is destroyed, each Rx queue related to it
should be dereferenced.
The regular queues are indeed dereferenced; however, the external RxQs
are not.
This patch adds the dereferencing for external RxQs too.
Fixes: 311b17e669 ("net/mlx5: support queue/RSS actions for external Rx queue")
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When E-Switch mode was enabled, NIC egress flows were implicitly
appended with a source vport match. If the metadata register C0 was
used to maintain the source vport, it was initialized to zero on
packet steering engine entry, so a flow could be hit only if the
source vport was zero; the register C0 of the packet was not valid to
match on the Tx side, which caused egress flow misses.
This patch:
- removes the implicit source vport match for NIC egress flows;
- rejects NIC egress flows on the representor ports at validation;
- allows internal NIC egress flows containing TX_QUEUE items in
order not to impact hairpins.
Fixes: ce777b147b ("net/mlx5: fix E-Switch flow without port item")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
When both shared and non-shared RSS actions are present in a single
flow rule, the shared RSS index is unset by mistake.
For example:
1. flow indirect_action 0 create action_id 3 ingress action RSS ...
2. set sample_actions 0 mark id 43690 / queue index 0 / end
3. flow create 0 ingress group 107 pattern eth / sample ratio 2
index 0 / indirect 3 / end
The PMD translates the indirect action to a shared RSS description at
first. In the split prefix flow, RSS->shared_RSS is unset when
translating the sample queue action, so the suffix flow treats the RSS
as non-shared.
Fixes: 8e61555657 ("net/mlx5: fix shared RSS and mark actions combination")
Cc: stable@dpdk.org
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
RSS expansion scheme has 2 operational modes: default and specific.
The default mode expands into all valid options for a given network
layer. For example, Ethernet expands by default into VLAN, IPv4 and
IPv6, L3 expands into TCP and UDP, etc.
The specific mode expands according to flow item next protocol
configuration provided by the item spec and mask parameters.
There are 3 outcomes for the specific expansion:
1. Back to default: the result of (spec & mask) allows all
possibilities.
For example: eth type mask 0 type spec 0
2. No result: the item configuration has no valid expansion.
For example: eth type mask 0xffff type spec 101
3. Direct: the flow item mask and spec configuration yield a valid
expansion option.
For example: eth type mask 0x0fff type spec 0x0800.
The PMD expanded flow items with explicit spec and mask configuration
into the Direct (3) or No result (2) outcomes, while Default
expansions (1) were handled as No result.
Fixes: f3f1f576f4 ("net/mlx5: fix RSS expansion with explicit next protocol")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The flex item API provides support for network headers with fixed and
variable lengths.
When the PMD compiles a new flex item object configuration, it
converts the RTE parameters into matching PMD PARSE_GRAPH parameters
and checks the parameter values against the port capabilities.
The current implementation mismatched the PARSE_GRAPH configuration
fields for fixed-size headers.
Fixes: b293e8e49d ("net/mlx5: translate flex item configuration")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
On a TCP/IP layered network, ICMP is considered and implemented as
part of the layer 3 IP protocol. In practice, it is a user of the IP
protocol and must be encapsulated within IP packets; there is no
layer 4 protocol over ICMP.
A rule with a layer 4 pattern should be matched prior to a rule with
only a layer 3 pattern when:
1. Both rules are created in the same table
2. Both rules could be hit
3. The rules have the same priority
The steering result of a packet is nondeterministic if there are rules
with patterns IP and IP+ICMP in the same table with the same priority.
Like TCP / UDP, a packet should hit the rule with the longer matching
criterion.
By treating the priority of ICMP/ICMPv6 as a layer 4 priority
internally in the PMD, IP+ICMP is hit prior to IP only.
Fixes: d53aa89aea ("net/mlx5: support matching on ICMP/ICMP6")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Reduce the flex item flow handle size from 32 bits to 8 bits for each
flow.
The patch saves memory in setups with millions of flows.
Fixes: a23e9b6e3e ("net/mlx5: handle flex item in flows")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
GRE item translation must set the inner protocol value. For that
reason, the item is not translated in place when the PMD translation
iterates over flow items, but after the loop, when all inner types are
discovered.
Since the PMD does not translate the GRE flow item inside the
translation loop, it must save the GRE item for access outside the
loop.
Fixes: 985b479267 ("net/mlx5: fix GRE protocol type translation for Verbs")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The AGE action can be implemented by either counters or the ASO
mechanism. ASO is more efficient than generating counters just for the
purpose of aging, so when ASO is supported its use is preferable. On
the other hand, when there is a count in the list of actions, the
counter is already generated, and it is best to use it for aging even
if ASO is supported.
However, when the count action is "indirect", it cannot be used for
aging since it may be updated by other flow rules in which it
participates.
Checking whether ASO is supported depends on both the capability of the
device and the flow rule group number, ASO is not supported for group 0.
However, the flow_dv_validate() function only checks the capability and
ignores the group, allowing inadmissible flow rules.
For example, when the device supports ASO and a flow rule is set that
combines an indirect counter with aging for group 0, the rule should be
rejected, but it is created and does not function properly.
This patch updates the counter validation to also consider the group
number when deciding whether ASO is supported.
Fixes: daed4b6e3d ("net/mlx5: use aging by counter when counter exists")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The AGE action can be implemented by either counters or the ASO
mechanism.
When the user requests a count action in the flow rule, the AGE action
is implemented using the same counter. However, if the user requests
an indirect count action, it cannot be used for AGE.
The flow_dv_validate() function has a flag named "shared_count" which
indicates whether AGE action validation depends on ASO support or not.
This flag is initialized to false and is updated if there is an
indirect count action in the action list.
This flag was mistakenly initialized within the loop that reads the
action list, so in each iteration it was reinitialized to false,
regardless of the existence of an indirect count action in the list.
This patch moves the flag initialization out of the loop.
Fixes: f3191849f2 ("net/mlx5: support flow count action handle")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The table remove callback function was trying to destroy the matchers
list associated with table entries without checking whether the list
is valid, which caused a null pointer dereference.
Fixed by validating the matchers list before destroying it.
The issue can be reproduced with testpmd on Windows by running:
port close all
Fixes: 1872635570 ("net/mlx5: make matcher list thread safe")
Cc: stable@dpdk.org
Signed-off-by: Adham Masarwah <adham@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Tal Shnaiderman <talshn@nvidia.com>
Tested-by: Idan Hackmon <idanhac@nvidia.com>
For an indexed pool with a local cache, when a new trunk is allocated,
half of the trunk's indices are fetched into the local cache. If the
local cache size was less than half of the trunk size, a memory
overlap happened.
This commit adds a check of the fetch size: if the local cache size is
less than the fetch size, the fetch size is adjusted to the local
cache size.
Fixes: d15c0946be ("net/mlx5: add indexed pool local cache")
Cc: stable@dpdk.org
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Link status change takes time that depends on the HW and the kernel.
It was checked immediately after the change was issued at probing.
If the port had been down before probing, a "down" state might be
read, while the port would come "up" imminently.
After that, DPDK mistakenly reported the port as "down",
and "ifconfig $DEV up" did not trigger an LSC event,
because from the system's perspective the port was "up" already.
Install the Netlink event handler at port probe, before requesting the
port to come up, in order to receive the LSC event even if the port
comes up between probe and start.
Fixes: a85a606ca5 ("net/mlx5: fix link status initialization")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Sometimes net/mlx5 devices did not detect link status change to "up".
Each shared device was monitoring IBV_EVENT_PORT_{ACTIVE,ERR}
and queried the link status upon receiving the event.
IBV_EVENT_PORT_ACTIVE is delivered when the logical link status
(UP flag) is set, but the physical link status (RUNNING flag)
may be down at that time, in which case the new link status
would be erroneously considered down.
The IBV interface is insufficient for the task, so monitor interface
events using Netlink instead.
Fixes: 198a3c339a ("mlx5: handle link status interrupts")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Introduce mlx5_nl_read_events() to read Netlink events
(technically, messages) from a socket that was configured
to listen for them via a new mlx5_nl_init() parameter.
Add mlx5_nl_parse_link_status_update() helper
to extract information from link-related events.
This patch is a shared base for later fixes.
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Add support for queue/RSS actions on external Rx queues.
In indirection table creation, the queue index is taken from the
mapping array.
This feature supports neither LRO nor hairpin.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
An external queue is a queue that has been created and managed outside
the PMD. The queue's owner might use the PMD to generate flow rules
using these external queues.
When the queue is created in hardware, it is given an ID represented
by 32 bits. In contrast, the index of the queues in the PMD is
represented by 16 bits. To enable the use of the PMD to generate flow
rules, the queue owner must provide a mapping between the HW index and
a 16-bit index corresponding to the ethdev API.
This patch adds an API enabling to insert/remove a mapping between a
HW queue ID and an ethdev queue ID.
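A minimal usage sketch, assuming the rte_pmd_mlx5.h API added by this
patch (the HW object ID is illustrative; external indices are assumed
to start at RTE_PMD_MLX5_EXTERNAL_RX_QUEUE_ID_MIN):

    #include <stdint.h>
    #include <rte_pmd_mlx5.h>

    /* Expose an externally created RQ (32-bit hw_idx) as a 16-bit
     * ethdev queue index, so flow rules can reference it.
     */
    static int
    use_external_rxq(uint16_t port_id, uint32_t hw_idx)
    {
            uint16_t dpdk_idx = RTE_PMD_MLX5_EXTERNAL_RX_QUEUE_ID_MIN;
            int ret;

            ret = rte_pmd_mlx5_external_rx_queue_id_map(port_id, dpdk_idx,
                                                        hw_idx);
            if (ret != 0)
                    return ret;
            /* ... create flow rules with queue/RSS action on dpdk_idx ... */
            return rte_pmd_mlx5_external_rx_queue_id_unmap(port_id, dpdk_idx);
    }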
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The RxQ/TxQ control structure has a field named "type". This field is
an enum with values for standard and hairpin, and its only use is to
check whether the queue is of the hairpin type or standard.
This patch replaces it with a boolean variable that indicates whether
the queue is a hairpin queue.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
This patch adds matching on the optional fields (checksum/key/sequence)
of the GRE header. Matching on the checksum and sequence fields
requires rdma-core support with the misc5 capability and tunnel_header
0-3.
For patterns without checksum and sequence specified, keep using misc
for matching as before; for patterns with checksum or sequence,
validate the capability first and then use misc5 for the matching.
Signed-off-by: Sean Zhang <xiazhang@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The HW steering header reformat action can work in bulk mode. In this
case, when the table is created, a bulk of header reformat actions is
allocated at the low level. Afterwards, when a flow is created, simply
specifying the action index in the bulk and the encapsulation data for
the action is enough.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
HW steering can support indirect actions as well. With an indirect
action, the flow can be created with a more flexible shared RSS action
selection. This saves creating a separate action template for each
different RSS action.
This commit adds the flow queue operation callbacks for the following
(a usage sketch follows the list):
rte_flow_async_action_handle_create();
rte_flow_async_action_handle_destroy();
rte_flow_async_action_handle_update();
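An illustrative sketch of the first callback, assuming the async
rte_flow API from this release (queue ID, RSS types and the omitted
error handling are application choices):

    #include <rte_ethdev.h>
    #include <rte_flow.h>

    /* Enqueue creation of a shared RSS action on flow queue 0. */
    static struct rte_flow_action_handle *
    create_shared_rss(uint16_t port_id, struct rte_flow_error *error)
    {
            const struct rte_flow_op_attr op_attr = { .postpone = 0 };
            const struct rte_flow_indir_action_conf conf = { .ingress = 1 };
            static const struct rte_flow_action_rss rss = {
                    .types = RTE_ETH_RSS_IP,
            };
            const struct rte_flow_action action = {
                    .type = RTE_FLOW_ACTION_TYPE_RSS,
                    .conf = &rss,
            };

            return rte_flow_async_action_handle_create(port_id, 0, &op_attr,
                                                       &conf, &action,
                                                       NULL /* user_data */,
                                                       error);
    }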
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The mark action is implemented internally by a tag action. When it is
added, the HW adds a tag to the packet. The mark value can be set as
fixed or dynamic, as the action mask indicates.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
This commit adds the queue and RSS actions. Similar to the jump
action, dynamic ones will be added to the action construct list.
Since the queue and RSS actions in a template should not be destroyed
during port restart, the actions are created with a standalone
indirect table, as indirect actions are. When the port stops, the
indirect table is detached from the action; when the port starts, it
is attached back to the action.
One more change is made to accelerate the action creation. Currently
the mlx5_hrxq_get() function returns the object index instead of the
object pointer. This requires an extra conversion from index to object
by calling mlx5_ipool_get() in most cases, and that extra conversion
hurts multi-thread performance since mlx5_ipool_get() takes the global
lock internally. As the hash Rx queue object itself also contains the
index, returning the object directly achieves better performance
without the global lock.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The jump action connects different levels of flow tables and allows
packet handling in a chain of flows.
A new action construct data struct is also added in this commit to
help handle not only the dynamic jump action but also other generic
dynamic actions. An action with an empty mask configuration is a
dynamic action, and the dedicated action will be created with the flow
action configuration during flow creation. In that dynamic action
case, the action is appended to the table template's action list
during table creation.
When creating flows, the action list is traversed, the dynamic action
configuration details are picked from the flow actions as the action
construct data struct describes, and then the dedicated dynamic
actions are created.
This commit adds the jump action and the generic dynamic action
construct mechanism.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
When the port is stopped, all created flows should be flushed.
This commit adds a flow flush helper function.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
HW steering uses an asynchronous queue-based flow rule management
mechanism. The matcher and part of the actions have been prepared
during flow table creation. Some remaining actions will be constructed
during flow creation if needed.
A flow postpone attribute bit describes whether flow management should
be applied to the HW directly. An extra push function is provided to
force push all the cached flows to the HW.
Once the flow has been applied to the HW, the pull function will be
called to get the results of the queued creation/destruction
operations.
The DR rule flow memory is managed in the PMD layer instead of being
allocated from the HW steering layer. When destroying a flow, the flow
rule memory can only be freed after the CQE is received.
A HW queue job descriptor is introduced to convey the flow information
and operation type between flow insertion/destruction and the pull
function.
This commit adds the basic flow queue operations listed below (a usage
sketch follows the list):
rte_flow_async_create();
rte_flow_async_destroy();
rte_flow_push();
rte_flow_pull();
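An illustrative sketch of the expected call flow, assuming the
template table, pattern items and actions were created beforehand and
are passed in by the application:

    #include <rte_common.h>
    #include <rte_flow.h>

    static int
    insert_rule(uint16_t port_id, struct rte_flow_template_table *table,
                const struct rte_flow_item pattern[],
                const struct rte_flow_action actions[])
    {
            struct rte_flow_op_attr op_attr = { .postpone = 0 };
            struct rte_flow_op_result res[32];
            struct rte_flow_error error;
            struct rte_flow *flow;

            /* Enqueue the rule on flow queue 0, then push it to the HW. */
            flow = rte_flow_async_create(port_id, 0, &op_attr, table,
                                         pattern, 0, actions, 0,
                                         NULL /* user_data */, &error);
            if (flow == NULL)
                    return -1;
            rte_flow_push(port_id, 0, &error);
            /* Poll the completions of the queued operations. */
            return rte_flow_pull(port_id, 0, res, RTE_DIM(res), &error);
    }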
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
A flow table is a group of flows with the same matching criteria and
the same actions defined for them. The table defines rules that have
the same matching fields but different matching values. For example,
when matching on a 5-tuple, the table will be
(IPv4 source + IPv4 dest + s_port + d_port + next_proto)
while the values for each rule will be different.
The templates' relevant matching criteria and action instances are
created during table creation and saved in the table. As the table
attributes indicate the supported flow number, the flow memory is also
allocated at the same time.
This commit adds the table management functions.
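An illustrative sketch, assuming one pattern template (pt) and one
actions template (at) were created earlier; the group, domain and
table size are arbitrary examples:

    #include <rte_flow.h>

    static struct rte_flow_template_table *
    create_table(uint16_t port_id, struct rte_flow_pattern_template *pt,
                 struct rte_flow_actions_template *at,
                 struct rte_flow_error *error)
    {
            /* Table sized for 64K rules in ingress group 1. */
            const struct rte_flow_template_table_attr attr = {
                    .flow_attr = { .group = 1, .ingress = 1 },
                    .nb_flows = 1 << 16,
            };

            return rte_flow_template_table_create(port_id, &attr,
                                                  &pt, 1, &at, 1, error);
    }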
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The action template holds a list of action types that will be used
together in the same rule. The template's action instances are created
only when the template is bound to a dedicated group, and the created
actions are saved per individual group for best performance. The
actions in a group are not shared with each other unless shared
actions are specified.
This commit adds the action template management which stores the
flow action template.
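An illustrative sketch of creating such a template: a fully masked
MARK whose value is fixed at template creation time, and a dynamic
QUEUE (empty mask) whose value is provided per rule:

    #include <stdint.h>
    #include <rte_flow.h>

    static struct rte_flow_actions_template *
    create_actions_template(uint16_t port_id, struct rte_flow_error *error)
    {
            static const struct rte_flow_action_mark mark = { .id = 0x1234 };
            static const struct rte_flow_action_mark mark_mask = {
                    .id = UINT32_MAX,
            };
            const struct rte_flow_action actions[] = {
                    { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
                    { .type = RTE_FLOW_ACTION_TYPE_QUEUE }, /* dynamic */
                    { .type = RTE_FLOW_ACTION_TYPE_END },
            };
            const struct rte_flow_action masks[] = {
                    { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark_mask },
                    { .type = RTE_FLOW_ACTION_TYPE_QUEUE }, /* empty mask */
                    { .type = RTE_FLOW_ACTION_TYPE_END },
            };
            const struct rte_flow_actions_template_attr attr = {
                    .ingress = 1,
            };

            return rte_flow_actions_template_create(port_id, &attr, actions,
                                                    masks, error);
    }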
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The pattern template defines flows that have the same matching fields
but different matching values.
For example, matching on a 5-tuple TCP flow, the template will be
(eth(null) + IPv4(source + dest) + TCP(s_port + d_port)) while the
values for each rule will be different.
Since a pattern template can be used in different domains, the items
are only cached at the pattern template creation stage; when the
template is bound to a dedicated table, the HW criteria are created
and saved to the table. Pattern templates can be used by multiple
tables, but different tables create the same criteria and do not share
the matcher with each other, in order to have better performance.
This commit adds pattern template management.
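An illustrative sketch of the 5-tuple TCP template described above (a
zeroed ETH mask matches any Ethernet header; the IPv4/TCP masks select
the fields each rule provides values for):

    #include <stdint.h>
    #include <rte_flow.h>

    static struct rte_flow_pattern_template *
    create_pattern_template(uint16_t port_id, struct rte_flow_error *error)
    {
            static const struct rte_flow_item_eth eth_mask; /* all-zero */
            static const struct rte_flow_item_ipv4 ipv4_mask = {
                    .hdr = { .src_addr = UINT32_MAX, .dst_addr = UINT32_MAX },
            };
            static const struct rte_flow_item_tcp tcp_mask = {
                    .hdr = { .src_port = UINT16_MAX, .dst_port = UINT16_MAX },
            };
            const struct rte_flow_item pattern[] = {
                    { .type = RTE_FLOW_ITEM_TYPE_ETH, .mask = &eth_mask },
                    { .type = RTE_FLOW_ITEM_TYPE_IPV4, .mask = &ipv4_mask },
                    { .type = RTE_FLOW_ITEM_TYPE_TCP, .mask = &tcp_mask },
                    { .type = RTE_FLOW_ITEM_TYPE_END },
            };
            const struct rte_flow_pattern_template_attr attr = {
                    .ingress = 1,
            };

            return rte_flow_pattern_template_create(port_id, &attr, pattern,
                                                    error);
    }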
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Hardware steering is the backend that supports the rte_flow_async API
in the mlx5 PMD. The port configuration function creates the queues
and the needed flow management resources.
The PMD layer configuration function allocates the queues' context
and a per-queue job descriptor pool. The job descriptor pool size
is equal to the queue size, and the job descriptors are popped
from the pool with a LIFO strategy to convey the flow information
during flow insertion/destruction. Then, while polling the queued
operation result, the flow information is extracted from the job
descriptor and the descriptor is pushed back to the LIFO pool.
The commit creates the flow port queues and the job descriptor pools.
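At the API level this corresponds to rte_flow_configure(); an
illustrative sketch (the queue count, queue size and counter number
are arbitrary examples):

    #include <rte_flow.h>

    /* One flow queue of 64 entries; rte_flow_configure() must precede
     * any template/table creation on the port.
     */
    static int
    configure_flow_engine(uint16_t port_id, struct rte_flow_error *error)
    {
            const struct rte_flow_port_attr port_attr = {
                    .nb_counters = 1024,
            };
            const struct rte_flow_queue_attr queue_attr = { .size = 64 };
            const struct rte_flow_queue_attr *queue_attrs[] = { &queue_attr };

            return rte_flow_configure(port_id, &port_attr, 1, queue_attrs,
                                      error);
    }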
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The new hardware steering engine relies on using dedicated steering
WQEs instead of writing to the low-level steering table entries
directly. In the first implementation, the hardware steering engine
supports only the new queue-based flow API; the existing synchronous
non-queue-based flow API is not supported.
A new dv_flow_en value 2 is added to manage mlx5 PMD steering engine:
dv_flow_en rte_flow API rte_flow_async API
------------------------------------------------
0 support not support
1 support not support
2 not support support
This commit introduces the extra dv_flow_en=2 value to select the new
flow initialization and management operation routines.
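For example, HW steering would be engaged with a devarg like the
following (the PCI address is a placeholder):

    dpdk-testpmd -a 0000:08:00.0,dv_flow_en=2 -- -i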
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The HW steering low-level implementation will be added later in
another patch series. To avoid linkage issues, an abstract stub
replacement is provided for now.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The ConnectX steering is a lookup hardware mechanism that accesses flow
tables, matches packets to the rules, and performs specified actions.
Historically, the mlx5 PMD implements several software engines to
manage the steering hardware facility:
- FW Steering - Verbs/Direct Verbs, uses FW calls to manage flows
- SW Steering - DevX/mlx5dv, uses WQEs to access table memory directly
However, there are still some disadvantages:
- performance is limited, as we must invoke the firmware either to
manage the entire flow, or to handle some internal steering objects
- organizing and preparing the flow infrastructure (actions, matchers,
groups, etc.) at flow insertion time inevitably causes slow flow
insertion
- security: exposing the low-level steering entries directly to
userspace may cause security risks
A new hardware WQE-based steering operation with codename "HW
Steering" is introduced to get rid of the security risks. It takes
advantage of the recently introduced async queue-based rte_flow APIs
to prepare everything in advance and achieve a high insertion rate.
In this new HW steering engine, the original SW steering rte_flow API
is not supported in the first implementation; only the new async
queue-based flow operations are going to be supported. A new steering
mode parameter for dv_flow_en is introduced so the user can engage the
new steering engine.
This commit adds the basic driver operation.
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Hardware starting from ConnectX-7 supports waiting until a specified
moment of time with the newly introduced wait descriptor. A timestamp
can be placed directly into the descriptor and pushed to the send
queue. Once the hardware encounters the wait descriptor, the queue
operation is suspended until the specified moment of time. This patch
updates the Tx datapath to handle this new hardware wait capability.
The PMD documentation and release notes are updated accordingly.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The wait-on-time configuration flag is copied to the Tx queue
structure for performance considerations. The timestamp mask is
prepared and stored in the queue structure as well.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The "tx_db_nc" devarg forces doorbell register mapping to non-cached
region eliminating the extra write memory barrier. This argument was
used in creating the UAR for Tx and thus affected its performance.
Recently [1] its use has been extended to all UAR creation in all mlx5
drivers, and now its name is no longer so accurate.
This patch changes its name to "sq_db_nc" to suit any send queue that
uses it. The old name will still work for backward compatibility.
[1] commit 5dfa003db5 ("common/mlx5: fix post doorbell barrier")
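For example, both forms below select the non-cached doorbell mapping
(the PCI address is a placeholder; the second uses the legacy name):

    dpdk-testpmd -a 0000:08:00.0,sq_db_nc=1 -- -i
    dpdk-testpmd -a 0000:08:00.0,tx_db_nc=1 -- -i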
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Reviewed-by: Raslan Darawsheh <rasland@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
One of the E-Switch vports plays a special role: it is assigned as the
"E-Switch manager" and has some exclusive rights and duties - it
maintains all the representors, manages FDB domain flows, etc. By
default, the E-Switch manager vport index was supposed to be zero on
standalone NICs (regular ConnectX) and 0xFFFE on SmartNICs
(BlueField), but that was not always correct - this index can be
assigned any value by the kernel/hypervisor.
Currently, the E-Switch manager vport ID is assumed to be the default:
0 for standalone NICs and 0xFFFE for SmartNICs, deduced from the
device PCI ID.
To handle this without assuming any default values, the DevX API can
be used to query the E-Switch manager vport ID directly from the
firmware during initialization, and that value is used by default. If
the new method is not provided (legacy firmware), fall back to the PCI
ID approach.
Fixes: a564038699 ("net/mlx5: support E-Switch manager egress traffic match")
Cc: stable@dpdk.org
Signed-off-by: Shun Hao <shunh@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Recently, shared RxQs have been introduced. All shared Rx queues with
the same group and queue ID share the same rxq_ctrl, but each one has
its own mlx5_rxq_priv structure.
The mlx5_rx_queue_setup function generates a new rxq_priv structure
and looks for a rxq_ctrl structure to refer to. If a compatible
rxq_ctrl structure already exists, it is referenced; otherwise the
mlx5_rxq_new function is called to generate a new one.
This patch makes the mlx5_rxq_new function "standalone": it generates
a rxq_ctrl structure regardless of any specific rxq_priv structure.
All operations on the rxq_ctrl structure that depend on the new
rxq_priv structure are performed in the mlx5_rx_queue_setup function,
at the same place for either a new or an existing rxq_ctrl structure.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The mlx5_rxq_new function creates the control structure and, if it
belongs to a shared group, inserts it into the shared RxQs list.
After that, some validations are performed and, in case they fail, the
RxQ control object is released.
In these cases, an invalid pointer to the object is still in the list,
and accessing it may cause a crash.
Move the list insertion to the end of the function, where the RxQ
control object is surely valid.
Fixes: 09c2555303 ("net/mlx5: support shared Rx queue")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Previously, the API flow_dv_query_count_ptr was defined to get a
counter's action pointer. This DV function was called directly, while
the better way is through a callback.
Add an argument to the mlx5_counter_query API and the related
counter_query callback for the counter's action pointer.
Signed-off-by: Haifei Luo <haifeil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
If the meter policy action was RSS, the correct items were not
provided for sub-policy creation.
This patch fixes the issue by providing the original items in the
meter split, so the sub-policy creation gets the correct items.
Fixes: 3c481324ba ("net/mlx5: fix meter flow direction check")
Cc: stable@dpdk.org
Signed-off-by: Shun Hao <shunh@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The mlx5_l3t_prepare_entry() function is not used anymore.
This commit removes it.
Fixes: 92ef4b8f16 ("ethdev: remove deprecated shared counter attribute")
Cc: stable@dpdk.org
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
When mlx5_hlist_create() failed, the rte_flow_error was not filled
with the corresponding error information.
This commit adds the missing rte_flow_error_set() for the failure
case.
Fixes: f3020a331d ("net/mlx5: optimize hash list table allocate on demand")
Cc: stable@dpdk.org
Signed-off-by: Suanming Mou <suanmingm@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Improve the devargs handling in two aspects:
- Parse the devargs string only once.
- Return an error and report unknown keys.
The common driver parses the devargs string into a dictionary once,
then provides it to all the drivers' probe functions. Each driver
marks within it which keys it has used, then the common driver
receives the updated dictionary and reports unknown devargs.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Add a configuration structure for the port (ethdev). This structure
contains all configurations coming from devargs which are oriented to
the port. It is a field of the mlx5_priv structure and is updated in
the spawn function for each port.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Add an inline function indicating whether HW object operations can be
created by DevX. It makes the code more readable.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Add a configuration structure for the shared device context. This
structure contains all configurations coming from devargs which are
oriented to the device. It is a field of the shared device context
(SH) structure and is updated once in the mlx5_alloc_shared_dev_ctx()
function.
This structure cannot be changed when probing again, so a function is
added to prevent that. The mlx5_probe_again_args_validate() function
creates a temporary IB context configuration structure according to
the new devargs attached in the probe again, then checks the match
between the temporary structure and the existing IB context
configuration structure.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Move all device configuration to be performed by the
mlx5_os_cap_config() function instead of the spawn function.
In addition, move all relevant fields from the mlx5_dev_config
structure to mlx5_dev_cap.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Rearrange the mlx5_os_get_dev_attr() function in such a way that it
first executes the queries and only then updates the fields.
In addition, its name is changed in preparation for expanding its
operations to configure the capabilities inside it.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
This patch adds to the SH structure a flag which indicates whether
E-Switch mode is enabled.
"dv_esw_en" is configured from devargs and is enabled only in E-Switch
mode. So, once "dv_esw_en" has been configured, it is enough to check
whether it is valid.
This patch also removes the E-Switch mode check wherever "dv_esw_en"
is already checked.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The mlx5_flow_counter_mode_config function exists for both Linux and
Windows with the same name and content.
This patch moves its implementation to the folder shared between the
operating systems, removing the duplication.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The realtime timestamp configuration works the same for Linux and
Windows.
This patch moves it to a function implemented in the folder shared
between the operating systems, removing the duplication.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The check whether a device is a VF works the same for Linux and
Windows.
This patch moves it to a function implemented in the folder shared
between the operating systems, removing the duplication.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The sharing device context structure has a field named "device_attr"
which is filled by the mlx5_os_get_dev_attr() function.
The spawn function calls mlx5_os_get_dev_attr() again and saves the
result to a local variable identical to the "device_attr" field.
There is no need for this duplication, because the spawn function has
a reference to the sharing device context structure.
This patch removes the local "device_attr" from the spawn function and
uses the context's field instead.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The sharing device context structure has a field named "devx" which
indicates whether DevX is supported.
The common configuration structure also has a field named "devx" with
the same meaning.
There is no need for this duplication, because there is a reference to
the common structure from within the sharing device context structure.
This patch removes it from the sharing device context structure and
uses the common config structure instead.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The HCA attribute structure is a field of the net configuration
structure. It is also a field of the common configuration structure.
There is no need for this duplication, because there is a reference to
the common structure from within the net structures.
This patch removes it from the net configuration structure and uses
the common config structure instead.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The device arguments are parsed and updated twice during spawning:
first before creating the shared device context, and again later after
updating a default value of one of the arguments.
This patch consolidates them into a single parsing pass and updates
the default values before it.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
These 4 functions are implemented in the mlx5_ethdev.c file:
- mlx5_dev_infos_get
- mlx5_fw_version_get
- mlx5_dev_set_mtu
- mlx5_hairpin_cap_get
In the mlx5.h file they are declared twice: first under the mlx5.c
section and a second time under the mlx5_ethdev.c section.
This patch removes the redundant declarations.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The mlx5_alloc_shared_dev_ctx() function has a local variable named
"err" which contains the errno value in case of failure.
When functions called by this function fail, this variable is updated
with their return value (which should be a positive errno value).
However, some functions do not update the errno value themselves, or
return a negative errno value. If one of them fails, the "err"
variable contains a negative value, which causes an assertion failure.
This patch updates all functions used by the mlx5_alloc_shared_dev_ctx()
function to update rte_errno, and takes this value instead of the
"err" value.
Fixes: 5dfa003db5 ("common/mlx5: fix post doorbell barrier")
Fixes: 5d55a494f4 ("net/mlx5: split multi-thread flow handling per OS")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The ASO connection tracking structure is initialized once per sharing
device context.
Its release takes place in the close function, which is called for
each ethdev individually. That is, when there is more than one ethdev
under the same sharing device context, the structure is destroyed when
one of them is closed; if another one wants to use it later, it may
crash.
In addition, the creation of this structure is performed in the spawn
function. If the creation of one of the objects following it fails,
the structure is supposed to be destroyed, but this does not happen.
This patch moves its release to the sharing device context free
function and thus solves both problems.
Fixes: 0af8a2298a ("net/mlx5: release connection tracking management")
Fixes: ee9e5fad03 ("net/mlx5: initialize connection tracking management")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
In "dv_xmeta_en" devarg there is an option of dv_xmeta_en=3 which
engages tunnel offload mode. In E-Switch configuration, that mode
implicitly activates dv_xmeta_en=1.
The update according to E-switch support is done immediately after the
first parsing of the devargs, but there is another adjustment later.
This patch moves the adjustment after the second parsing.
Fixes: 4ec6360de3 ("net/mlx5: implement tunnel offload")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The mlx5 net driver supports "probe again". In probing again, it
creates a new ethdev under an existing InfiniBand device context.
Sibling devices sharing an InfiniBand device context should have
compatible configurations, so some of the devargs given in the probe
again, the ones that are mainly relevant to the sharing device
context, are sent to the mlx5_dev_check_sibling_config function, which
makes sure they are compatible with its siblings.
However, the arguments are adjusted according to the capability of the
device, and the function compared the arguments of the probe again
before the adjustment with the arguments of the siblings after the
adjustment. A user who sends the same values to all siblings might
fail this comparison after requesting something that the device does
not support, which was then adjusted.
This patch moves the call to the mlx5_dev_check_sibling_config
function after the relevant adjustments.
Fixes: 92d5dd4834 ("net/mlx5: check sibling device configurations mismatch")
Fixes: 2d241515eb ("net/mlx5: add devarg for extensive metadata support")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Multiple PMDs have dummy/noop Rx/Tx packet burst functions.
These dummy functions are very simple, introduce a common function in
the ethdev and update drivers to use it instead of each driver having
its own functions.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
This patch removes a redundant assert in mlx5_tx_packet_multi_tso().
That assert assured that the amount of bytes requested to be inlined
is greater than or equal to the minimum amount of bytes required
to be inlined. This requirement is either derived from the NIC
inlining mode or configured through devargs. When using TSO this
requirement can be disregarded, because on all NICs it is satisfied by
TSO inlining requirements, since TSO requires L2, L3, and L4 headers to
be inlined.
Fixes: 18a1c20044 ("net/mlx5: implement Tx burst template")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Meter capabilities reporting is not up to date.
Mellanox NICs support RFC2698 and RFC4115 as well as RFC2697.
Add these metering algorithms to the capabilities list.
Fixes: 6bc327b94f ("net/mlx5: fill meter capabilities using DevX")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The Committed Bucket Size (CBS) calculation tries to fit into the
8-bit wide mantissa field by setting 256 as its maximum value.
To compensate for this increase in the mantissa value, the exponent
value has to be reduced by 8. But this gives a negative exponent value
for CBS less than 128 (for example, CBS = 64 = 256 * 2^-2), and a
negative exponent value is not supported by the NIC. Adjust the CBS
calculation only for values bigger than 128 to allow both small and
big bucket sizes.
Fixes: 3bd26b23ce ("net/mlx5: support meter profile operations")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Due to the updated modify field action immediate value buffer
pattern [1], the implicit shift for the metadata is not needed anymore
and should be removed.
[1] commit 40c8fb1fd3 ("net/mlx5: update modify field action")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
As a modify field action immediate source parameter, the metadata
should follow the CPU endianness (according to the SET_META action
structure format). The mlx5 PMD wrongly handled the immediate
parameter metadata buffer as big-endian, resulting in a wrong metadata
set action with incorrect endianness.
Fixes: 40c8fb1fd3 ("net/mlx5: update modify field action")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Functions like free, rte_free, and rte_mempool_free
already handle NULL pointers, so the checks here are not necessary.
Remove redundant NULL pointer checks before free functions,
found by nullfree.cocci.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
The support for linking rte_pmd_mlx5.h functions with
C++ applications was missing.
Signed-off-by: Elena Agostini <eagostini@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Currently, the root table as a destination is not supported.
A jump action which would finally be translated to the underlying root
table in rdma-core should be rejected.
Fixes: f78f747f41 ("net/mlx5: allow jump to group lower than current")
Cc: stable@dpdk.org
Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
To optimize the datapath, the mlx5 PMD checked for the mark action on
flow creation and flagged the possible destination RxQs (through
queue/RSS actions), then enabled the mark action logic only for the
flagged RxQs.
The mark action did not work if no queue/RSS action was in the same
flow, even when the user used multi-group logic to manage the flows.
So, if the mark action was performed in group X and the packet was
moved to group Y > X before being forwarded to Rx queues, SW did not
deliver the mark ID to the mbuf.
Flag the Rx datapath to report the mark action for any queue when the
driver detects the first mark action after the dev_start operation.
Fixes: 8e61555657 ("net/mlx5: fix shared RSS and mark actions combination")
Cc: stable@dpdk.org
Signed-off-by: Raja Zidane <rzidane@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Preparation of the stride size and the number of strides for
Multi-Packet RQ was updated recently to accommodate the hardware
limitation on the minimum WQE size. A wrong assertion was introduced
to ensure this limitation is met. Assert instead that the configured
WQE size is not less than the minimum supported size.
Fixes: 34776af600 ("net/mlx5: fix MPRQ stride devargs adjustment")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The maximum packet headers size for TSO is calculated as a sum of
Ethernet, VLAN, IPv6 and TCP headers (plus inner headers).
The rationale behind choosing IPv6 and TCP is that their headers are
bigger than IPv4 and UDP respectively, giving us the maximum possible
headers size. But this is not true for L3 headers. The IPv4 header
size (20 bytes) is smaller than the IPv6 header size (40 bytes) only
in the default case: there are up to 10 optional 32-bit header fields,
called Options, when IHL > 5, so the maximum size of the IPv4 header
is 60 bytes.
Choosing the wrong maximum packet headers size causes an inability to
transmit multi-segment TSO packets with IPv4 Options present. The PMD
checks that it is possible to inline all the packet headers, and fails
if the packet headers size exceeds the expected maximum size.
The maximum packet headers size was set to 192 bytes before, but its
value has been reduced during the Tx path refactor activity.
Restore the proper maximum packet headers size for TSO.
Fixes: 50724e1bba ("net/mlx5: update Tx definitions")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
A debug assertion in Single-Packet Receive Queue (SPRQ) mode
required all Rx mbufs to have a 128 byte headroom,
based on the assumption that rte_pktmbuf_init() sets it.
However, rte_pktmbuf_init() may set a smaller headroom
if the dataroom is insufficient, e.g. this is a natural case
for split buffer segments. The headroom can also be larger.
Only check the headroom size when vectored Rx routines
are used because they rely on it. Relax the assertion
to require sufficient headroom size, not an exact one.
Fixes: a0a45e8af7 ("net/mlx5: configure Rx queue for buffer split")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When building with -Db_sanitize=thread, GCC gives a warning:
drivers/net/mlx5/mlx5_flow_meter.c: In function ‘mlx5_flow_meter_create’:
drivers/net/mlx5/mlx5_flow_meter.c:1170:33: warning: ‘legacy_fm’ may be
used uninitialized in this function [-Wmaybe-uninitialized]
This is a false-positive: legacy_fm is initialized and used
if and only if priv->sh->meter_aso_en is false.
Work around this by initializing legacy_fm to NULL.
Add an assertion before legacy_fm use in case the logic changes.
Fixes: 4443201863 ("net/mlx5: support meter creation with policy")
Cc: stable@dpdk.org
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Add support for the imissed counter using the DevX API on Windows.
imissed is queried by creating a queue counter for the port, attaching
it to all created RQs and querying the "out_of_buffer" field.
If the counter cannot be created, imissed will always report 0.
Signed-off-by: Tal Shnaiderman <talshn@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When an application creates several flows to match on a GRE tunnel
without explicitly specifying the GRE protocol type value in the flow
rules, the PMD translates that to a zero mask.
rdma-core cannot distinguish between different inner flow types and
produces identical matchers for each zero mask.
The patch extracts the inner header type from the flow rule and forces
it in the GRE protocol type, if the application did not specify any.
Fixes: 84c406e745 ("net/mlx5: add flow translate function")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The PMD RSS expansion scheme by default compiles flow rules for all
flow item types that may branch out from a stub supplied by the
application.
For example,
ETH can lead to VLAN, IPv4 or IPv6.
IPv4 can lead to UDP, TCP, IPv4 or IPv6.
If the application explicitly specified the next protocol type, the
expansion must use that option only and not create flows with other
protocol types.
The PMD ignored explicit next protocol values in GRE and VXLAN-GPE.
The patch updates the RSS expansion for GRE and VXLAN-GPE with
explicit next protocol settings.
Fixes: c7870bfe09 ("ethdev: move RSS expansion code to mlx5 driver")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Fixed the assertion on the flags set in pkt->ol_flags for vectorized
MPRQ. With vectorized MPRQ, the CQs are processed before copying the
MPRQ buffers, so the valid assertion is that the expected flag is set,
and not that pkt->ol_flags equals this flag alone.
Fixes: 0f20acbf5e ("net/mlx5: implement vectorized MPRQ burst")
Cc: stable@dpdk.org
Signed-off-by: Lior Margalit <lmargalit@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
In ASO object creation (WQE, CQE and MR), the socket number is given
as a parameter.
The selection was wrongly hardcoded to socket 0, even if the user did
not configure memory for this socket.
This patch replaces the selection with the default socket
(SOCKET_ID_ANY).
Fixes: f935ed4b64 ("net/mlx5: support flow hit action for aging")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
In Multi-Packet RQ creation, the user can choose the number of strides
and their size in bytes. The user sets them using specific devargs, one
for each of these parameters.
The above two parameters determine the size of the WQE, which is
actually the product of the two.
If the user selects values that are not in the supported range, the PMD
changes them to default values. However, apart from the range
limitations for each parameter individually, there is also a minimum
value for their product. When the user selects values whose product is
lower than the minimum value, no adjustment is made and the creation of
the WQE fails.
This patch adds an adjustment in these cases as well. When the user
selects values whose product is lower than the minimum, they are
replaced with the default values.
Fixes: ecb160456a ("net/mlx5: add device parameter for MPRQ stride size")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
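A minimal sketch of the added adjustment, with placeholder limits standing in for the real device capabilities and PMD defaults:

    #include <stdint.h>
    #include <stdio.h>

    #define LOG_STRIDE_NUM_DEF  6   /* default: 2^6 strides */
    #define LOG_STRIDE_SIZE_DEF 11  /* default: 2^11-byte strides */
    #define LOG_MIN_WQE_SIZE    14  /* minimal log2 of the num * size product */

    /* If the product of the user-selected stride number and stride size
     * (a sum in log2 terms) is below the device minimum, fall back to
     * the default values for both parameters. */
    static void
    mprq_adjust(uint32_t *log_stride_num, uint32_t *log_stride_size)
    {
        if (*log_stride_num + *log_stride_size < LOG_MIN_WQE_SIZE) {
            *log_stride_num = LOG_STRIDE_NUM_DEF;
            *log_stride_size = LOG_STRIDE_SIZE_DEF;
        }
    }

    int
    main(void)
    {
        uint32_t num = 3, size = 9; /* 2^12 WQE, below the 2^14 minimum */

        mprq_adjust(&num, &size);
        printf("log_stride_num=%u log_stride_size=%u\n", num, size);
        return 0;
    }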
In the striding RQ management there are two important parameters, the
size of the single stride in bytes and the number of strides.
Both the data-path structure and the config structure keep the log of
the above parameters. However, their names do not mention that the
value is a log, which may be misleading, as if the fields represented
the values themselves.
This patch updates their names describing the values more accurately.
Fixes: ecb160456a ("net/mlx5: add device parameter for MPRQ stride size")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The MAC address fields are 48 bits wide and are processed
by the mlx5 PMD as two words. A bug was introduced in the
offset handling, causing malfunction of the MODIFY_FIELD action
with MAC address fields as source or destination and
with a non-zero field offset.
Fixes: 40c8fb1fd3 ("net/mlx5: update modify field action")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The mlx5_args function reads the devargs and checks if they are valid
for this driver and if not it returns an error.
This was normal behavior as long as all the devargs came to this
driver, but since it is possible to run several drivers together, the
function may return an error for another driver's devarg even though it
is completely valid.
In addition, the function did not allow the user to know which of the
devargs was incorrect, but returned an error without printing the
unknown devarg.
This patch eliminates the error return in the case of an unknown devarg,
and prints a warning for each such devarg specifically.
Fixes: 7b4f1e6bd3 ("common/mlx5: introduce common library")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Remove the use of double "the" as it does not make sense.
Cc: stable@dpdk.org
Signed-off-by: Sean Morrissey <sean.morrissey@intel.com>
Signed-off-by: Conor Fogarty <conor.fogarty@intel.com>
Acked-by: John McNamara <john.mcnamara@intel.com>
Reviewed-by: Conor Walsh <conor.walsh@intel.com>
Acked-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
While joining a shared Rx queue to the existing queue group,
the queue configuration is checked to be the same as the one
specified at the first group queue creation - all shared
queues should be created with identical configurations.
During Rx queue creation the buffer split segment
configuration can be altered - zero segment sizes are
substituted with the actual ones, inherited from the pools,
the number of segments can be extended to cover the maximal
packet length, etc. It means the actual queue segment
configuration cannot be used directly to match the
configuration provided in the queue setup call.
To resolve the issue, the original parameters are stored
in the shared queue structure and the check is performed
against these stored values.
Fixes: 09c2555303 ("net/mlx5: support shared Rx queue")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
GENEVE and VXLAN-GPE item matching is done similarly to GRE matching.
Users can skip the specification of the protocol type and expect that
this type is deduced from the inner header type automatically.
But the inner header type may not be specified in order to match all the
protocol types. In this case, PMD should not specify the protocol type.
Check if we have the inner header type before setting the protocol type.
Fixes: 690391dd0e ("net/mlx5: fix GENEVE protocol type translation")
Fixes: 861fa3796f ("net/mlx5: fix VXLAN-GPE next protocol translation")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
GRE protocol type is implicitly set in the matching translation in case
an application doesn't specify any type explicitly in a flow rule.
It is extracted from the inner header type, but this type may be absent.
In this case, GRE item matching is broken. Check if we have the inner
header type before setting it to allow matching on all GRE packets.
Fixes: be26e81bfc ("net/mlx5: fix GRE protocol type translation")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
mlx5_ind_table_obj_modify() was not changing the reference counters
of either the new or the old set of RxQs.
On the other hand, creation of the RSS incremented the RxQ refcnt.
If an RxQ was present in both the initial and the modified set,
its reference counter was incremented one extra time
compared to the queues that were only present in the new set.
This prevented releasing said RxQ resources on port stop:
flow indirect_action 0 create action_id 1 \
action rss queues 0 1 end / end
flow indirect_action 0 update 1 \
action rss queues 2 3 end / end
quit
...
mlx5_net: mlx5.c:1622: mlx5_dev_close():
port 0 some Rx queue objects still remain
mlx5_net: mlx5.c:1626: mlx5_dev_close():
port 0 some Rx queues still remain
Increment reference counters for the new set of RxQs
and decrement them for the old set of RxQs when needed.
Remove explicit referencing of RxQ from mlx5_ind_table_obj_attach()
because it reuses mlx5_ind_table_obj_modify() code doing this.
Fixes: ec4e11d41d ("net/mlx5: preserve indirect actions on restart")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Matan Azrad <matan@nvidia.com>
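A minimal sketch of the balanced reference counting (illustrative structures, not the actual PMD code): the new set is referenced before the old set is dereferenced, so a queue present in both sets keeps a net count of zero.

    #include <stdint.h>

    struct rxq { uint32_t refcnt; };

    static void rxq_ref(struct rxq *q)   { q->refcnt++; }
    static void rxq_deref(struct rxq *q) { q->refcnt--; }

    static void
    ind_table_modify(struct rxq **old_set, unsigned int old_n,
                     struct rxq **new_set, unsigned int new_n)
    {
        unsigned int i;

        for (i = 0; i < new_n; i++)
            rxq_ref(new_set[i]);   /* take references on the new set */
        /* ... update the hardware indirection table here ... */
        for (i = 0; i < old_n; i++)
            rxq_deref(old_set[i]); /* drop references on the old set */
    }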
mlx5_ind_table_obj_setup() was incrementing RxQ reference counters
even when the port was stopped, which prevented RxQ release
and triggered an internal assertion.
Only increment reference counter when the port is started.
Fixes: ec4e11d41d ("net/mlx5: preserve indirect actions on restart")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Matan Azrad <matan@nvidia.com>
If mlx5_rxq_start() failed and rxq_ctrl was not initialized,
mlx5_rxq_obj_verify() would segfault in an attempt to dereference it.
Add a check that rxq_ctrl is not NULL before accessing its members.
Fixes: 09c2555303 ("net/mlx5: support shared Rx queue")
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
This patch fixes a segfault that was triggered when a port with
indirect actions created was closed. The segfault occurred only when
RTE_LIBRTE_MLX5_DEBUG was defined. It was caused by a redundant
decrement of the Rx queues' refcount:
- refcount was decremented when the port was stopped and indirect
actions were detached from Rx queues (port stop),
- refcount was decremented when indirect action objects were destroyed
(port close or destroying of an indirect action).
This patch fixes the behavior. Dereferencing Rx queues is done if and
only if the indirect action is explicitly destroyed by the user or
detached on port stop. Dereferencing Rx queues on the action destroy
operation depends on an argument to the wrapper of the indirect action
destroy operation, introduced in this patch.
Fixes: ec4e11d41d ("net/mlx5: preserve indirect actions on restart")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
This patch fixes the assertion failure triggered when the user
configured minimum inline length requirements and the application
transmitted multi-segment packets. The failure was triggered when the
space left in the Tx queue was not enough to cover this requirement.
This patch limits the length of the data to be copied to the available
space in the Tx queue.
Fixes: cacb44a099 ("net/mlx5: add no-inline Tx flag")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
In a meter hierarchy, all the meters are marked with having RSS if
the final meter's termination action is RSS.
When validating a flow rule with meter hierarchy, the RSS action
should not be fetched from the current meter if it is not the final
one.
The fate action union holds the next meter ID instead of the pointer to the
RSS action. By using the final meter in the hierarchy, the flow rule
validation will succeed without any crash caused by the invalid RSS
action pointer access.
Fixes: 1ce19ab1f4 ("net/mlx5: fix RSS validation with meter policy")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Reviewed-by: Li Zhang <lizh@nvidia.com>
Reviewed-by: Shun Hao <shunh@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
If there are sample action and the meter action in the same flow,
mlx5 PMD performs several levels of splitting. For example, sampling
feature splits the original flow into prefix subflow with sample action,
and suffix subflow with the rest of actions. Then, metering feature
splits the sampling suffix subflow into its own meter subflows.
If a mark action was added before the sample and meter actions, the
flow mark flag was kept in the sample subflows but reset on
handling the metering split, causing the flow mark value to be missed.
This patch keeps the flow mark flag of previous subflow, and then
the following meter subflows handle the flow mark correctly.
Fixes: 9ade91dfe8 ("net/mlx5: fix group value of sample suffix flow")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Mempool registration was not correctly processing
mempools with RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF flag set
("pinned mempools" for short), because it is not known
at registration time whether the mempool is a pktmbuf one,
and its elements may not yet be initialized to analyze them.
Attempts had been made to recognize such pools,
but there was no robust solution, only the owner of a mempool
(the application or a device) knows its type.
This patch extends common/mlx5 registration code
to accept a hint that the mempool is a pinned one
and uses this capability from net/mlx5 driver.
1. Remove all code assuming pktmbuf pool type
or trying to recognize the type of a pool.
2. Register pinned mempools used for Rx
and their external memory on port start.
Populate the MR cache with all their MRs.
3. Change Tx slow path logic as follows:
3.1. Search the mempool database for a memory region (MR)
by the mbuf pool and its buffer address.
3.2. If no MR for the address is found for the mempool,
and the mempool contains only pinned external buffers,
perform the mempool registration of the mempool
and its external pinned memory.
3.3. Fall back to using page-based MRs in other cases
(for example, a buffer with externally attached memory,
but not from a pinned mempool).
Fixes: 690b2a88c2 ("common/mlx5: add mempool registration facilities")
Fixes: fec28ca0e3 ("net/mlx5: support mempool registration")
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The mlx5 PMD introduced the table id attribute to allow multiple
flow tables on the same table level for flow metering; there can be
multiple flow table objects with the same table level but different
table ids.
If the extended metadata mode is enabled, all flows containing
destination Queue/RSS actions are split into two subflows - prefix one
jumps to the MLX5_FLOW_MREG_CP_TABLE_GROUP flow table to copy
MARK action data, and suffix one to perform the destination Queue/RSS
action. The table_id for the jump in the metadata split prefix flow
is always 0.
If the flow itself was the metering split suffix subflow, the table id
was set to 1 in the flow split structure and the metadata split suffix
subflow was created in a table with the wrong table id, causing the
metadata suffix flow mismatch.
This patch resets the table id to 0 while creating the metadata
suffix flows.
Fixes: 51ec04dc7b ("net/mlx5: connect meter policy to created flows")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
In the metadata flow split, the PMD created the prefix subflow
with the Queue or RSS action removed and appended the set tag and
copy table jump actions. If the flow being split for metadata
was the meter prefix subflow, the driver was supposed to share the same
meter split tag action for the metadata split flow. There was a wrong
check for the preceding meter split tag action, causing the metadata
split set tag action to be appended, so the meter suffix subflow was
missed due to tag value mismatch.
This patch adds the checking before copying into extend action list,
to make sure the correct shared tag is used.
Fixes: 8d72fa6689 ("net/mlx5: share tag between meter and metadata")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
If the modify field action requests a field copy/set transaction
from another field, the destination field hardware bit offset was
assigned incorrectly for non-zero byte offsets, causing wrong
translations for the fields with sizes larger than 32 bits.
Fixes: 40c8fb1fd3 ("net/mlx5: update modify field action")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
mlx5 PMD incorrectly overwrote outer layer fields in MPLS tunnel
rte_flow patterns using defaults for MPLS tunnels. This included
overwriting UDP destination port in MPLSoUDP and GRE protocol field in
MPLSoGRE.
This patch fixes this behavior. If the application provides values in
the flow pattern items preceding the MPLS flow item, the provided
values will be used; otherwise the defaults will be applied.
Fixes: d1abe664dd ("net/mlx5: add MPLS to Direct Verbs flow engine")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Assuming a user tried to send multi-segment packets, with
RTE_PMD_MLX5_FINE_GRANULARITY_INLINE flag set, using a device with
minimum inlining requirements (such as ConnectX-4 Lx, or when the user
specified them explicitly), sending such packets caused a segfault.
The segfault was caused by failed invariants in the
mlx5_tx_packet_multi_inline function.
This patch introduces logic for multi-segment packets with the
RTE_PMD_MLX5_FINE_GRANULARITY_INLINE flag set, to omit mbuf scanning
for filling the inline buffer and to inline only the minimal amount of
data required.
Fixes: ec837ad0fc ("net/mlx5: fix multi-segment inline for the first segments")
Cc: stable@dpdk.org
Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Mempool registration code had a wrong assumption that it is always
dealing with packet mempools and always called rte_pktmbuf_priv_flags(),
which returned a random value for different types of mempools.
In particular, it could consider MPRQ mempools as having externally
pinned buffers, which is wrong.
Packet mempools cannot be reliably recognized, but it is sufficient to
check whether the mempool is a packet one: if it is not, it cannot have
externally pinned buffers. This is done by comparing the mempool
private data size to that of packet mempools.
Fixes: 690b2a88c2 ("common/mlx5: add mempool registration facilities")
Fixes: fec28ca0e3 ("net/mlx5: support mempool registration")
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The lock sh->txpp.mutex was not correctly released on one return path
of the cleanup function, potentially causing a deadlock.
Fixes: d133f4cdb7 ("net/mlx5: create clock queue for packet pacing")
Cc: stable@dpdk.org
Signed-off-by: Chengfeng Ye <cyeaa@connect.ust.hk>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
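A generic sketch of the single-exit cleanup pattern that prevents this bug class (POSIX mutexes used for illustration): every return path goes through the unlock.

    #include <pthread.h>

    static pthread_mutex_t txpp_mutex = PTHREAD_MUTEX_INITIALIZER;

    static int
    txpp_cleanup(int have_work, int fail)
    {
        int ret = 0;

        pthread_mutex_lock(&txpp_mutex);
        if (!have_work) {
            ret = -1;
            goto exit; /* the early-return path must still unlock */
        }
        if (fail)
            ret = -2;
        /* ... release queue resources ... */
    exit:
        pthread_mutex_unlock(&txpp_mutex); /* every path releases the lock */
        return ret;
    }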
When a port starts in non-isolated mode,
an internal indirect RSS is created that includes all configured queues
and a flow rule is created that references this indirect RSS.
If before switching to non-isolated mode an indirect RSS was created
that includes the same set of queues, it would be reused at this point.
However, because the port had been stopped (or not yet started),
the TIR for this indirect RSS had been destroyed (or not yet created).
The flow rule could not be created and the port start failed.
Creation of TIRs is moved before configuring non-isolated mode flows,
but it is not enough because of the following issue.
Commit 0cedf34da7 ("net/mlx5: move Rx queue reference count")
changed mlx5_rxq_get() not to increment RxQ control structure
reference count, mlx5_rxq_ref() was introduced for this purpose.
mlx5_ind_table_obj_attach() was not updated to use the new function,
so when the port was stopped, the control structure reference count
of an RxQ used in RSS reached zero and the structure was destroyed.
Use mlx5_rxq_ref() to keep RxQ control structure
needed for indirect RSS persistence across port restart.
Fixes: ec4e11d41d ("net/mlx5: preserve indirect actions on restart")
Fixes: 0cedf34da7 ("net/mlx5: move Rx queue reference count")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The RSS can be one of the fate actions when creating a meter with
policy. In the previous implementation, the RSS validation was missed
when creating a flow rule with such a meter, because a
policy meter was created first and then used in the rule. In the
stage of meter creation, no rte_flow_item* information was provided.
An unnecessary RSS expansion might be called since the validation was
missed and would cause an unexpected error of the rule creation. Even
though the rule should be rejected from the very beginning, it would
cause confusion. There might be some other errors when the validation
was missed.
Adding the RSS validation inside the meter action validation will
prevent the code from continuing when there is a conflict between the
items, other actions and the policy meter RSS action.
Fixes: 4443201863 ("net/mlx5: support meter creation with policy")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Reviewed-by: Li Zhang <lizh@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
When an application creates several flows to match on GRE tunnel
without explicitly specifying the GRE protocol type value in the
flow rules, the PMD will translate that to a zero mask.
RDMA-CORE cannot distinguish between different inner flow types and
produces identical matchers for each zero mask.
The patch extracts the inner header type from the flow rule and forces
it in the GRE protocol type, if the application did not specify any
protocol type value explicitly.
Fixes: fc2c498ccb ("net/mlx5: add Direct Verbs translate items")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
When an application creates several flows to match on GENEVE tunnel
without explicitly specifying the GENEVE protocol type value in the
flow rules, the PMD will translate that to a zero mask.
RDMA-CORE cannot distinguish between different inner flow types and
produces identical matchers for each zero mask.
The patch extracts the inner header type from the flow rule and forces
it in the GENEVE protocol type, if the application did not specify any
protocol type value explicitly.
Fixes: e59a5dbcfd ("net/mlx5: add flow match on GENEVE item")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
RFC-2784 allows any valid Ethernet type in GRE protocol type field.
Add Ethernet to GRE RSS expansion.
Fixes: f4b901a46a ("net/mlx5: add flow GRE item")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
VXLAN-GPE extends VXLAN protocol and provides the next protocol
field specifying the first inner header type.
The application can assign some explicit value to the
VXLAN-GPE::next_protocol field or set it to the default one. In the
latter case, the rdma-core library cannot recognize the matcher
built by the PMD correctly, and the resulting hardware configuration
misses the inner header match.
The patch forces the VXLAN-GPE::next_protocol assignment if the
application did not explicitly assign it to a non-default value.
Fixes: 90456726eb ("net/mlx5: fix VXLAN-GPE item translation")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Inside the MR control structure there is a pointer to the common device.
This pointer enables access to the global cache as well as hardware
objects that may be required in case a new MR needs to be created.
The purpose of adding this pointer into the MR control structure was to
avoid passing it as a parameter to all the functions that search for an
MR in the caches.
However, adding it to this structure increased the Rx and Tx data-path
structures; all the fields that followed it were slightly moved away,
which caused a reduction in performance.
This patch removes the pointer from the structure. It can be accessed
through the "dev_gen_ptr" existing field using the "container_of"
operator.
Fixes: 334ed198ab ("common/mlx5: remove redundant parameter in MR search")
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
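A self-contained sketch of the container_of technique described above (the structures are illustrative; only the dev_gen_ptr field name is taken from the text):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct common_device {
        uint32_t dev_gen; /* bumped when the global MR cache changes */
        /* ... global MR cache, hardware objects ... */
    };

    struct mr_ctrl {
        uint32_t *dev_gen_ptr; /* points at common_device::dev_gen */
    };

    /* Recover the owning device from the member pointer on demand,
     * instead of storing a device pointer in every MR control. */
    static struct common_device *
    mr_ctrl_to_dev(struct mr_ctrl *ctrl)
    {
        return container_of(ctrl->dev_gen_ptr, struct common_device, dev_gen);
    }

    int
    main(void)
    {
        struct common_device dev = { .dev_gen = 1 };
        struct mr_ctrl ctrl = { .dev_gen_ptr = &dev.dev_gen };

        printf("recovered dev_gen=%u\n", mr_ctrl_to_dev(&ctrl)->dev_gen);
        return 0;
    }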
The attribute recording the global control of the hairpin queues'
delay drop was defined as a bit-field with one bit, with the intention
to reduce the memory overhead. Meanwhile, the macro was defined as the
enumerated value 0x2.
No matter what value was input via the devarg, the lowest bit was
always zero and the higher bits were ignored. For hairpin queues, the
delay drop attribute could never be enabled.
With this commit, a double logical negation is used to fix this.
Fixes: febcac7b46 ("net/mlx5: support Rx queue delay drop")
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
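A minimal demonstration of the truncation and the double-negation fix (the 0x2 flag value is taken from the text; structure names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define HAIRPIN_DELAY_DROP 0x2u /* enumerated flag value */

    struct rxq_attr {
        uint32_t delay_drop:1; /* one-bit attribute field */
    };

    int
    main(void)
    {
        struct rxq_attr attr;
        uint32_t devarg = HAIRPIN_DELAY_DROP;

        attr.delay_drop = devarg & HAIRPIN_DELAY_DROP;     /* bug: 0x2 truncates to 0 */
        printf("buggy: %d\n", attr.delay_drop);
        attr.delay_drop = !!(devarg & HAIRPIN_DELAY_DROP); /* fix: normalize to 0/1 */
        printf("fixed: %d\n", attr.delay_drop);
        return 0;
    }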
There was a redundant check for the enabled E-Switch; it
resulted in device probing failure if Tx scheduling was
requested and E-Switch was enabled.
Fixes: f17e4b4ffe ("net/mlx5: add Tx scheduling check on queue creation")
Cc: stable@dpdk.org
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The routine converting the RTE flow modify field action into the
driver's field presentation did not specify the field mask
correctly, which resulted in wrong conversion for
the actions with shifted fields.
Fixes: 40c8fb1fd3 ("net/mlx5: update modify field action")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The global redirection table is used to create the default flow
rules for the ingress traffic with the lowest priority. It is also
used to create the default RSS rule in the destination table when
there is a tunnel offload.
To update the RETA in-flight, there is no restriction in the ethdev
API. In the previous implementation of mlx5, a port restart was
needed to make the new configuration take effect.
The restart is heavy, e.g., all the queues will be released and
reallocated, and users' rules will be flushed. Since the restart is
internal, there is a risk of crashing the application when some change
is introduced in the ethdev but no workaround is done in the mlx5 PMD.
The users' rules, including the default miss rule for tunnel
offload, should not be impacted by the RETA update. It is improper
to flush all rules when updating RETA.
With this patch, only the default rules will be flushed and
re-created with the new table configuration.
Fixes: 3f2fe392bd ("net/mlx5: fix crash during RETA update")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
For the flows containing sample action, the tag action was added
implicitly to store the unique flow index into metadata register in the
split prefix subflow, and then match on this index in the split suffix
subflow. The metadata register for the flow index of the sample split
subflows was also used to store the application metadata TAG 0 item;
this might cause TAG 0 corruption in the flows with sample actions.
This patch uses the same metadata register C index as used for
ASO action since it's reserved and not used directly by the application,
and adds the checking in validation to make sure not to conflict
with ASO CT in the same flow.
Fixes: b4c0ddbfcc ("net/mlx5: split sample flow into two sub-flows")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The tunnel offload API allows the application to restore a packet to
its original form if the chain of flows is missed after the DECAP
action.
The MLX5 PMD provides tunnel offload support only if the DV API is
enabled. The patch verifies DV availability before proceeding with
tunnel offload tasks.
Fixes: 4ec6360de3 ("net/mlx5: implement tunnel offload")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Registration of packet mempools with RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF
was performed incorrectly: after population, the chunks of such
mempools only contain memory for rte_mbuf structures, while pointers to
the actual external memory are not yet filled. MR LKeys could not be
obtained for external memory addresses of such mempools. The Rx
datapath assumes all used mempools are registered and does not fall
back to dynamic MR creation in such a case, so no packets could be
received.
Skip registration of extmem pools on population because it is useless.
If used for Rx, they are registered at port start.
During registration, recognize such pools, inspect their mbufs
and recover the pages they reside in.
While MRs for these pages may already be created by rte_dev_dma_map(),
they are not reused to avoid synchronization on Rx datapath
in case these MRs are changed in the database.
Fixes: 690b2a88c2 ("common/mlx5: add mempool registration facilities")
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
When a user specifies a meter policy like "g_actions queue / end
y_actions queue / r_action drop / end", the validation logic missed
setting the meter policy mode, so it took a random value from the stack.
Define ALL policy modes for the mentioned cases.
Fixes: 4b7bf3ffb4 ("net/mlx5: support yellow in meter policy validation")
Cc: stable@dpdk.org
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Bing Zhao <bingz@nvidia.com>
After yellow color actions in the metering policy were supported,
the RSS could be used for both green and yellow colors and only the
queues attribute could be different.
When specifying the attributes of an RSS, some fields can be ignored
and some default values will be used in the PMD. For example, there is
a default RSS key in the PMD and it will be used to create the TIR if
nothing is provided by the application.
The default value cases were missed in the current implementation
and this could cause some false positives or crashes.
The comparison function should be adjusted to take all cases into
consideration when RSS is used for both green and yellow colors.
Fixes: 4b7bf3ffb4 ("net/mlx5: support yellow in meter policy validation")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Due to kernel driver / FW issues in direct MKEY creation using the DevX
API, this patch replaces the counter MR creation with the wrapped mkey
API.
Fixes: 5382d28c21 ("net/mlx5: accelerate DV flow counter transactions")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Signed-off-by: Matan Azrad <matan@nvidia.com>
The routine to look up an LKey on Rx was assuming that the mbuf address
always belongs to a single mempool: the one associated with an RxQ
or the MPRQ mempool. This assumption is false for the split buffers
case. A wrong LKey was looked up, resulting in completion errors.
Modify the lookup routines to look up the LKey in mbuf->pool
for non-MPRQ cases, both on the Rx datapath and on queue initialization.
Fixes: fec28ca0e3 ("net/mlx5: support mempool registration")
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Matan Azrad <matan@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The rdma-core library can map a doorbell register in two ways,
depending on the environment variable "MLX5_SHUT_UP_BF":
- as regular cached memory, when the variable is either missing or set
to zero. This type of mapping may cause significant doorbell
register writing latency and requires an explicit memory write
barrier to mitigate this issue and prevent write combining.
- as non-cached memory, when the variable is present and set to a
non-zero value. This type of mapping may cause a performance impact
under heavy loading conditions, but the explicit write memory barrier
is not required, which may improve core performance.
The UAR creation function maps a doorbell in one of the above ways
according to the system. At run time, it always adds an explicit memory
barrier after writing to it.
In cases where the doorbell was mapped as non-cached memory, the
explicit memory barrier is unnecessary and may impair performance.
The commit [1] solved this problem for the Tx queue. At run time, it
checks the mapping type and provides the memory barrier after writing to
a Tx doorbell register if it is needed. The mapping type is extracted
directly from the uar_mmap_offset field in the queue properties.
This patch shares this code between the drivers and extends the above
solution for each of them.
[1] commit 8409a28573
("net/mlx5: control transmit doorbell register mapping")
Fixes: f8c97babc9 ("compress/mlx5: add data-path functions")
Fixes: 8e196c08ab ("crypto/mlx5: support enqueue/dequeue operations")
Fixes: 4d4e245ad6 ("regex/mlx5: support enqueue")
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
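A simplified sketch of the shared helper's logic (names are illustrative; the real code derives the mapping type from the uar_mmap_offset queue property and uses DPDK's rte_wmb()):

    #include <stdbool.h>
    #include <stdint.h>

    /* Write a doorbell record and emit the explicit barrier only for
     * the cached, write-combining UAR mapping; the non-cached mapping
     * does not need it. */
    static inline void
    uar_db_write(volatile uint64_t *db_reg, uint64_t data, bool db_nc)
    {
        *db_reg = data;
        if (!db_nc)
            __sync_synchronize(); /* flush the write-combining buffer */
    }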
The Tx doorbell has different virtual addresses per process.
The secondary process takes the UAR physical page ID of the primary and
mmaps it to its own virtual address.
The primary doorbell references were saved in two shared memory
locations: the TxQ structure and a dedicated doorbell array.
Remove the doorbell reference from the TxQ structure and move the
primary processes to take the UAR information from the primary doorbell
array.
Cc: stable@dpdk.org
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
In the multi-process mechanism, there are things that the secondary
process does not perform itself but asks the primary process to perform
for it.
There is a special API for communication between the processes that
receives the parameters necessary for the specific action required, as
well as a special structure called mp_id that contains the port number
of the process, through which the handling process finds the relevant
ETH device.
One of the operations performed through this mechanism is the creation
of a memory region, where the secondary process sends the virtual
address as a parameter and the mp_id structure with the port number
inside it.
However, once the memory area management is shared between the drivers
and neither the port number nor the ETH device is relevant to them
anymore, it seems unnecessary to keep communicating the mp_id variable
between the processes.
In this patch we remove the use of the above structure for all MR
management and add, as a specific parameter of the operations, a
pointer to the common device that contains everything needed to
create/register an MR.
Fixes: 9f1d636f3e ("common/mlx5: share MR management")
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Reviewed-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Memory region management has recently been shared between drivers,
including the cache search in the data plane.
The initial search in the queue's local linear cache usually
yields a result, and there is no need to continue searching in the
next-level caches.
The function that searches in the local cache gets a pointer to the
device as a parameter, which is not necessary for its own operation
but only for the subsequent searches (which, as mentioned, usually do
not happen).
Passing the device to the function and maintaining it takes some
time and causes some impact on performance.
Add the pointer to the device as a field of the mr_ctrl structure. The
field will be updated during control path and will be used only when
needed in the search.
Fixes: fc59a1ec55 ("common/mlx5: share MR mempool registration")
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Reviewed-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The delay drop is a common feature managed on a per-device basis,
and the kernel driver is the one responsible for its initialization
and rearming.
By default, the timeout value is set to activate the delay drop when
the driver is loaded.
A private flag "dropless_rq" is used to control the rearming: only
when it is on will the rearming be handled once a timeout event is
received. Otherwise, the delay drop will be deactivated after the
first timeout occurs and none of the Rx queues will have this feature.
The PMD queries this flag and warns the application when some queues
are created with delay drop but the flag is off.
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
For Ethernet RQs, if all receiving descriptors are
exhausted, the packets being received will be dropped. This behavior
prevents slow or malicious software entities at the host from
affecting the network. For hairpin cases, even though no
software is involved in the packet forwarding from the Rx to the Tx
side, some hiccup in the hardware or back pressure from the Tx side
may still cause the descriptors to be exhausted. In certain scenarios
it may be preferred to configure the device to avoid such packet
drops, assuming the posting of descriptors will resume shortly.
To support this, a new devarg "delay_drop" is introduced. By default,
the delay drop is enabled for hairpin Rx queues and disabled for
standard Rx queues. This value is used as a bit mask:
- bit 0: enablement of standard Rx queue
- bit 1: enablement of hairpin Rx queue
And this attribute will be applied to all Rx queues of a device.
The "rq_delay_drop" capability in the HCA_CAP is checked before
creating any queue. If the hardware capabilities do not support
this delay drop, all the Rx queues will still be created without
this attribute, and the devarg setting will be ignored even if it
is specified explicitly. A warning log is used to notify the
application when this occurs.
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
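A small sketch of how such a bitmask could be interpreted (flag names and the helper are illustrative, not the PMD API):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define DELAY_DROP_STANDARD (UINT32_C(1) << 0) /* bit 0: standard Rx queues */
    #define DELAY_DROP_HAIRPIN  (UINT32_C(1) << 1) /* bit 1: hairpin Rx queues */

    static bool
    rxq_delay_drop_enabled(uint32_t devarg, bool is_hairpin, bool hca_cap)
    {
        if (!hca_cap)
            return false; /* HW lacks rq_delay_drop: the devarg is ignored */
        return devarg & (is_hairpin ? DELAY_DROP_HAIRPIN : DELAY_DROP_STANDARD);
    }

    int
    main(void)
    {
        /* Default per the text: enabled for hairpin, disabled for standard. */
        uint32_t def = DELAY_DROP_HAIRPIN;

        printf("standard: %d, hairpin: %d\n",
               rxq_delay_drop_enabled(def, false, true),
               rxq_delay_drop_enabled(def, true, true));
        return 0;
    }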
When receiving a packet, the mlx5 PMD saves the mbuf port number from
the RxQ data.
To support shared RxQ, the port number is saved into the RQ context as
the user index. A received packet resolves its port number from the
CQE user index, which is derived from the RQ context.
The legacy Verbs API doesn't support setting the RQ user index, so
there the port number is still read from the RxQ.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
This patch introduces shared RxQ. All shared Rx queues with the same
group and queue ID share the same rxq_ctrl. Rxq_ctrl and rxq_data are
shared: all queues from different member ports share the same WQ and
CQ, essentially one Rx WQ, and mbufs are filled into this singleton WQ.
The shared rxq_data is set into the device Rx queues of all member
ports as the RxQ object, used for receiving packets. Polling a queue
of any member port returns packets of any member; mbuf->port is used
to identify the source port.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The Rx queue data list (priv->rxqs) can be replaced by the Rx queue
list (priv->rxq_privs), so remove it and replace it with the universal
wrapper API.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
To support shared Rx queue, move the DevX RQ, which is a per-queue
resource, to the Rx queue private data.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
To prepare for shared Rx queue, remove the port info from the
shareable Rx queue control.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Hairpin info of an Rx queue can't be shared, so move it to the private
queue data.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The Rx queue reference count is the counter of the RQ, used to count
references to the RQ object. To prepare for shared Rx queue, this patch
moves it from rxq_ctrl to the Rx queue private data.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
To prepare for shared Rx queue, split the RxQ data into shareable and
private parts.
Struct mlx5_rxq_priv is the per-queue data.
Struct mlx5_rxq_ctrl holds the shared queue resources and data.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
If an error happened during Rx queue mbuf allocation, a boolean value
was returned, while per the description the return value should be an
error number. This patch returns a negative error number instead.
Fixes: 0f20acbf5e ("net/mlx5: implement vectorized MPRQ burst")
Cc: stable@dpdk.org
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The hardware Receive Memory Pool (RMP) object holds the destination for
incoming packets/messages that are routed to the RMP through RQs. RMP
enables sharing of memory across multiple Receive Queues: multiple
Receive Queues can be attached to the same RMP and consume memory
from that shared pool. When using RMPs, completions are reported to the
CQ pointed to by the RQ, and the user index set at RQ creation time is
carried in the completion entry.
This patch enables RMP based RQ, RMP is created when mlx5_devx_rq.rmp is
set.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
This patch fixes stale field reference.
Fixes: a18ac61133 ("net/mlx5: add metadata support to Rx datapath")
Cc: stable@dpdk.org
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The matcher is a steering engine entity that represents
the flow pattern for hardware to match. In order to
provide a match on the flex item pattern, the appropriate
matcher fields should be configured with values and masks
accordingly.
The flex item related matcher fields are an array of eight
32-bit fields to match with data captured by the sample registers
of the configured flex parser. One packet field, presented in the
item pattern, can be split between several sample registers,
and multiple fields can be combined together into a single
sample register to optimize hardware resource usage
(the number of sample registers is limited), depending on field
modes, widths and offsets. The actual mapping is complicated
and controlled by special translation data, built by the PMD
on flex item creation.
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
RTE Flow flex item configuration should be translated
into actual hardware settings:
- translate header length and next protocol field samplings
- translate data field sampling, the similar fields with the
same mode and matching related parameters are relocated
and grouped to be covered with minimal amount of hardware
sampling registers (each register can cover arbitrary
neighbour 32 bits (aligned to byte boundary) in the packet
and we can combine the fields with smaller lengths or
segments of bigger fields)
- input and output links translation
- preparing data for parsing flex item pattern on flow creation
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The DevX flex parsers can be shared between representors
within the same IB context. We should put the flex parser
objects into the shared list and engage the standard
mlx5_list_xxx API to manage them.
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
This patch is a preparation step for implementing the
flex item feature in the driver and it provides:
- external entry point routines for flex item
creation/deletion
- flex item objects management over the ports.
The flex item object keeps information about
the item created over the port - a reference counter
to track whether the item is in use by some active
flows, and a pointer to the underlying shared DevX
object, providing all the data needed to translate
the flow flex pattern into matcher fields according
to the hardware configuration.
Not too many flex items are expected to be
created on a port, so the design is optimized
for flow insertion rate rather than memory savings.
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
To handle the eCPRI protocol in the flows, the mlx5 PMD engages
the flex parser hardware feature. While we were implementing
eCPRI support we anticipated the flex parser usage extension,
and all related variables were named accordingly, containing
the 'flex' syllable. Now we are preparing to introduce the more
common approach of the flex item; in order to avoid naming conflicts
and improve the code readability, the eCPRI infrastructure
related variables are renamed as a preparation step.
Later, once we have the new flex item implemented, we could
consider refactoring the eCPRI protocol support to move it onto the
common flex item basis.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
MLX5 PMD uses reference counting to manage RX queue resources.
After port stop, shared RSS actions kept references to Rx queues,
preventing resource release. As a result, the internal PMD mempool
for such queues had been exhausted after a number of port restarts.
Diagnostic message from rte_eth_dev_start():
Rx queue allocation failed: Cannot allocate memory
Dereference RX queues used by indirect actions on port stop (detach)
and restore references on port start (attach) in order to allow RX queue
resource release, but keep indirect RSS across the port restart.
Replace queue IDs in HW by drop queue ID on detach and restore actual
queue IDs on attach.
When the port is stopped, create indirect RSS in the detached state.
As a result, MLX5 PMD is able to keep all its indirect actions
across port restart. Advertise this capability.
Fixes: 4b61b8774b ("ethdev: introduce indirect flow action")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Drop queue creation and destruction were not implemented for the DevX
flow engine, and the Verbs engine methods were used as a workaround.
Implement these methods for DevX so that there is a valid queue ID
that can be used regardless of queue configuration via API.
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Maximum available flow priority was discovered using Verbs API
regardless of the selected flow engine. This required some Verbs
objects to be initialized in order to use DevX engine. Make priority
discovery an engine method and implement it for DevX using its API.
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The RSS expansion algorithm is using a graph to find the possible
expansion paths. A graph node with the 'explicit' flag will be skipped,
if it is not found in the flow pattern.
The current implementation misses a check for the explicit flag when
expanding the pattern according to ETH item with EtherType.
For example:
testpmd> flow create 0 ingress pattern eth / ipv6 / udp / vxlan / eth
type is 2048 / end actions rss level 2 types udp end / end
The "eth type is 2048" item in the pattern may be expanded to "ETH IPv4".
The ETH node in the expansion graph is followed by VLAN node marked as
explicit. The fix is to skip the VLAN node and continue the expansion
with its next nodes, IPv4 and IPv6.
The expansion paths for the above example will be:
ETH IPV6 UDP VXLAN ETH END
ETH IPV6 UDP VXLAN ETH IPV4 UDP END
Fixes: 69d268b4ff ("net/mlx5: fix RSS expansion for explicit graph node")
Cc: stable@dpdk.org
Signed-off-by: Lior Margalit <lmargalit@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
The ASO meter action with flows creation could be supported on
multiple threads. The meter pools were created to manage the meter
object resources, if there is no room in the current meter pool then
resize the meter pool to the new pool size and free the old one.
There's a race condition when one thread resizes the meter pool and
the old pool resource is freed, while another thread queries the meter
object by index on the old pool; the returned value is invalid.
This patch adds a read-write lock to protect the pool resource while
resizing and querying.
Fixes: a5835d530f ("net/mlx5: optimize Rx queue match")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
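A generic sketch of the resize/query protection (POSIX rwlocks stand in for DPDK's rte_rwlock; the pool layout is illustrative): the writer replaces and frees the array only under the write lock, so a reader holding the read lock can never touch freed memory.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct pool {
        pthread_rwlock_t lock;
        int *objs;
        size_t size;
    };

    static void
    pool_resize(struct pool *p, size_t new_size)
    {
        int *bigger = calloc(new_size, sizeof(*bigger));

        pthread_rwlock_wrlock(&p->lock);
        memcpy(bigger, p->objs, p->size * sizeof(*p->objs));
        free(p->objs); /* the old array is freed under the write lock */
        p->objs = bigger;
        p->size = new_size;
        pthread_rwlock_unlock(&p->lock);
    }

    static int
    pool_query(struct pool *p, size_t idx)
    {
        int val = -1;

        pthread_rwlock_rdlock(&p->lock);
        if (idx < p->size)
            val = p->objs[idx]; /* safe: a resize cannot run concurrently */
        pthread_rwlock_unlock(&p->lock);
        return val;
    }

    int
    main(void)
    {
        struct pool p = { .lock = PTHREAD_RWLOCK_INITIALIZER,
                          .objs = calloc(4, sizeof(int)), .size = 4 };

        p.objs[2] = 7;
        pool_resize(&p, 8);
        printf("query(2) = %d\n", pool_query(&p, 2));
        free(p.objs);
        return 0;
    }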
The age action with flows creation could be supported on the multiple
threads. The age pools were created to manage the age resources, if
there is no room in the current pool then resize the age pool to the new
pool size and free the old one.
There's a race condition when one thread resizes the age pool and the
old pool resource is freed, while another thread queries the age action
value on the old pool, so the queried value is invalid.
This patch uses a read-write lock to protect the pool resource while
resizing and querying.
Fixes: a5835d530f ("net/mlx5: optimize Rx queue match")
Cc: stable@dpdk.org
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
If for any reason, a socket could not be opened, mlx5_pmd_socket_init()
could close the 0 fd (which is valid, and has a fair chance to be stdin),
since server_socket == 0 from the variable being in .bss.
Fixes: e6cdc54cc0 ("net/mlx5: add socket server for external tools")
Cc: stable@dpdk.org
Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
The MODIFY_FIELD RTE action rejects copy to/from metadata
in case of the legacy mode extensive flow metadata support.
It is not consistent with SET_META action that has no such
restriction imposed. Registers A or B are used for META in
legacy mode. Allow meta modifications in legacy mode as well.
On the other hand, SET_META rejects actions in case register C
is not available, even though it is not needed in legacy mode.
Skip this check for legacy mode and allow setting META.
Fixes: edf325d421 ("net/mlx5: check extended metadata for meta modification")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Register C is used for the metadata within NIC Rx domain.
And its width can vary from 0 to 32 bits depending on
its kernel usage. But it is not the case within NIC Tx domain,
register A is always 32 bits there. Fix metadata width detection
for the modify_field flow API within NIC Tx domain.
Fixes: 6d5735c1cb ("net/mlx5: fix meta register conversion for extensive mode")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
This patch fixes RETA updating for entries above 64.
Without that, these entries are never updated as
calculated mask value will always be 0.
Fixes: 634efbc2c8 ("mlx5: support RETA query and update")
Cc: stable@dpdk.org
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
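For context, DPDK's RETA API groups entries into 64-entry chunks (rte_eth_rss_reta_entry64), so entry i is gated by bit i % 64 of mask word i / 64; using the full entry index as the shift amount leaves entries above 63 with a zero mask. A minimal sketch with illustrative structures:

    #include <stdint.h>
    #include <stdio.h>

    #define RETA_GROUP_SIZE 64 /* entries covered by one 64-bit mask word */

    struct reta_group {
        uint64_t mask;
        uint16_t reta[RETA_GROUP_SIZE];
    };

    static void
    reta_update(const struct reta_group *groups, uint16_t *table,
                unsigned int size)
    {
        for (unsigned int i = 0; i < size; i++) {
            const struct reta_group *g = &groups[i / RETA_GROUP_SIZE];

            /* Shift by i % 64, never by the full index i. */
            if (g->mask & (UINT64_C(1) << (i % RETA_GROUP_SIZE)))
                table[i] = g->reta[i % RETA_GROUP_SIZE];
        }
    }

    int
    main(void)
    {
        struct reta_group groups[2] = { { 0 }, { 0 } };
        uint16_t table[128] = { 0 };

        groups[1].mask = UINT64_C(1) << 6; /* gate entry 70 = 64 + 6 */
        groups[1].reta[6] = 5;
        reta_update(groups, table, 128);
        printf("table[70] = %u\n", table[70]);
        return 0;
    }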
Integrity item validation and translation must verify that integrity
item bits match L3 and L4 items in flow rule pattern.
For cases when integrity item was positioned before L3 header, such
verification must be split into two stages.
The first stage detects integrity flow item and makes initializations
for the second stage.
The second stage is activated after PMD completes processing of all
flow items in rule pattern. PMD accumulates information about flow
items in flow pattern. When all pattern flow items were processed,
PMD can apply that data to complete integrity item validation
and translation.
Fixes: 79f8952783 ("net/mlx5: support integrity flow item")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
MLX5 PMD can match on integrity bits for inner and outer headers in
a single flow.
That means a single flow rule can reference both inner and outer
integrity bits. That is implemented by adding 2 flow integrity items
to a rule - one item for outer integrity bits and other for
inner integrity bits.
Integrity item `level` parameter specifies what part is being
targeted.
Current PMD treated integrity items for outer and inner headers as
the same.
The patch separates PMD verifications for inner and outer integrity
items.
Fixes: 79f8952783 ("net/mlx5: support integrity flow item")
Cc: stable@dpdk.org
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Multiple rules could use the same encap_decap/modify_hdr/counter
action, so the flow dump data could be duplicated.
To avoid redundancy, the flow dump value is based on the actions'
pointer instead of the previous rules' pointer.
For counters, the data is stored in the cmng of priv->sh.
For encap_decap/modify_hdr, the data is stored in
encaps_decaps/modify_cmds.
Traverse these fields and get each action's pointer and information.
The formats of the dumped information are the same, except that "id"
stands for the action's pointer:
Counter: rec_type,id,hits,bytes
Modify_hdr: rec_type,id,actions_number,actions
Encap_decap: rec_type,id,buf
Signed-off-by: Haifei Luo <haifeil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
During the device spawn process, the mlx5 PMD queried the available
flow priorities by calling mlx5_flow_discover_priorities, queried
whether the DR drop action was supported on the root table by calling
the mlx5_flow_discover_dr_action_support routine, and queried the
availability of metadata register C by calling
mlx5_flow_discover_mreg_c.
These functions created test flows to get the supported fields, and
at the end destroyed the test flows. The test flows in the first two
functions were created on the root table.
If the device was spawned with multiple representors, these test flows
were created and destroyed on each representor as well. The above
operations took a significant amount of init time during the device
spawn.
This patch optimizes the device discovery functions: if a device
with multiple representors (VF/SF) is being spawned, the priority,
drop action and metadata register support checks can be done only
once and the results shared between all representors.
Signed-off-by: Jiawei Wang <jiaweiw@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
In socket direct mode, it's possible to bind any two (maybe four
in the future) PCIe devices with IDs like xxxx:xx:xx.x and
yyyy:yy:yy.y. Bonding member interfaces no longer need to have
the same PCIe domain/bus/device ID.
The kernel driver uses "system_image_guid" to identify whether devices
can be bound together or not. Sysfs "phys_switch_id" is used to get
the "system_image_guid" of each network interface.
OFED 5.4+ is required to support "phys_switch_id".
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Remove direct access to the interrupt handle structure fields and use
the respective get/set APIs instead.
Change all the drivers to access the interrupt handle fields through
these APIs.
Signed-off-by: Harman Kalra <hkalra@marvell.com>
Acked-by: Hyong Youb Kim <hyonkim@cisco.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Tested-by: Raslan Darawsheh <rasland@nvidia.com>
Fix the mbuf offload flags namespace by adding an RTE_ prefix to the
name. The old flags remain usable, but a deprecation warning is issued
at compilation.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Somnath Kotur <somnath.kotur@broadcom.com>
The flags PKT_TX_VLAN_PKT and PKT_TX_QINQ_PKT are
marked as deprecated since commit 380a7aab1a ("mbuf: rename deprecated
VLAN flags") (2017). But they were not using the RTE_DEPRECATED
macro, because it did not exist at this time. Add it, and replace
usage of these flags.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Add 'RTE_ETH' namespace to all enums & macros in a backward compatible
way. The macros for backward compatibility can be removed in next LTS.
Also updated some struct names to have 'rte_eth' prefix.
All internal components switched to using new names.
Syntax fixed on lines that this patch touches.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Wisam Jaddo <wisamm@nvidia.com>
Acked-by: Rosen Xu <rosen.xu@intel.com>
Acked-by: Chenbo Xia <chenbo.xia@intel.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Acked-by: Somnath Kotur <somnath.kotur@broadcom.com>
Previously, we set txq affinity to 0 and let firmware
perform round-robin when bonding. Firmware uses a
global counter to assign txq affinity to different
physical ports according to the remainder after division.
There are three disadvantages:
1. The global counter is shared between kernel and DPDK.
2. After restarting the PMD or port, the previous counter value
is reused, so the new affinity is unpredictable.
3. There is no way to get what affinity is set by firmware.
In this update, we create several TISs, up to the
number of bonding ports, and bind each TIS to one PF port.
Each port starts to pick up a TIS using its port
index, so an upper layer application can quickly calculate each txq's
affinity without querying.
At the DPDK layer, when creating a txq with 2 bonding ports, the
affinity is set like:
port 0: 1-->2-->1-->2
port 1: 2-->1-->2-->1
port 2: 1-->2-->1-->2
Note: only applicable to the DevX API.
This affinity is subject to the HW hash.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
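A small sketch reproducing the affinity pattern above under the stated rule (the helper name and 1-based TIS numbering are illustrative): each port starts from its own index and round-robins over the TIS pool.

    #include <stdio.h>

    static unsigned int
    txq_tis_affinity(unsigned int port_idx, unsigned int txq_idx,
                     unsigned int num_bonding_ports)
    {
        return (port_idx + txq_idx) % num_bonding_ports + 1;
    }

    int
    main(void)
    {
        for (unsigned int p = 0; p < 3; p++) {
            printf("port %u:", p);
            for (unsigned int q = 0; q < 4; q++)
                printf(" %u", txq_tis_affinity(p, q, 2));
            printf("\n");
        }
        return 0;
    }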
MLX5 PMD exposes a socket for external tools to dump port state.
Socket events are listened for using an interrupt source of the EXT type.
The socket was closed and the interrupt callback was unregistered
at program exit, which is incorrect because DPDK could be already
shut down at this point. Move actions performed at program exit
to the moment the last MLX5 port is closed. The socket will be opened
again if later a new MLX5 device is plugged in and probed.
Also fix comments that were decisively talking
about secondary processes instead of external tools.
Fixes: e6cdc54cc0 ("net/mlx5: add socket server for external tools")
Cc: stable@dpdk.org
Reported-by: Harman Kalra <hkalra@marvell.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
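The lifecycle change in a reduced sketch; the counter and function names
are hypothetical stand-ins for the real socket code:

    /* Hypothetical stubs for the real open/close logic. */
    static void tools_socket_open(void) { /* open, register EXT intr */ }
    static void tools_socket_close(void) { /* unregister, close */ }

    static unsigned int mlx5_dev_refcnt; /* hypothetical device count */

    static void
    on_device_probe(void)
    {
        if (mlx5_dev_refcnt++ == 0)
            tools_socket_open(); /* first device: open the socket */
    }

    static void
    on_device_close(void)
    {
        if (--mlx5_dev_refcnt == 0)
            tools_socket_close(); /* last device closed: clean up now,
                                   * not at program exit */
    }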
mlx5_rxq_start() allocates rxq_ctrl->obj and frees it on failure,
but did not set it to NULL. Later, mlx5_rxq_release() could not recognize
that this object had already been freed and attempted to release its
resources, resulting in a crash:
Configuring Port 0 (socket 0)
mlx5_common: Failed to create RQ using DevX
mlx5_common: Can't create DevX RQ object.
mlx5_net: Port 0 Rx queue 0 RQ creation failure.
Segmentation fault
Set rxq_ctrl->obj to NULL after it is freed to skip the resource
release; the pattern is shown in the reduced sketch below.
Fixes: 1260a87b28 ("net/mlx5: share Rx control code")
Cc: stable@dpdk.org
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
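The fix pattern in a reduced form; the structure and function names are
illustrative, not the PMD's:

    #include <stdlib.h>

    struct rxq_ctrl { void *obj; }; /* reduced, illustrative */

    /* Error path in queue start: free the object and mark it freed. */
    static void
    rxq_start_error(struct rxq_ctrl *c)
    {
        free(c->obj);
        c->obj = NULL; /* the added line: release can now skip it */
    }

    /* Release path: recognizes an already-freed object. */
    static void
    rxq_release(struct rxq_ctrl *c)
    {
        if (c->obj == NULL)
            return; /* nothing to release */
        free(c->obj);
        c->obj = NULL;
    }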
The RSS configuration in a policy action container was a pointer
inside a union, and the pointer area could be used by another fate
action. In the current implementation, the RSS of the green color
is handled before that of the yellow color. There was a high
probability that the pointer was misinterpreted as the RSS, resulting
in an erroneous flow expansion when only the yellow color had the
RSS action.
The fate action type must also be checked to get rid of the
misjudgment; see the sketch below.
Fixes: b38a12272b ("net/mlx5: split meter color policy handling")
Cc: stable@dpdk.org
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
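The shape of the fix, sketched with hypothetical types: the union member
is only meaningful for the matching fate action type, so the type must be
checked before the pointer is read.

    #include <stddef.h>
    #include <stdint.h>

    enum fate_type { FATE_QUEUE, FATE_RSS }; /* illustrative */

    struct policy_action {
        enum fate_type type;
        union {
            uint16_t queue;  /* valid when type == FATE_QUEUE */
            const void *rss; /* valid when type == FATE_RSS */
        };
    };

    static const void *
    policy_action_rss(const struct policy_action *a)
    {
        /* was effectively: return a->rss; -- misreads the union for
         * non-RSS fate actions */
        return a->type == FATE_RSS ? a->rss : NULL;
    }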
The Verbs API doesn't support device port numbers larger than 255 by
design. To support more VF or SubFunction port representors, force a
DevX API check when the maximal Verbs device link port number is larger
than 255.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The Verbs API does not support Infiniband device port numbers larger
than 255 by design. To support more representors on a single Infiniband
device, the DevX API should be engaged.
While creating a Send Queue (SQ) object with the Verbs API, the PMD
assigned the IB device port attribute and the kernel created the default
miss flows in the FDB domain to redirect egress traffic from the queue
being created to the representor's appropriate peer (wire, HPF, VF or
SF).
With the DevX API there is no IB-device port attribute (it is merely a
kernel one, DevX operates in PRM terms) and the PMD must create the
default miss flows in the FDB explicitly. The PMD did not provide this,
and using the DevX API for E-Switch configurations was disabled.
The default miss FDB flow matches E-Switch manager vport (to make sure
the source is some representor) and SQn (Send Queue number - device
internal queue index). The root flow table is managed by kernel/firmware
and does not support the vport redirect action, so we have to split the
default miss flow into two:
- a flow with the lowest priority in the root table that matches the
E-Switch manager vport ID and jumps to group 1.
- a flow in group 1 that matches the E-Switch manager vport ID and SQn
and forwards the packet to the peer vport.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
When creating an internal transfer flow on the root table with the
lowest priority, the flow was created with the maximal priority
UINT32_MAX. This is wrong since the flow is created in the kernel, and
the maximal priority supported there is 16.
This patch fixes this by adding a check for internal flows; see the
sketch below.
Fixes: 5f8ae44dd4 ("net/mlx5: enlarge maximal flow priority")
Cc: stable@dpdk.org
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
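A sketch of the added check, assuming (as stated above) that the
kernel-managed root table supports only 16 priorities; all names are
illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define ROOT_TABLE_MAX_PRIO 16 /* kernel supports priorities 0..15 */

    static uint32_t
    internal_flow_priority(uint32_t requested, bool is_root_table)
    {
        /* was: internal lowest-priority flows used UINT32_MAX even on
         * the root table */
        if (is_root_table && requested >= ROOT_TABLE_MAX_PRIO)
            return ROOT_TABLE_MAX_PRIO - 1; /* lowest valid priority */
        return requested;
    }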
Extend the txq flow pattern to support both hairpin and regular txqs.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
For an egress packet on a representor, the vport ID in the transport
domain is the E-Switch manager vport ID, since the representor shares
the resources of the E-Switch manager. The E-Switch manager vport ID and
the Tx queue internal device index are used to match representor egress
packets.
This patch adds a flow item port ID match on the E-Switch manager.
The E-Switch manager vport ID is 0xfffe on BlueField, and 0 otherwise.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
To detect the number of supported Verbs flow priorities, the PMD tried
to create Verbs flows with different priorities, but Verbs is not
designed to support port numbers larger than 255.
When DevX is supported by the kernel driver, 16 Verbs priorities must be
supported, and there is no need to create the probe Verbs flows.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
The IB spec doesn't allow more than 255 ports on a single HCA; a port
number of 256 was cast to the u8 value 0, which is invalid for
ibv_query_port().
This patch invokes the Netlink API to query the port state when the port
number is greater than 255; the truncation is demonstrated below.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
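The truncation is easy to demonstrate with a standalone snippet (not PMD
code):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint32_t port = 256;
        uint8_t port_u8 = (uint8_t)port; /* truncates: 256 -> 0 */

        /* 0 is not a valid IB port number, so ibv_query_port() fails;
         * hence the switch to Netlink for ports above 255. */
        printf("port %u becomes %u as u8\n", port, port_u8);
        return 0;
    }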
Expand the use of mempool registration to MR management for other
drivers.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Since MR management has moved to the common area, there is no longer a
need for a per-driver DMA map and unmap function.
This patch shares those functions; for most drivers, these operations
are now supported for the first time. A usage sketch follows.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
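A minimal usage sketch, assuming a valid struct rte_device pointer and
externally allocated memory:

    #include <stddef.h>
    #include <stdint.h>
    #include <rte_dev.h>

    /* Map an external buffer for DMA by the device, then unmap it. */
    static int
    with_dma_mapping(struct rte_device *dev, void *addr, uint64_t iova,
                     size_t len)
    {
        int ret = rte_dev_dma_map(dev, addr, iova, len);

        if (ret != 0)
            return ret;
        /* ... use the memory for device I/O ... */
        return rte_dev_dma_unmap(dev, addr, iova, len);
    }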
Add global shared MR cache as a field of common device structure.
Move MR management to use this global cache for all drivers.
Signed-off-by: Michael Baum <michaelba@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>