A new generic shared actions API may be used to create shared
counter. There is no point to keep duplicate COUNT action specific
capability to create shared counters.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
In the experimental function rte_flow_shared_action_destroy()
introduced in DPDK 20.11, the errno ETOOMANYREFS was used.
This errno is not always available on Windows,
so it is preferred using EBUSY instead.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Tal Shnaiderman <talshn@nvidia.com>
Tested-by: Tal Shnaiderman <talshn@nvidia.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
This patch defines new RSS offload types for eCPRI. For eCPRI with
Message Type 0, the hash field is physical channel ID.
Signed-off-by: Simei Su <simei.su@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
The ethdev port id is 16 bits now. This patch fixes the data type
of the variable for 'pid', which changing from uint32_t to uint16_t.
RTE_MAX_ETHPORTS is the maximum number of ports, which customized by
the user. To avoid 16-bit unsigned integer overflow, the valid value
of RTE_MAX_ETHPORTS should be set from 0 to UINT16_MAX, and it is
safer to cut one more port from space.
So we use RTE_BUILD_BUG_ON() to ensure that RTE_MAX_ETHPORTS is less
to UINT16_MAX.
Fixes: 5b7ba31148 ("ethdev: add port ownership")
Cc: stable@dpdk.org
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Coverity flags that 'rx_conf' variable is used before
it's checked for NULL. This patch fixes this issue.
Coverity issue: 363570
Fixes: 4ff702b5df ("ethdev: introduce Rx buffer split")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
In a flow rule, attribute "transfer" means operation level
at which both traffic is matched and actions are conducted.
Add the very same attribute to shared action configuration.
If a driver needs to prepare HW resources in two different
ways, depending on the operation level, in order to set up
an action, then this new attribute will indicate the level.
Also, when handling a flow rule insertion, the driver will
be able to turn down a shared action if its level is unfit.
Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Andrey Vesnovaty <andreyv@nvidia.com>
Updated description of rte_eth_rx_burst() to reflect what drivers,
when using vector instructions, expect from nb_pkts.
Also discussed on the mailing list here:
http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35C61257@smartserver.smartshare.dk/
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
net/ixgbe driver is the only user of the struct rte_eth_l2_tunnel_conf.
Move it to the driver and use ixgbe_ prefix instead of rte_eth_.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
The legacy filter API, including rte_eth_dev_filter_supported() and
rte_eth_dev_filter_ctrl() is removed. Flow API should be used.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Instead of FDIR filters RTE flow API should be used.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Acked-by: Hyong Youb Kim <hyonkim@cisco.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Global filter configuration request was supported by net/i40e
driver only to configure GRE key length.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Instead of HASH filter RTE flow API should be used.
Preserve RTE_ETH_FILTER_HASH since it is used in drivers
internally in RTE flow API support.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Instead of TUNNEL filter RTE flow API should be used.
Move corresponding defines and helper structure to ethdev
driver interface since it is still used by drivers internally.
Preserve RTE_ETH_FILTER_TUNNEL because of usage in drivers.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Instead of N-tuple filter RTE flow API should be used.
Preserve struct rte_eth_ntuple_filter in ethdev API since
the structure and related defines are used in flow classify
library and a number of drivers.
Preserve RTE_ETH_FILTER_NTUPLE because of usage in drivers.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Instead of SYN filter RTE flow API should be used.
Move corresponding definitions to ethdev internal driver API
since it is used by drivers internally.
Preserve RTE_ETH_FILTER_SYN because of it as well.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
net/e1000 driver is the only user of the struct rte_eth_flex_filter
and helper defines. Move it to the driver and use igb_ prefix
instead of rte_eth_.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Instead of EtherType filter RTE flow API should be used.
Move corresponding definitions to ethdev internal driver API
since it is used by drivers internally.
Preserve RTE_ETH_FILTER_ETHERTYPE because of it as well.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
net/i40e driver is the only user of the enum rte_mac_filter_type.
Move the define to the driver and use i40e_ prefix instead of rte_.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Instead of MACVLAN filter RTE flow API should be used.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
The definitions of RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP
and RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP were inserted
before the last comment of Tx offloads.
It is moved in a better place,
with comments moved to be before the definition.
A group comment is added to better describe device capabilities.
Fixes: cac923cfea ("ethdev: support runtime queue setup")
Cc: stable@dpdk.org
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
When a flow API RSS rule is issued in testpmd, device RSS key is changed
unexpectedly, device RSS key is changed to the testpmd default RSS key.
Consider the following usage with testpmd:
1. first, startup testpmd:
testpmd> show port 0 rss-hash key
RSS functions: all ipv4-frag ipv4-other ipv6-frag ipv6-other ip
RSS key: 6D5A56DA255B0EC24167253D43A38FB0D0CA2BCBAE7B30B477CB2DA38030F
20C6A42B73BBEAC01FA
2. create a rss rule
testpmd> flow create 0 ingress pattern eth / ipv4 / udp / end \
actions rss types ipv4-udp end queues end / end
3. show rss-hash key
testpmd> show port 0 rss-hash key
RSS functions: all ipv4-udp udp
RSS key: 74657374706D6427732064656661756C74205253532068617368206B65792
C206F76657272696465
This is because testpmd always sends a key with the RSS rule,
if user provides a key as part of the rule that key is used, if user
doesn't provide a key, testpmd default key is sent to the PMDs, which is
causing device programmed RSS key to be changed.
There was a previous attempt to fix the same issue [1], but it has been
reverted back [2] because of the crash when 'key_len' is provided
without 'key'.
This patch follows the same approach with the initial fix [1] but also
addresses the crash.
After change, testpmd RSS key is 'NULL' by default, if user provides a
key as part of rule it is used, if not no key is sent to the PMDs at all
[1]
Commit a4391f8bae ("app/testpmd: set default RSS key as null")
[2]
Commit f3698c3d09 ("app/testpmd: revert setting default RSS")
Fixes: d0ad8648b1 ("app/testpmd: fix RSS flow action configuration")
Cc: stable@dpdk.org
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Previously, the Tx timestamp field and flag were registered in testpmd,
as described in mlx5 guide.
For consistency between Rx and Tx timestamps,
managing mbuf registrations inside the driver, as properly documented,
is a simpler expectation.
The only driver to support this feature (mlx5) is updated
as well as the testpmd application.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
The offload flag DEV_RX_OFFLOAD_TIMESTAMP had no documentation.
After switching to dynamic mbuf flag and field,
it becomes even more important to explicit the feature behaviour.
A doxygen comment for the timesync API was mentioning
the deprecated timestamp field, so it is also updated.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
Rename new rte_flow ops callbacks to emphasize relation to tunnel
offload API.
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Ray Kinsella <mdr@ashroe.eu>
As discussed on the dpdk-dev mailing list[1], we can make some easy
improvements in standardizing the naming of the various components in DPDK,
and their associated feature-enabled macros.
Following this patch, each library will have the name in format,
'librte_<name>.so', and the macro indicating that library is enabled in the
build will have the form 'RTE_LIB_<NAME>'.
Similarly, for libraries, the equivalent name formats and macros are:
'librte_<class>_<name>.so' and 'RTE_<CLASS>_<NAME>', where class is the
device type taken from the relevant driver subdirectory name, i.e. 'net',
'crypto' etc.
To avoid too many changes at once for end applications, the old macro names
will still be provided in the build in this release, but will be removed
subsequently.
[1] http://inbox.dpdk.org/dev/ef7c1a87-79ab-e405-4202-39b7ad6b0c71@solarflare.com/t/#u
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Luca Boccassi <bluca@debian.org>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Rosen Xu <rosen.xu@intel.com>
Since each version map file is contained in the subdirectory of the library
it refers to, there is no need to include the library name in the filename.
This makes things simpler in case of library renaming.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Luca Boccassi <bluca@debian.org>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Rosen Xu <rosen.xu@intel.com>
Prefix static function and variables with 'eth_dev'.
For some 'rte_' prefix dropped, and for others 'eth_dev' added.
This is useful to differentiate public and static function/variables.
The cleanup is good to for having consistent naming to help new
additions naming.
No functional change, only naming.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Gaetan Rivet <grive@u256.net>
Use ENODEV as the error code if specified port ID is invalid.
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Queue stats will be removed from basic stats to xstats.
It will be PMDs responsibility to fill queue stats based on number of
queues they have.
Until all PMDs implement the xstats, a temporary
'RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS' device flag created. PMDs switched
to the xstats should clear this flag to bypass the ethdev layer autofill
for queue stats.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Ray Kinsella <mdr@ashroe.eu>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Queue stats are stored in 'struct rte_eth_stats' as array and array size
is defined by 'RTE_ETHDEV_QUEUE_STAT_CNTRS' compile time flag.
As a result of technical board discussion, decided to remove the queue
statistics from 'struct rte_eth_stats' in the long term.
Instead PMDs should represent the queue statistics via xstats, this
gives more flexibility on the number of the queues supported.
Currently queue stats in the xstats are filled by ethdev layer, using
some basic stats, when queue stats removed from basic stats the
responsibility to fill the relevant xstats will be pushed to the PMDs.
During the switch period, temporary 'RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS'
device flag is created. Initially all PMDs using xstats set this flag.
The PMDs implemented queue stats in the xstats should clear the flag.
When all PMDs switch to the xstats for the queue stats, queue stats
related fields from 'struct rte_eth_stats' will be removed, as well as
'RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS' flag.
Later 'RTE_ETHDEV_QUEUE_STAT_CNTRS' compile time flag also can be
removed.
Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Acked-by: Xiao Wang <xiao.w.wang@intel.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Change eth_dev_stop_t return value from void to int.
Make eth_dev_stop_t implementations across all drivers to return
negative errno values if case of error conditions.
Signed-off-by: Ivan Ilchenko <ivan.ilchenko@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Change rte_eth_dev_stop() return value from void to int
and return negative errno values in case of error conditions.
Also update the usage of the function in ethdev according to
the new return type.
Signed-off-by: Ivan Ilchenko <ivan.ilchenko@oktetlabs.ru>
Signed-off-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
The API function rte_eth_dev_close() was returning void.
The return type is changed to int for notifying of errors.
If an error happens during a close operation,
the status of the port is undefined,
a maximum of resources having been freed.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Reviewed-by: Liron Himi <lironh@marvell.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
The function rte_eth_dev_release_port() is partially resetting
the struct rte_eth_dev. The drivers were completing this reset
with more pointers set to NULL in the close or remove operations.
More pointers are reset at ethdev level,
and some redundant assignments are removed from PMDs.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Haiyue Wang <haiyue.wang@intel.com>
Acked-by: Jeff Guo <jia.guo@intel.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
When closing a port, it is supposed to be already stopped,
and marked as such with "dev_started" state zeroed by the stop API.
Resetting "dev_started" before calling the driver close operation
was hiding the case of not properly stopped port being closed.
The flag "dev_started" is not changed anymore in "rte_eth_dev_close()".
In case the "dev_stop" function is called from "dev_close",
bypassing "rte_eth_dev_stop()" API,
the "dev_started" state must be explicitly reset in the PMD
in order to keep the same behaviour.
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
The DPDK datapath in the transmit direction is very flexible.
An application can build the multi-segment packet and manages
almost all data aspects - the memory pools where segments
are allocated from, the segment lengths, the memory attributes
like external buffers, registered for DMA, etc.
In the receiving direction, the datapath is much less flexible,
an application can only specify the memory pool to configure the
receiving queue and nothing more. In order to extend receiving
datapath capabilities it is proposed to add the way to provide
extended information how to split the packets being received.
The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
capabilities is introduced to present the way for PMD to report to
application about supporting Rx packet split to configurable
segments. Prior invoking the rte_eth_rx_queue_setup() routine
application should check RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
The following structure is introduced to specify the Rx packet
segment for RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload:
struct rte_eth_rxseg_split {
struct rte_mempool *mp; /* memory pools to allocate segment from */
uint16_t length; /* segment maximal data length,
configures "split point" */
uint16_t offset; /* data offset from beginning
of mbuf data buffer */
uint32_t reserved; /* reserved field */
};
The segment descriptions are added to the rte_eth_rxconf structure:
rx_seg - pointer the array of segment descriptions, each element
describes the memory pool, maximal data length, initial
data offset from the beginning of data buffer in mbuf.
This array allows to specify the different settings for
each segment in individual fashion.
rx_nseg - number of elements in the array
If the extended segment descriptions is provided with these new
fields the mp parameter of the rte_eth_rx_queue_setup must be
specified as NULL to avoid ambiguity.
There are two options to specify Rx buffer configuration:
- mp is not NULL, rrx_conf.rx_nseg is zero, it is compatible
configuration, follows existing implementation, provides
the single pool and no description for segment sizes
and offsets.
- mp is NULL, rx_conf.rx_seg is not NULL, rx_conf.rx_nseg is not
zero, it provides the extended configuration, individually for
each segment.
f the Rx queue is configured with new settings the packets being
received will be split into multiple segments pushed to the mbufs
with specified attributes. The PMD will split the received packets
into multiple segments according to the specification in the
description array.
For example, let's suppose we configured the Rx queue with the
following segments:
seg0 - pool0, len0=14B, off0=2
seg1 - pool1, len1=20B, off1=128B
seg2 - pool2, len2=20B, off2=0B
seg3 - pool3, len3=512B, off3=0B
The packet 46 bytes long will look like the following:
seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
seg1 - 20B long @ 128 in mbuf from pool1
seg2 - 12B long @ 0 in mbuf from pool2
The packet 1500 bytes long will look like the following:
seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
seg1 - 20B @ 128 in mbuf from pool1
seg2 - 20B @ 0 in mbuf from pool2
seg3 - 512B @ 0 in mbuf from pool3
seg4 - 512B @ 0 in mbuf from pool3
seg5 - 422B @ 0 in mbuf from pool3
The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
configured to support new buffer split feature (if rx_nseg
is greater than one).
The split limitations imposed by underlying PMD is reported
in the new introduced rte_eth_dev_info->rx_seg_capa field.
The new approach would allow splitting the ingress packets into
multiple parts pushed to the memory with different attributes.
For example, the packet headers can be pushed to the embedded
data buffers within mbufs and the application data into
the external buffers attached to mbufs allocated from the
different memory pools. The memory attributes for the split
parts may differ either - for example the application data
may be pushed into the external memory located on the dedicated
physical device, say GPU or NVMe. This would improve the DPDK
receiving datapath flexibility with preserving compatibility
with existing API.
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
rte_flow API provides the building blocks for vendor-agnostic flow
classification offloads. The rte_flow "patterns" and "actions"
primitives are fine-grained, thus enabling DPDK applications the
flexibility to offload network stacks and complex pipelines.
Applications wishing to offload tunneled traffic are required to use
the rte_flow primitives, such as group, meta, mark, tag, and others to
model their high-level objects. The hardware model design for
high-level software objects is not trivial. Furthermore, an optimal
design is often vendor-specific.
When hardware offloads tunneled traffic in multi-group logic,
partially offloaded packets may arrive to the application after they
were modified in hardware. In this case, the application may need to
restore the original packet headers. Consider the following sequence:
The application decaps a packet in one group and jumps to a second
group where it tries to match on a 5-tuple, that will miss and send
the packet to the application. In this case, the application does not
receive the original packet but a modified one. Also, in this case,
the application cannot match on the outer header fields, such as VXLAN
vni and 5-tuple.
There are several possible ways to use rte_flow "patterns" and
"actions" to resolve the issues above. For example:
1 Mapping headers to a hardware registers using the
rte_flow_action_mark/rte_flow_action_tag/rte_flow_set_meta objects.
2 Apply the decap only at the last offload stage after all the
"patterns" were matched and the packet will be fully offloaded.
Every approach has its pros and cons and is highly dependent on the
hardware vendor. For example, some hardware may have a limited number
of registers while other hardware could not support inner actions and
must decap before accessing inner headers.
The tunnel offload model resolves these issues. The model goals are:
1 Provide a unified application API to offload tunneled traffic that
is capable to match on outer headers after decap.
2 Allow the application to restore the outer header of partially
offloaded packets.
The tunnel offload model does not introduce new elements to the
existing RTE flow model and is implemented as a set of helper
functions.
For the application to work with the tunnel offload API it
has to adjust flow rules in multi-table tunnel offload in the
following way:
1 Remove explicit call to decap action and replace it with PMD actions
obtained from rte_flow_tunnel_decap_and_set() helper.
2 Add PMD items obtained from rte_flow_tunnel_match() helper to all
other rules in the tunnel offload sequence.
VXLAN Code example:
Assume application needs to do inner NAT on the VXLAN packet.
The first rule in group 0:
flow create <port id> ingress group 0
pattern eth / ipv4 / udp dst is 4789 / vxlan / end
actions {pmd actions} / jump group 3 / end
The first VXLAN packet that arrives matches the rule in group 0 and
jumps to group 3. In group 3 the packet will miss since there is no
flow to match and will be sent to the application. Application will
call rte_flow_get_restore_info() to get the packet outer header.
Application will insert a new rule in group 3 to match outer and inner
headers:
flow create <port id> ingress group 3
pattern {pmd items} / eth / ipv4 dst is 172.10.10.1 /
udp dst 4789 / vxlan vni is 10 /
ipv4 dst is 184.1.2.3 / end
actions set_ipv4_dst 186.1.1.1 / queue index 3 / end
Resulting of the rules will be that VXLAN packet with vni=10, outer
IPv4 dst=172.10.10.1 and inner IPv4 dst=184.1.2.3 will be received
decapped on queue 3 with IPv4 dst=186.1.1.1
Note: The packet in group 3 is considered decapped. All actions in
that group will be done on the header that was inner before decap. The
application may specify an outer header to be matched on. It's PMD
responsibility to translate these items to outer metadata.
API usage:
/**
* 1. Initiate RTE flow tunnel object
*/
const struct rte_flow_tunnel tunnel = {
.type = RTE_FLOW_ITEM_TYPE_VXLAN,
.tun_id = 10,
}
/**
* 2. Obtain PMD tunnel actions
*
* pmd_actions is an intermediate variable application uses to
* compile actions array
*/
struct rte_flow_action **pmd_actions;
rte_flow_tunnel_decap_and_set(&tunnel, &pmd_actions,
&num_pmd_actions, &error);
/**
* 3. offload the first rule
* matching on VXLAN traffic and jumps to group 3
* (implicitly decaps packet)
*/
app_actions = jump group 3
rule_items = app_items; /** eth / ipv4 / udp / vxlan */
rule_actions = { pmd_actions, app_actions };
attr.group = 0;
flow_1 = rte_flow_create(port_id, &attr,
rule_items, rule_actions, &error);
/**
* 4. after flow creation application does not need to keep the
* tunnel action resources.
*/
rte_flow_tunnel_action_release(port_id, pmd_actions,
num_pmd_actions);
/**
* 5. After partially offloaded packet miss because there was no
* matching rule handle miss on group 3
*/
struct rte_flow_restore_info info;
rte_flow_get_restore_info(port_id, mbuf, &info, &error);
/**
* 6. Offload NAT rule:
*/
app_items = { eth / ipv4 dst is 172.10.10.1 / udp dst 4789 /
vxlan vni is 10 / ipv4 dst is 184.1.2.3 }
app_actions = { set_ipv4_dst 186.1.1.1 / queue index 3 }
rte_flow_tunnel_match(&info.tunnel, &pmd_items,
&num_pmd_items, &error);
rule_items = {pmd_items, app_items};
rule_actions = app_actions;
attr.group = info.group_id;
flow_2 = rte_flow_create(port_id, &attr,
rule_items, rule_actions, &error);
/**
* 7. Release PMD items after rule creation
*/
rte_flow_tunnel_item_release(port_id,
pmd_items, num_pmd_items);
References
1. https://mails.dpdk.org/archives/dev/2020-June/index.html
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
RTE flow items & actions use positive values in item & action type.
Negative values are reserved for PMD private types. PMD
items & actions usually are not exposed to application and are not
used to create RTE flows.
The patch allows applications with access to PMD flow
items & actions ability to integrate RTE and PMD items & actions
and use them to create flow rule.
RTE flow item or action conversion library accepts positive known
element types with predefined sizes only. Private PMD items and
actions do not fit into this scheme because PMD type values are
negative, each PMD has it's own types numeration and element types and
their sizes are not visible at RTE level. To resolve these
limitations the patch proposes this solution:
1. PMD can expose elements of pointer size only. RTE flow
conversion functions will use pointer size for each configuration
object in private PMD element it processes;
2. RTE flow verification will not reject elements with negative type.
Signed-off-by: Gregory Etelson <getelson@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
This patch implements the change proposes in RFC [1], adding dedicated
fields to ETH and VLAN items structs, to clearly define the required
characteristic of a packet, and enable precise match criteria.
Documentation is updated accordingly.
[1] https://mails.dpdk.org/archives/dev/2020-August/177536.html
Signed-off-by: Dekel Peled <dekelp@nvidia.com>
Acked-by: Matan Azrad <matan@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Every hairpin queue pair should be configured properly and the
connection between Tx and Rx queues should be established, before
hairpin function works. In single port hairpin mode, the queues of
each pair belong to the same device. It is easy to get the hardware
and software information of each queue and configure the hairpin
connection with such information. In two ports hairpin mode, it is
not easy or inappropriate to access one queue's information from
another device.
Since hairpin is configured per queue pair, three new APIs are
introduced and they are internal for the PMD using.
The peer update API helps to pass one queue's information to the
peer queue and get the peer's information back for the next step.
The peer bind API configures the current queue with the peer's
information. For each hairpin queue pair, this API may need to be
called twice to configure the Tx, Rx queues separately.
The peer unbind API resets the current queue configuration and state
to disconnect it from the peer queue. Also, it may need to be called
twice to disconnect Tx, Rx queues from each other.
Some parameter of the above APIs might not be mandatory, and it
depends on the PMD implementation.
The structure of `rte_hairpin_peer_info` is only a declaration and
the actual members will be defined in each PMD when being used.
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
After hairpin queues are configured, in general, the application will
maintain the ports topology and even the queues configuration for
the hairpin. But sometimes it will not.
If there is no hot-plug, it is easy to bind and unbind hairpin among
all the ports. The application can just connect or disconnect the
hairpin egress ports to/from all the probed ingress ports. Then all
the connections could be handled properly.
But with hot-plug / hot-unplug, one port could be probed and removed
dynamically. With two ports hairpin, all the connections from and to
this port should be handled after start(bind) or before stop(unbind).
It is necessary to know the hairpin topology with this port.
This function will return the ports list with the actual peer ports
number after configuration. Either peer Rx or Tx ports will be
gotten with this function call.
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
To support two ports hairpin mode and keep the backward compatibility
for the application, two new attribute members of the hairpin queue
configuration structure will be added.
`tx_explicit` means if the application itself will insert the Tx part
flow rules. If not set, PMD will insert the rules implicitly.
`manual_bind` means if the hairpin Tx queue and peer Rx queue will be
bound automatically during the device start stage.
Different Tx and Rx queue pairs could have different values, but it
is highly recommended that all paired queues between one egress and
its peer ingress ports have the same values, in order not to bring
any chaos to the system. The actual support of these attribute
parameters will be checked and decided by the PMD drivers.
In the single port hairpin, if both are zero without any setting, the
behavior will remain the same as before. It means that no bind API
needs to be called and no Tx flow rules need to be inserted manually
by the application.
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
In single port hairpin mode, all the hairpin Tx and Rx queues belong
to the same device. After the queues are set up properly, there is
no other dependency between the Tx queue and its Rx peer queue. The
binding process that connected the Tx and Rx queues together from
hardware level will be done automatically during the device start
procedure. Everything required is configured and initialized already
for the binding process.
But in two ports hairpin mode, there will be some cross-dependences
between two different ports. Usually, the ports will be initialized
serially by the main thread but not in parallel. The earlier port
will not be able to enable the bind if the following peer port is
not yet configured with HW resources. What's more, if one port is
detached / attached dynamically, it would introduce more trouble
for the hairpin binding.
To overcome these, new APIs for binding and unbinding are added.
During startup, only the hairpin Tx and Rx peer queues will be set
up. Nothing will be done when starting the device if the queues are
without auto-bind attribute. Only after the required ports pair
started, the `rte_eth_hairpin_bind()` API can be called to bind the
all Tx queues of the egress port to the Rx queues of the peer port.
Then the connection between the egress and ingress ports pair will
be established.
The `rte_eth_hairpin_unbind()` API could be used to disconnect the
egress and the peer ingress ports. This should only be called before
the device is closed if needed. When doing the clean up, all the
egress and ingress pairs related to a single port should be taken
into consideration, especially in the hot unplug case.
mode is described.
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Applications handling fragmented IPv6 packets need to match on IPv6
fragment extension header, in order to identify the fragments order
and location in the packet.
This patch introduces the IPv6 fragment extension header item,
proposed in [1].
Relevant definitions are moved from lib/librte_ip_frag/rte_ip_frag.h
to lib/librte_net/rte_ip.h, as they are needed for IPv6 header handling.
struct ipv6_extension_fragment renamed to rte_ipv6_fragment_ext to
adapt it to the common naming convention.
Default mask is not defined, since all fields are optional.
[1] http://mails.dpdk.org/archives/dev/2020-March/160255.html
Signed-off-by: Dekel Peled <dekelp@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Using the current implementation of DPDK, an application cannot match on
IPv6 packets, based on the existing extension headers, in a simple way.
Field 'Next Header' in IPv6 header indicates type of the first extension
header only. Following extension headers can't be identified by
inspecting the IPv6 header.
As a result, the existence or absence of specific extension headers
can't be used for packet matching.
For example, fragmented IPv6 packets contain a dedicated extension header
(which is implemented in a later patch of this series).
Non-fragmented packets don't contain the fragment extension header.
For an application to match on non-fragmented IPv6 packets, the current
implementation doesn't provide a suitable solution.
Matching on the Next Header field is not sufficient, since additional
extension headers might be present in the same packet.
To match on fragmented IPv6 packets, the same difficulty exists.
This patch implements the update as detailed in RFC [1].
A set of additional values will be added to IPv6 header struct.
These values will indicate the existence of every defined extension
header type, providing simple means for identification of existing
extensions in the packet header.
Continuing the above example, fragmented packets can be identified using
the specific value indicating existence of fragment extension header.
To match on non-fragmented IPv6 packets, need to use has_frag_ext 0.
To match on fragmented IPv6 packets, need to use has_frag_ext 1.
To match on any IPv6 packets, the has_frag_ext field should
not be specified for match.
[1] https://mails.dpdk.org/archives/dev/2020-August/177257.html
Signed-off-by: Dekel Peled <dekelp@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
Call back functions are registered on the control plane. They
are accessed from the data plane. Hence, correct memory orderings
should be used to avoid race conditions.
Fixes: 4dc294158c ("ethdev: support optional Rx and Tx callbacks")
Fixes: c8231c63dd ("ethdev: insert Rx callback as head of list")
Cc: stable@dpdk.org
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
While registering the call back functions full write barrier
can be replaced with one-way write barrier.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Introduce extension of flow action API enabling sharing of single
rte_flow_action in multiple flows. The API intended for PMDs, where
multiple HW offloaded flows can reuse the same HW essence/object
representing flow action and modification of such an essence/object
affects all the rules using it.
Motivation and example
===
Adding or removing one or more queues to RSS used by multiple flow rules
imposes per rule toll for current DPDK flow API; the scenario requires
for each flow sharing cloned RSS action:
- call `rte_flow_destroy()`
- call `rte_flow_create()` with modified RSS action
API for sharing action and its in-place update benefits:
- reduce the overhead of multiple RSS flow rules reconfiguration
- optimize resource utilization by sharing action across multiple
flows
Change description
===
Shared action
===
In order to represent flow action shared by multiple flows new action
type RTE_FLOW_ACTION_TYPE_SHARED is introduced (see `enum
rte_flow_action_type`).
Actually the introduced API decouples action from any specific flow and
enables sharing of single action by its handle across multiple flows.
Shared action create/use/destroy
===
Shared action may be reused by some or none flow rules at any given
moment, i.e. shared action resides outside of the context of any flow.
Shared action represent HW resources/objects used for action offloading
implementation.
API for shared action create (see `rte_flow_shared_action_create()`):
- should allocate HW resources and make related initializations required
for shared action implementation.
- make necessary preparations to maintain shared access to
the action resources, configuration and state.
API for shared action destroy (see `rte_flow_shared_action_destroy()`)
should release HW resources and make related cleanups required for shared
action implementation.
In order to share some flow action reuse the handle of type
`struct rte_flow_shared_action` returned by
rte_flow_shared_action_create() as a `conf` field of
`struct rte_flow_action` (see "example" section).
If some shared action not used by any flow rule all resources allocated
by the shared action can be released by rte_flow_shared_action_destroy()
(see "example" section). The shared action handle passed as argument to
destroy API should not be used any further i.e. result of the usage is
undefined.
Shared action re-configuration
===
Shared action behavior defined by its configuration can be updated via
rte_flow_shared_action_update() (see "example" section). The shared
action update operation modifies HW related resources/objects allocated
on the action creation. The number of operations performed by the update
operation should not depend on the number of flows sharing the related
action. On return of shared action update API action behavior should be
according to updated configuration for all flows sharing the action.
Shared action query
===
Provide separate API to query shared action state (see
rte_flow_shared_action_update()). Taking a counter as an example: query
returns value aggregating all counter increments across all flow rules
sharing the counter. This API doesn't query shared action configuration
since it is controlled by rte_flow_shared_action_create() and
rte_flow_shared_action_update() APIs and no supposed to change by other
means.
example
===
struct rte_flow_action actions[2];
struct rte_flow_shared_action_conf conf;
struct rte_flow_action action;
/* skipped: initialize conf and action */
struct rte_flow_shared_action *handle =
rte_flow_shared_action_create(port_id, &conf, &action, &error);
actions[0].type = RTE_FLOW_ACTION_TYPE_SHARED;
actions[0].conf = handle;
actions[1].type = RTE_FLOW_ACTION_TYPE_END;
/* skipped: init attr0 & pattern0 args */
struct rte_flow *flow0 = rte_flow_create(port_id, &attr0, pattern0,
actions, error);
/* create more rules reusing shared action */
struct rte_flow *flow1 = rte_flow_create(port_id, &attr1, pattern1,
actions, error);
/* skipped: for flows 2 till N */
struct rte_flow *flowN = rte_flow_create(port_id, &attrN, patternN,
actions, error);
/* update shared action */
struct rte_flow_action updated_action;
/*
* skipped: initialize updated_action according to desired action
* configuration change
*/
rte_flow_shared_action_update(port_id, handle, &updated_action, error);
/*
* from now on all flows 1 till N will act according to configuration of
* updated_action
*/
/* skipped: destroy all flows 1 till N */
rte_flow_shared_action_destroy(port_id, handle, error);
Signed-off-by: Andrey Vesnovaty <andreyv@nvidia.com>
Acked-by: Ori Kam <orika@nvidia.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>