Commit Graph

8106 Commits

Andy Pei
c20c5e9f88 vhost: get ready with first queue of block device
When booting from a virtio-blk device, SeaBIOS in QEMU only enables one queue.
To work in this scenario, the vDPA block device back-end configures the device
when the first queue is ready.

Signed-off-by: Andy Pei <andy.pei@intel.com>
Signed-off-by: Huang Wei <wei.huang@intel.com>
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2022-10-26 10:40:34 +02:00
Andy Pei
f92ab3f02c vhost: add type to vDPA device
Add a type field to rte_vdpa_device to store the device type.
Call the vDPA op get_dev_type to fill in the type when registering
a vDPA device.

Signed-off-by: Andy Pei <andy.pei@intel.com>
Reviewed-by: Chenbo Xia <chenbo.xia@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2022-10-26 10:40:34 +02:00
Chengwen Feng
c622735dd3 net/bonding: call Tx prepare before Tx burst
Normally, to use the HW offload capabilities (e.g. checksum and TSO) in
the Tx direction, the application needs to call rte_eth_tx_prepare() to
make some adjustments to the packets before sending them. But the
tx_prepare callback of the bonding driver is not implemented. Therefore,
the sent packets may have errors (e.g. checksum errors).

However, it is difficult to design a tx_prepare callback for the bonding
driver, because when a bonded device sends packets, it distributes the
packets to different slave devices based on the real-time link status and
bonding mode. That is, it is very difficult for the bonded device to
determine which slave device's prepare function should be invoked.

So in this patch, the tx_prepare callback of the bonding driver is still
not implemented. Instead, rte_eth_tx_prepare() is called before
rte_eth_tx_burst(). In this way, all Tx offloads can be processed
correctly for all NIC devices.

Note: because it is rare to bond different PMDs together, tx_prepare is
only called once in broadcast bonding mode.

Also, the following description was added to the rte_eth_tx_burst()
function documentation:
"@note This function must not modify mbufs (including packets data)
unless the refcnt is 1. The exception is the bonding PMD, which does not
have tx-prepare function, in this case, mbufs maybe modified."
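
For illustration, the per-slave transmit pattern described above could look
roughly like the sketch below (identifiers such as slave_port_id, queue_id,
bufs and nb_bufs are placeholders, not the bonding driver's actual code):

    uint16_t nb_prep, nb_sent;

    /* Let the slave's PMD adjust the packets (e.g. checksum pseudo-headers)
     * before they are handed to its Tx burst function. */
    nb_prep = rte_eth_tx_prepare(slave_port_id, queue_id, bufs, nb_bufs);
    /* Packets from index nb_prep onwards were rejected (rte_errno is set);
     * the caller may drop or fix them. */
    nb_sent = rte_eth_tx_burst(slave_port_id, queue_id, bufs, nb_prep);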

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
Acked-by: Chas Williams <3chas3@gmail.com>
2022-10-20 08:36:34 +02:00
Kalesh AP
eb0d471a89 ethdev: add proactive error handling mode
Some PMDs (e.g. hns3) can detect hardware or firmware errors. One error
recovery mode is to report the RTE_ETH_EVENT_INTR_RESET event and wait
for the application to invoke rte_eth_dev_reset() to recover the port.
However, this mode has the following weaknesses:

1) Due to different hardware and software designs, some NIC port recovery
processes require multiple handshakes with the firmware and the PF (when
the port is a VF). It takes a long time to complete the entire operation
for one port, and if multiple ports (for example, multiple VFs of a PF)
are reset at the same time, other VFs may fail to be reset, because the
reset processing is serial and the previous VFs must be processed before
the subsequent ones.

2) The impact on the application is significant: it has to stop the
working queues, stop calling the Rx and Tx functions, call
rte_eth_dev_reset(), and then set everything up again.

This patch introduces a proactive error handling mode in which the PMD
tries to recover from errors itself. In this process, the PMD sets the
data path pointers to dummy functions (which prevents crashes) and also
makes sure the control path operations fail with return code -EBUSY.

Because the PMD recovers automatically, the application only notices that
the data flow is disconnected for a while and that the control APIs
return an error during this period.

In order to sense the error happening/recovering, three events were
introduced:

1) RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD
detected an error and recovery is starting. Upon receiving the event, the
application should not invoke any control path APIs until it receives the
RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.

2) RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that recovery
from the error succeeded; the PMD has already re-configured the port, and
the effect is the same as that of a restart.

3) RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that recovery
from the error failed and the port is no longer usable. The application
should close the port. (An application-side callback handling these
events is sketched below.)
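
For illustration only, an application-side callback handling these events
might look like the following sketch (pause_data_path() and
resume_data_path() are hypothetical application helpers, not DPDK APIs):

    #include <rte_ethdev.h>

    static int
    port_event_cb(uint16_t port_id, enum rte_eth_event_type event,
                  void *cb_arg, void *ret_param)
    {
            RTE_SET_USED(cb_arg);
            RTE_SET_USED(ret_param);

            switch (event) {
            case RTE_ETH_EVENT_ERR_RECOVERING:
                    pause_data_path(port_id);   /* hypothetical app helper */
                    break;
            case RTE_ETH_EVENT_RECOVERY_SUCCESS:
                    resume_data_path(port_id);  /* hypothetical app helper */
                    break;
            case RTE_ETH_EVENT_RECOVERY_FAILED:
                    rte_eth_dev_close(port_id); /* port is unusable */
                    break;
            default:
                    break;
            }
            return 0;
    }

    /* registered once per event type, e.g.: */
    rte_eth_dev_callback_register(port_id, RTE_ETH_EVENT_ERR_RECOVERING,
                                  port_event_cb, NULL);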

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
2022-10-17 08:27:18 +02:00
Chengwen Feng
0d5c38bac7 ethdev: add error handling mode to device info
Currently, the defined error handling modes include:

1) NONE: no error handling mode is supported by this port.

2) PASSIVE: passive error handling; after the PMD detects that a reset
   is required, it reports the RTE_ETH_EVENT_INTR_RESET event, and the
   application invokes rte_eth_dev_reset() to recover the port (a sketch
   of checking the reported mode follows below).
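
A minimal sketch of checking the reported mode, assuming the new
rte_eth_dev_info field is named err_handle_mode as in this patch:

    struct rte_eth_dev_info dev_info;

    if (rte_eth_dev_info_get(port_id, &dev_info) == 0 &&
        dev_info.err_handle_mode == RTE_ETH_ERROR_HANDLE_MODE_PASSIVE) {
            /* expect RTE_ETH_EVENT_INTR_RESET and recover the port
             * with rte_eth_dev_reset() when it fires */
    }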

Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
2022-10-17 08:26:36 +02:00
Stephen Hemminger
04d59ab2cf rwlock: promote trylock operations as stable
These have been in since 19.02; time to take off the
experimental tag.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
2022-10-27 13:00:11 +02:00
Stephen Hemminger
44830cc082 log: promote rte_log_list_types as stable
This call was added in 21.05, so it is time to make it stable.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
2022-10-27 12:59:19 +02:00
Stephen Hemminger
23ce0afd37 eal: promote interruptible epoll wait as stable
This call was added in 20.11, so it is time to make it non-experimental.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Marchand <david.marchand@redhat.com>
2022-10-27 12:58:02 +02:00
Stephen Hemminger
d2e3c4b8a2 pcapng: record received RSS hash in pcap file
The pcapng standard has an option for recording the RSS hash with
packets. This implements it for all received packets.

There is a corner case that cannot be addressed with the current
DPDK APIs. With rte_flow and some hardware it is possible
to write a flow rule that uses another hash function such as XOR,
but there is no API that records this or provides the algorithm
info on a per-packet basis.

Wireshark recently merged support for displaying the recorded hash
option (for the yet-to-be-released version 4.1).

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Tested-by: Ben Magistro <koncept1@gmail.com>
2022-10-27 10:29:59 +02:00
Markus Theil
e6b42038e8 power: fix P-state number parsing
When converting atoi to strtol in a revision of the patch
introducing sysfs support for the turbo percentage,
a necessary check against the '\n' returned by sysfs
was not added.
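
A generic illustration of the missing check (not the actual patch): accept a
trailing newline from sysfs and reject any other trailing garbage.

    #include <stdlib.h>

    static int
    parse_sysfs_long(const char *buf, long *val)
    {
            char *end = NULL;

            *val = strtol(buf, &end, 10);
            /* sysfs values end with '\n'; anything else is an error */
            if (end == buf || (*end != '\0' && *end != '\n'))
                    return -1;
            return 0;
    }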

Fixes: de254dac60 ("power: read P-state turbo percentage from sysfs")

Signed-off-by: Markus Theil <markus.theil@secunet.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
2022-10-26 23:36:56 +02:00
Tadhg Kearney
b127e74cce power: fix open file descriptors leak
Close the file pointers to the Intel uncore sysfs files.

Coverity issue: 381400 381397
Fixes: 60b8a661a9 ("power: add Intel uncore frequency control")

Signed-off-by: Tadhg Kearney <tadhg.kearney@intel.com>
Reviewed-by: Reshma Pattan <reshma.pattan@intel.com>
2022-10-26 23:31:17 +02:00
Ali Alnubani
16de054160 lib: remove empty return types from doxygen comments
Recent versions of doxygen (1.9.4 and newer) complain about
documented return types for functions that don't return anything.

This patch removes these return types to fix build errors similar
to this one:
[..]
Generating doc/api/doxygen with a custom command
FAILED: doc/api/html
/usr/bin/python3 /path/to/doc/api/generate_doxygen.py doc/api/html
  /usr/bin/doxygen doc/api/doxy-api.conf
/root/dpdk/lib/eal/include/rte_bitmap.h:324: error: found documented
  return type for rte_bitmap_prefetch0 that does not return anything
  (warning treated as error, aborting now)
[..]

Tested with doxygen versions: 1.8.13, 1.8.17, 1.9.1, and 1.9.4.
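
As a hypothetical illustration of the change, a doc comment for a function
returning void simply loses its "@return" block:

    /**
     * Prefetch an object into cache.
     *
     * @param obj
     *   Pointer to the object to prefetch.
     */
    void my_prefetch(const void *obj);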

Signed-off-by: Ali Alnubani <alialnu@nvidia.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
2022-10-26 17:51:51 +02:00
Kumara Parameshwaran
72f51b097a gro: check payload length after trim
When a packet is padded with extra bytes, the validation of the payload
length should be done after the trim operation.

Fixes: b8a55871d5 ("gro: trim tail padding bytes")
Cc: stable@dpdk.org

Signed-off-by: Kumara Parameshwaran <kumaraparamesh92@gmail.com>
Acked-by: Jiayu Hu <jiayu.hu@intel.com>
2022-10-26 17:18:11 +02:00
Andrew Rybchenko
e6e62f6f55 mempool: flush cache completely on overflow
The cache was still full after flushing. In the opposite direction,
i.e. when getting objects from the cache, the cache is refilled to full
level when it crosses the low watermark (which happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

This bug degraded performance by causing too frequent flushing.

Consider this application scenario:

Either, an lcore thread in the application is in a state of balance,
where it uses the mempool cache within its flush/refill boundaries; in
this situation, the flush method is less important, and this fix is
irrelevant.

Or, an lcore thread in the application is out of balance (either
permanently or temporarily), and mostly gets or puts objects from/to the
mempool. If it mostly puts objects, not flushing all of the objects will
cause more frequent flushing. This is the scenario addressed by this
fix. E.g.:

Cache size=256, flushthresh=384 (1.5x size), initial len=256;
application burst len=32.

If there are "size" objects in the cache after flushing, the cache is
flushed at every 4th burst.

If the cache is flushed completely, the cache is only flushed at every
16th burst.

As you can see, this bug caused the cache to be flushed 4x too
frequently in this example.

And when/if the application thread breaks its pattern of continuously
putting objects, and suddenly starts to get objects instead, it will
either get objects already in the cache, or the get() function will
refill the cache.

The concept of not flushing the cache completely was probably based on
an assumption that it is more likely for an application's lcore thread
to get() after flushing than to put() after flushing.
I strongly disagree with this assumption! If an application thread is
continuously putting so much that it overflows the cache, it is much
more likely to keep putting than it is to start getting. If in doubt,
consider how CPU branch predictors work: When the application has done
something many times consecutively, the branch predictor will expect the
application to do the same again, rather than suddenly do something
else.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2022-10-26 12:10:33 +02:00
Morten Brørup
459531c958 mempool: fix cache flushing algorithm
Fix the rte_mempool_do_generic_put() cache flushing algorithm to
keep hot objects in the cache instead of cold ones.

The algorithm was:
 1. Add the objects to the cache.
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the backend.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
the cache full after flushing, which the above description reflects.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush some cached objects to the backend to
    free up required space.
 2. Add the objects to the cache.

With the old algorithm, the most recent (hot) objects were flushed,
leaving the oldest (cold) objects in the mempool cache. This degraded
performance, because flushing prevented immediate reuse of the (hot)
objects already in the CPU cache. Now, the existing (cold) objects in
the mempool cache are flushed before the new (hot) objects are added to
the mempool cache.
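
A simplified sketch of the new put logic (illustrative only; the real
rte_mempool_do_generic_put() also handles the uncached path and error cases):

    #include <rte_mempool.h>
    #include <rte_memcpy.h>

    static void
    cache_put_sketch(struct rte_mempool *mp, struct rte_mempool_cache *cache,
                     void * const *obj_table, unsigned int n)
    {
            if (cache->len + n > cache->flushthresh) {
                    /* flush all currently cached (cold) objects first */
                    rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
                    cache->len = 0;
            }
            /* then keep the newly put (hot) objects in the cache */
            rte_memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
            cache->len += n;
    }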

Since nearby code is touched anyway, fix the flush threshold comparison
to flush only if the threshold is really exceeded, not just reached;
i.e. it must be "len > flushthresh", not "len >= flushthresh".
Consider a flush multiplier of 1 instead of 1.5: the cache would already
be flushed when reaching "size" objects, not when exceeding "size"
objects. In other words, the cache would not be able to hold "size"
objects, which is clearly a bug. This could degrade performance
due to premature flushing.

Since we never exceed the flush threshold now, the cache size in the
mempool may be decreased from RTE_MEMPOOL_CACHE_MAX_SIZE * 3 to
RTE_MEMPOOL_CACHE_MAX_SIZE * 2. In fact it could be
CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE), but the flush
threshold multiplier is internal.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
2022-10-26 12:09:13 +02:00
Naga Harish K S V
75c5bfc320 eventdev/eth_tx: fix queue delete
To delete all the queues of an ethdev device associated with an adapter
instance, -1 can be passed as queue_id to the queue delete API (see the
usage sketch below).
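
For example (sketch; the adapter and ethdev IDs are placeholders):

    /* remove every queue of this ethdev from the Tx adapter in one call */
    int ret = rte_event_eth_tx_adapter_queue_del(adapter_id, eth_dev_id, -1);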

When only a subset of an ethdev device's queues is associated, the queue
delete logic exits without deleting some of the queues (the
higher-numbered associated ones) in the above scenario, because it does
not check the association status of all the queues.

This patch fixes the issue by checking the queue association status of
all the queues of the Ethernet device.

Fixes: 741b499e64 ("eventdev/eth_tx: fix queue delete logic")
Cc: stable@dpdk.org

Signed-off-by: Naga Harish K S V <s.v.naga.harish.k@intel.com>
2022-10-21 11:42:08 +02:00
Ganapati Kundapura
8f4ff7de39 eventdev/crypto: fix multi-process
A secondary process is not able to call the crypto adapter stats
get/reset APIs because the crypto adapter memzone is not accessible
from a secondary process.

Add a memzone lookup so that secondary processes can call the
crypto adapter APIs (stats_get etc.).
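
The general multi-process pattern this follows, for illustration only (the
memzone name and size below are placeholders, not the adapter's internals):

    const struct rte_memzone *mz;

    if (rte_eal_process_type() == RTE_PROC_PRIMARY)
            mz = rte_memzone_reserve("my_adapter_data", data_size,
                                     rte_socket_id(), 0);
    else
            mz = rte_memzone_lookup("my_adapter_data");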

Fixes: 7901eac340 ("eventdev: add crypto adapter implementation")
Cc: stable@dpdk.org

Signed-off-by: Ganapati Kundapura <ganapati.kundapura@intel.com>
Acked-by: Abhinandan Gujjar <abhinandan.gujjar@intel.com>
2022-10-21 11:42:08 +02:00
Pavan Nikhilesh
1bdfe4d76e eventdev: increase xstats ID width to 64 bits
Increase the xstats ID width from 32 to 64 bits. This also
fixes the xstats ID datatype discrepancy between reset and
the rest of the xstats family.

Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Reviewed-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
2022-10-21 11:42:08 +02:00
Mattias Rönnblom
ed88c5a5e4 eventdev/timer: support appropriately report idle
Update the Event Timer Adapter's service function to report as idle
(i.e., return -EAGAIN) in case no timer events were enqueued to the
event device.
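
The general shape of such a service function, as a sketch (the adapter state
and do_adapter_work() helper are hypothetical, not the adapter's internals);
the same pattern applies to the other adapter service functions below:

    #include <errno.h>

    static int32_t
    adapter_service_func(void *args)
    {
            struct my_adapter *a = args;           /* hypothetical state */
            unsigned int nb = do_adapter_work(a);  /* hypothetical helper */

            /* tell the service core this iteration did no useful work */
            return nb == 0 ? -EAGAIN : 0;
    }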

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Erik Gabriel Carrillo <erik.g.carrillo@intel.com>
2022-10-21 11:34:42 +02:00
Mattias Rönnblom
35d052356b eventdev/eth_tx: support appropriately report idle
Update the Event Ethernet Tx Adapter's service function to report as
idle (i.e., return -EAGAIN) in case no events were dequeued from the
event device and no Ethernet frames were sent out on the wire.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Reviewed-by: Naga Harish K S V <s.v.naga.harish.k@intel.com>
Acked-by: Jay Jayatheerthan <jay.jayatheerthan@intel.com>
2022-10-21 11:34:41 +02:00
Mattias Rönnblom
7f33abd49b eventdev/eth_rx: support appropriately report idle
Update the Event Ethernet Rx Adapter's service function to report as
idle (i.e., return -EAGAIN) in case no Ethernet frames were received
from the ethdev and no events were enqueued to the event device.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Reviewed-by: Naga Harish K S V <s.v.naga.harish.k@intel.com>
Acked-by: Jay Jayatheerthan <jay.jayatheerthan@intel.com>
2022-10-21 11:34:41 +02:00
Mattias Rönnblom
34d785571f eventdev/crypto: support appropriately report idle
Update the event crypto adapter's service function to report as idle
(i.e., return -EAGAIN) in case no crypto operations were performed.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Abhinandan Gujjar <abhinandan.gujjar@intel.com>
2022-10-21 11:34:41 +02:00
Erik Gabriel Carrillo
329280c53e service: fix early move to inactive status
Assume thread T2 is a service lcore that is in the middle of executing
a service function.  Also, assume thread T1 concurrently calls
rte_service_lcore_stop(), which will set the "service_active_on_lcore"
state to false.  If thread T1 then calls rte_service_may_be_active(),
it can return zero even though T2 is still running the service function.
If T1 then proceeds to free data being used by T2, a crash can ensue.

Move the logic that clears the "service_active_on_lcore" state from the
rte_service_lcore_stop() function to the service_runner_func() to
ensure that we:
- don't let the "service_active_on_lcore" state linger as 1
- don't clear the state early

Fixes: 6550113be6 ("service: fix lingering active status")
Cc: stable@dpdk.org

Signed-off-by: Erik Gabriel Carrillo <erik.g.carrillo@intel.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
2022-10-21 14:54:26 +02:00
Stephen Hemminger
8a0cf0c455 pdump: do not allow enable/disable in primary process
Attempts to enable or disable pdump in the primary process
will fail with a core dump because it is not valid to call
rte_mp_request_sync() except in a secondary process.

Trap the error in the common code used for both enable
and disable requests.

Fixes: 660098d61f ("pdump: use generic multi-process channel")
Cc: stable@dpdk.org

Reported-by: Sylvia Grundwürmer <sylvia.grundwuermer@b-plus.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2022-10-21 14:54:26 +02:00
David Marchand
eb870201b4 trace: remove limitation on directory
Remove the arbitrary 12-character limit on the file prefix used for the
directory where the traces are stored.
Simplify the code by relying on dynamic allocations.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
477cc313a2 trace: remove limitation on trace point name
The name of a trace point is provided as a constant string via the
RTE_TRACE_POINT_REGISTER macro.
We can rely on an explicit constant string in the binary and simply point
at it.
There is then no need for a (fixed size) copy.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
d4cbbee345 trace: fix metadata dump
The API does not describe that the metadata dump is conditional on
enabling any trace points.

While at it, merge dump unit tests into the generic trace_autotest to
enhance coverage.

Fixes: f6b2d65dcd ("trace: implement debug dump")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
782dbf1791 trace: fix race in debug dump
trace->nb_trace_mem_list access must be under trace->lock to avoid
races with threads allocating/freeing their trace buffers.

Fixes: f6b2d65dcd ("trace: implement debug dump")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
d6fd5a018e trace: fix dynamically enabling trace points
Enabling trace points at runtime was not working if no trace point had
been enabled first at rte_eal_init() time. The reason was that
trace.args reflected the arguments passed to --trace= EAL option.

To fix this:
- the trace subsystem initialisation is updated: trace directory
  creation is deferred to when traces are dumped (to avoid creating
  directories that may not be used),
- per lcore memory allocation still relies on rte_trace_is_enabled() but
  this helper now tracks if any trace point is enabled. The
  documentation is updated accordingly,
- cleanup helpers must always be called in rte_eal_cleanup() since some
  trace points might have been enabled and disabled in the lifetime of
  the DPDK application,

With this fix, we can update the unit test and check that a trace point
callback is invoked when expected.

Note:
- the 'trace' global variable might be shadowed with the argument
  passed to the functions dealing with trace point handles.
  'tp' has been used for referring to trace_point object.
  Prefer 't' for referring to handles,

Fixes: 84c4fae462 ("trace: implement operation APIs")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
3ee927d3e4 trace: rework loop on trace points
Directly skip the block when a trace point does not match the user
criteria.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
b980ced067 trace: fix leak with regexp
The precompiled buffer initialised in regcomp must be freed before
leaving rte_trace_regexp.

Fixes: 84c4fae462 ("trace: implement operation APIs")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
1559663872 trace: fix mode change
The API does not state that changing mode should be refused if no trace
point is enabled. Remove this limitation.

Fixes: 84c4fae462 ("trace: implement operation APIs")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
David Marchand
12b627bf77 trace: fix mode for new trace point
If an application registers trace points later than rte_eal_init(),
changes in the trace point mode were not applied.

Fixes: 84c4fae462 ("trace: implement operation APIs")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Sunil Kumar Kori <skori@marvell.com>
2022-10-20 13:34:19 +02:00
Nicolas Chautru
a53a025b45 bbdev: fix build with clang 3.4.2
Cast explicitly from enum to uint8_t to avoid a compilation
warning with clang 3.4.2:

  rte_bbdev.c:1179:13: error:
  comparison of constant 4 with expression
  of type 'enum rte_bbdev_enqueue_status' is always true
  [-Werror,-Wtautological-constant-out-of-range-compare]

Bugzilla ID: 1095
Fixes: 1be86f2e94 ("bbdev: add device status info")
Fixes: 4f08028c5e ("bbdev: expose queue related warning and status")

Signed-off-by: Nicolas Chautru <nicolas.chautru@intel.com>
Tested-by: Ali Alnubani <alialnu@nvidia.com>
2022-10-11 01:34:07 +02:00
Shiqi Liu
d914c01036 node: check Rx element allocation
Since malloc() can fail, not_checked and checked could be
NULL pointers.
Therefore, check them in order to avoid dereferencing a
NULL pointer.

Fixes: fa8054c8c8 ("examples/eventdev: add thread safe Tx worker pipeline")
Cc: stable@dpdk.org

Signed-off-by: Shiqi Liu <835703180@qq.com>
2022-10-10 17:53:12 +02:00
Zhirun Yan
afe67d1414 graph: fix node objects allocation
In __rte_node_enqueue_prologue(), if the number of objs is more than
node->size * 2, the extra objs will be written out of bounds.
__rte_node_stream_alloc_size() should be used to request enough memory.

And rte_node_next_stream_put() may re-allocate a small size when the
node's free space is small and the number of new objs is less than the
current node->size. Some obj pointers beyond the new size may then be
lost, causing a memory leak. It should request enough memory to contain
at least the original objs and the new objs.

Fixes: 40d4f51403 ("graph: implement fastpath routines")
Cc: stable@dpdk.org

Signed-off-by: Zhirun Yan <zhirun.yan@intel.com>
Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
2022-10-10 17:30:39 +02:00
Andrew Rybchenko
90cf759aaf mempool: avoid usage of term ring on put
The term "ring" is misleading since it is the default,
but still just one of the possible drivers to store objects.

Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
2022-10-10 17:24:22 +02:00
Andrew Rybchenko
e3f138aa91 mempool: check driver enqueue result in one place
An enqueue operation must not fail. Move the corresponding debug check
from one particular case to the driver enqueue operation helper in order
to do it for all invocations.

Log a critical message with useful information instead of calling
rte_panic().

Make the rte_mempool_do_generic_put() implementation more readable and
fix the inconsistency where the return value is not checked in one place
and checked in another.

Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
2022-10-10 17:17:48 +02:00
Bruce Richardson
66f624e4ea kni: add deprecation warning at runtime
When KNI is being used at runtime, output a warning message about its
deprecated status. This is part of the deprecation process for KNI
agreed by the DPDK technical board.[1]

[1] https://mails.dpdk.org/archives/dev/2022-June/243596.html

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
2022-10-10 17:04:09 +02:00
Bruce Richardson
bbaf917565 kni: flag deprecated status at build time
To ensure all users are aware of KNI's deprecated status at build time,
this library is marked as a deprecated library: the library is disabled
by default. It can be re-enabled by setting disabled_libs to the empty
string (or other string not including 'kni').

The dependent NIC driver, drivers/net/kni, is disabled accordingly as it
depends on the library.

NOTE: This is part of the deprecation process for KNI agreed by the DPDK
technical board.[1]

[1] https://mails.dpdk.org/archives/dev/2022-June/243596.html

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
2022-10-10 17:01:59 +02:00
Bruce Richardson
dfd5b25b57 build: introduce deprecated libraries
Add support for a list of deprecated libraries to the lib/meson.build
file. This will be used to mark libraries that are planned to be removed
from DPDK. The first user of this will be KNI in the next patch.

Deprecated libraries should still be tested in the CI, so update our
build testing and CI scripts.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
2022-10-10 17:01:56 +02:00
Dmitry Kozlyuk
03b3cdf9c2 mempool: make event callbacks process-private
Callbacks for mempool events were registered in a process-shared tailq.
This was inherently incorrect because the same function
may be loaded at a different address in each process.
Make the tailq process-private.
Use the EAL tailq lock to reduce the number of different locks
this module operates with.

Fixes: da2b9cb25e ("mempool: add event callbacks")
Cc: stable@dpdk.org

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
2022-10-10 16:38:03 +02:00
Tadhg Kearney
60b8a661a9 power: add Intel uncore frequency control
Add API to allow uncore frequency adjustment.

Uncore is a term used by Intel to describe the functions
of a microprocessor that are closely connected
to the core to achieve high performance.

This is done through manipulating related uncore frequency control
sysfs entries to adjust the minimum and maximum uncore frequency values
and works on Linux for Intel hardware.

Signed-off-by: Tadhg Kearney <tadhg.kearney@intel.com>
Reviewed-by: David Hunt <david.hunt@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
2022-10-10 14:53:40 +02:00
Leyi Rong
373b51ef02 member: fix build with GCC 5.4.0
This patch fixes the build failure by typecasting to match
_mm512_i32gather_epi64() definition.

Bugzilla ID: 1096
Fixes: db354bd2e1 ("member: add NitroSketch mode")

Signed-off-by: Leyi Rong <leyi.rong@intel.com>
Tested-by: Ali Alnubani <alialnu@nvidia.com>
2022-10-10 12:20:01 +02:00
Markus Theil
de254dac60 power: read P-state turbo percentage from sysfs
If DPDK applications are to be used with a minimal set of privileges,
using the msr kernel module on Linux should not be necessary.

Since at least kernel 4.4, the rdmsr call to obtain the last non-turbo
boost frequency can be left out if the sysfs interface is used.
Also RHEL 7 with recent kernel updates should include the sysfs interface
for this (I only looked this up for CentOS 7).

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Tested-by: David Hunt <david.hunt@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
2022-10-10 02:52:26 +02:00
Mário Kuka
0744f1c9f9 pcapng: fix write more packets than IOV_MAX limit
The rte_pcapng_write_packets() function fails when we try to write more
packets than the IOV_MAX limit: the writev() system call is limited by
IOV_MAX, and its iovcnt argument is only valid if it is greater than 0
and less than or equal to IOV_MAX as defined in <limits.h>.

To avoid this problem, we can check that all segments of the next
packet will fit into the iovec buffer, whose capacity will be limited
by the IOV_MAX limit. If not, we flush the current iovec buffer to the
file by calling writev() and, if successful, fit the current packet at
the beginning of the flushed iovec buffer.
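
A schematic version of that rule (not the library code itself): flush the
pending iovec array before it would exceed IOV_MAX.

    #include <limits.h>
    #include <sys/uio.h>

    static int
    flush_if_needed(int fd, struct iovec *iov, int *cnt, int next_pkt_segs)
    {
            if (*cnt + next_pkt_segs > IOV_MAX) {
                    if (writev(fd, iov, *cnt) < 0)
                            return -1;
                    *cnt = 0;       /* packet segments restart at iov[0] */
            }
            return 0;
    }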

Fixes: 8d23ce8f5e ("pcapng: add new library for writing pcapng files")
Cc: stable@dpdk.org

Signed-off-by: Mário Kuka <kuka@cesnet.cz>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
2022-10-10 02:42:36 +02:00
Stephen Hemminger
668958f3c1 eal: fix data race in multi-process support
If DPDK is built with the thread sanitizer, it reports a race
in the setting of the multi-process file descriptor. The fix is to
use atomic operations when updating mp_fd.
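
The kind of change this implies, as an illustration (not the exact patch):
access the shared descriptor only through atomic builtins.

    static int mp_fd = -1;

    static void
    set_mp_fd(int fd)
    {
            __atomic_store_n(&mp_fd, fd, __ATOMIC_RELAXED);
    }

    static int
    get_mp_fd(void)
    {
            return __atomic_load_n(&mp_fd, __ATOMIC_RELAXED);
    }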

Build:
$ meson -Db_sanitize=address build
$ ninja -C build

Simple example:
$ .build/app/dpdk-testpmd -l 1-3 --no-huge
EAL: Detected CPU lcores: 16
EAL: Detected NUMA nodes: 1
EAL: Static memory layout is selected, amount of reserved memory can be adjusted with -m or --socket-mem
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /run/user/1000/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
testpmd: No probed ethernet devices
testpmd: create a new mbuf pool <mb_pool_0>: n=163456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
EAL: Error - exiting with code: 1
  Cause: Creation of mbuf pool for socket 0 failed: Cannot allocate memory
==================
WARNING: ThreadSanitizer: data race (pid=87245)
  Write of size 4 at 0x558e04d8ff70 by main thread:
    #0 rte_mp_channel_cleanup <null> (dpdk-testpmd+0x1e7d30c)
    #1 rte_eal_cleanup <null> (dpdk-testpmd+0x1e85929)
    #2 rte_exit <null> (dpdk-testpmd+0x1e5bc0a)
    #3 mbuf_pool_create.cold <null> (dpdk-testpmd+0x274011)
    #4 main <null> (dpdk-testpmd+0x5cc15d)

  Previous read of size 4 at 0x558e04d8ff70 by thread T2:
    #0 mp_handle <null> (dpdk-testpmd+0x1e7c439)
    #1 ctrl_thread_init <null> (dpdk-testpmd+0x1e6ee1e)

  As if synchronized via sleep:
    #0 nanosleep libsanitizer/tsan/tsan_interceptors_posix.cpp:366
    #1 get_tsc_freq <null> (dpdk-testpmd+0x1e92ff9)
    #2 set_tsc_freq <null> (dpdk-testpmd+0x1e6f2fc)
    #3 rte_eal_timer_init <null> (dpdk-testpmd+0x1e931a4)
    #4 rte_eal_init.cold <null> (dpdk-testpmd+0x29e578)
    #5 main <null> (dpdk-testpmd+0x5cbc45)

  Location is global 'mp_fd' of size 4 at 0x558e04d8ff70 (dpdk-testpmd+0x000003122f70)

  Thread T2 'rte_mp_handle' (tid=87248, running) created by main thread at:
    #0 pthread_create libsanitizer/tsan/tsan_interceptors_posix.cpp:969
    #1 rte_ctrl_thread_create <null> (dpdk-testpmd+0x1e6efd0)
    #2 rte_mp_channel_init.cold <null> (dpdk-testpmd+0x29cb7c)
    #3 rte_eal_init <null> (dpdk-testpmd+0x1e8662e)
    #4 main <null> (dpdk-testpmd+0x5cbc45)

SUMMARY: ThreadSanitizer: data race (app/dpdk-testpmd+0x1e7d30c) in rte_mp_channel_cleanup
==================
ThreadSanitizer: reported 1 warnings

Fixes: bacaa27540 ("eal: add channel for multi-process communication")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
2022-10-10 01:58:31 +02:00
Leyi Rong
db354bd2e1 member: add NitroSketch mode
Sketching algorithms provide high-fidelity approximate measurements and
appear as a promising alternative to traditional approaches such as
packet sampling.

NitroSketch [1] is a software sketching framework that optimizes
performance, provides accuracy guarantees, and supports a variety of
sketches.

This commit adds a new data structure called sketch to the
membership library. This new data structure is an efficient
way to profile the traffic for heavy hitters. A min-heap
structure is also used to maintain the top-k flow keys.

[1] Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir
Braverman, Roy Friedman, Vyas Sekar, "NitroSketch: Robust and General
Sketch-based Monitoring in Software Switches", in ACM SIGCOMM 2019.
https://dl.acm.org/doi/pdf/10.1145/3341302.3342076

Signed-off-by: Alan Liu <zaoxingliu@gmail.com>
Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
Signed-off-by: Leyi Rong <leyi.rong@intel.com>
Tested-by: Yu Jiang <yux.jiang@intel.com>
2022-10-09 23:11:43 +02:00
Yuan Wang
605975b8b3 ethdev: introduce protocol-based buffer split
Currently, Rx buffer split supports length-based split. With the Rx queue
offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and an Rx packet segment
configured, the PMD is able to split the received packets into
multiple segments.

However, length-based buffer split is not suitable for NICs that do split
based on protocol headers. Given an arbitrarily variable length in an Rx
packet segment, it is almost impossible to pass a fixed protocol header to
the driver. Besides, the existence of tunneling means the composition of
a packet varies, which makes the situation even worse.

This patch extends the current buffer split to support protocol header
based buffer split. A new proto_hdr field is introduced in the reserved
field of the rte_eth_rxseg_split structure to specify the protocol header.
The proto_hdr field defines the split position of the packet: splitting
always happens after the protocol header defined in the Rx packet segment.
When the Rx queue offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and
the corresponding protocol header is configured, the driver will split the
ingress packets into multiple segments.

Examples of proto_hdr field definitions:
To split after ETH-IPV4-UDP, it should be defined as
proto_hdr = RTE_PTYPE_L2_ETHER | RTE_PTYPE_L3_IPV4_EXT_UNKNOWN |
            RTE_PTYPE_L4_UDP

For inner ETH-IPV4-UDP, it should be defined as
proto_hdr = RTE_PTYPE_TUNNEL_GRENAT | RTE_PTYPE_INNER_L2_ETHER |
            RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN | RTE_PTYPE_INNER_L4_UDP

If a protocol header repeats one already defined in a previous segment,
the repeated part should be omitted. For example, to split after ETH,
ETH-IPV4 and ETH-IPV4-UDP, the segments should be defined as
proto_hdr0 = RTE_PTYPE_L2_ETHER
proto_hdr1 = RTE_PTYPE_L3_IPV4_EXT_UNKNOWN
proto_hdr2 = RTE_PTYPE_L4_UDP

If protocol header split can be supported by a PMD, the
rte_eth_buffer_split_get_supported_hdr_ptypes function can
be used to obtain a list of these protocol headers.

For example, let's suppose we configured the Rx queue with the
following segments:
        seg0 - pool0, proto_hdr0=RTE_PTYPE_L2_ETHER | RTE_PTYPE_L3_IPV4,
               off0=2B
        seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B
        seg2 - pool2, proto_hdr2=0, off2=0B

A packet consisting of ETH_IPV4_UDP_PAYLOAD will be split as
follows:
        seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
        seg1 - udp header @ 128 in mbuf from pool1
        seg2 - payload @ 0 in mbuf from pool2

Buffer split can now be configured in two modes. Users can choose
length-based or protocol-header-based buffer split according to the NIC's
capability. For length-based buffer split, the mp, length and offset
fields in the Rx packet segment should be configured, while the proto_hdr
field must be 0. For protocol-header-based buffer split, the mp, offset
and proto_hdr fields in the Rx packet segment should be configured, while
the length field must be 0.
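
To make the protocol-based configuration concrete, a sketch matching the
example above might look as follows (simplified; the mempool pointers, port
ID and socket ID are placeholders, and the rest of the port setup is
omitted):

    union rte_eth_rxseg segs[3];
    struct rte_eth_rxconf rxconf = { 0 };
    int ret;

    memset(segs, 0, sizeof(segs));
    segs[0].split.mp = pool0;
    segs[0].split.offset = 2;
    segs[0].split.proto_hdr = RTE_PTYPE_L2_ETHER | RTE_PTYPE_L3_IPV4;
    segs[1].split.mp = pool1;
    segs[1].split.offset = 128;
    segs[1].split.proto_hdr = RTE_PTYPE_L4_UDP;
    segs[2].split.mp = pool2;   /* proto_hdr = 0: rest of the packet */

    rxconf.rx_seg = segs;
    rxconf.rx_nseg = 3;
    rxconf.offloads = RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT;
    ret = rte_eth_rx_queue_setup(port_id, 0, 512, socket_id, &rxconf, NULL);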

Note: when protocol header split is enabled, the NIC may receive packets
which do not match all the protocol headers within the Rx segments.
In that case, the NIC has two possible split behaviors according to the
matching result: one is exact match, the other is longest match.
The split result of the NIC must follow one of them.

Exact match means the NIC only does the split when a packet exactly
matches all the protocol headers in the segments; otherwise, the whole
packet is put into the last valid mempool. Longest match means the NIC
does the split until a packet mismatches a protocol header in the
segments; the rest is put into the last valid pool.

Pseudo-code for exact match:
FOR each seg in segs except last one
    IF proto_hdr is not matched THEN
        BREAK
    END IF
END FOR
IF loop was broken THEN
    put whole pkt in last seg
ELSE
    put protocol header in each seg
    put everything else in last seg
END IF

Pseudo-code for longest match:
FOR each seg in segs except last one
    IF proto_hdr is matched THEN
        put protocol header in seg
    ELSE
        BREAK
    END IF
END FOR
put everything else in last seg

Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
Signed-off-by: Xuan Ding <xuan.ding@intel.com>
Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
2022-10-09 16:41:27 +02:00
Yuan Wang
e4e6f4cbf9 ethdev: introduce protocol header API
Add a new ethdev API to retrieve the supported protocol headers
of a PMD, which helps to configure protocol-header-based buffer split.
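
Possible usage, assuming the prototype takes a port ID, an output array and
its size, and returns the number of supported header ptypes (negative on
error):

    uint32_t ptypes[16];
    int num;

    num = rte_eth_buffer_split_get_supported_hdr_ptypes(port_id, ptypes,
                                                         RTE_DIM(ptypes));
    if (num < 0)
            printf("no protocol-based buffer split support on port %u\n",
                   port_id);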

Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
Signed-off-by: Xuan Ding <xuan.ding@intel.com>
Signed-off-by: Wenxuan Wu <wenxuanx.wu@intel.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
2022-10-09 16:41:24 +02:00