doc: add graph library guide
Adding programmer's guide for Graph library and the inbuilt nodes.
This patch also updates the release note for the new libraries.

Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
parent
3087ec1cf4
commit
4dc6d8e63c
@ -1476,6 +1476,7 @@ Graph - EXPERIMENTAL
|
||||
M: Jerin Jacob <jerinj@marvell.com>
|
||||
M: Kiran Kumar K <kirankumark@marvell.com>
|
||||
F: lib/librte_graph/
|
||||
F: doc/guides/prog_guide/graph_lib.rst
|
||||
F: app/test/test_graph*
|
||||
M: Nithin Dabilpuram <ndabilpuram@marvell.com>
|
||||
F: examples/l3fwd-graph/
|
||||
|
397
doc/guides/prog_guide/graph_lib.rst
Normal file
@ -0,0 +1,397 @@
|
||||
.. SPDX-License-Identifier: BSD-3-Clause
|
||||
Copyright(C) 2020 Marvell International Ltd.
|
||||
|
||||
Graph Library and Inbuilt Nodes
|
||||
===============================
|
||||
|
||||
Graph architecture abstracts the data processing functions as a ``node`` and
|
||||
``links`` them together to create a complex ``graph`` to enable reusable/modular
|
||||
data processing functions.
|
||||
|
||||
The graph library provides APIs for graph framework operations such as create,
lookup, dump, and destroy on graphs, and for node operations such as clone,
edge update, and edge shrink. The API also allows creating a stats cluster to
monitor per-graph and per-node statistics.
|
||||
|
||||
Features
|
||||
--------
|
||||
|
||||
Features of the Graph library are:
|
||||
|
||||
- Nodes as plugins.
|
||||
- Support for out of tree nodes.
|
||||
- Inbuilt nodes for packet processing.
|
||||
- Multi-process support.
|
||||
- Low overhead graph walk and node enqueue.
|
||||
- Low overhead statistics collection infrastructure.
|
||||
- Support to export the graph as a Graphviz dot file. See ``rte_graph_export()``.
|
||||
- Allow having another graph walk implementation in the future by segregating
  the fast path (``rte_graph_worker.h``) and slow path code.
|
||||
|
||||
Advantages of Graph architecture
|
||||
--------------------------------
|
||||
|
||||
- Memory latency is the enemy of high-speed packet processing. Moving similar
  packet processing code into a node reduces instruction cache and data cache
  misses.
- Exploits the probability that most packets will follow the same nodes in the
  graph.
- Allows SIMD instructions to be used for the packet processing of a node.
- The modular scheme allows having reusable nodes for the consumers.
- The modular scheme allows us to abstract the vendor HW specific
  optimizations as a node.
|
||||
|
||||
Performance tuning parameters
|
||||
-----------------------------
|
||||
|
||||
- Test with various burst size values (256, 128, 64, 32) using the
  ``CONFIG_RTE_GRAPH_BURST_SIZE`` config option.
  Testing shows that on x86 and arm64 servers the sweet spot is a burst size
  of 256, while on arm64 embedded SoCs it is either 64 or 128.
- Disable node statistics (using the ``CONFIG_RTE_LIBRTE_GRAPH_STATS`` config
  option) if not needed.
- Use the arm64 optimized memory copy on arm64 architecture by
  selecting ``CONFIG_RTE_ARCH_ARM64_MEMCPY``.
|
||||
|
||||
Programming model
|
||||
-----------------
|
||||
|
||||
Anatomy of Node:
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
.. _figure_anatomy_of_a_node:
|
||||
|
||||
.. figure:: img/anatomy_of_a_node.*
|
||||
|
||||
The :numref:`figure_anatomy_of_a_node` diagram depicts the anatomy of a node.
|
||||
|
||||
The node is the basic building block of the graph framework.
|
||||
|
||||
A node consists of:
|
||||
|
||||
process():
|
||||
^^^^^^^^^^
|
||||
|
||||
The callback function is invoked by the worker thread through the
``rte_graph_walk()`` function when there is data to be processed by the node.
A graph node processes the objects in ``process()`` and enqueues them to the
next downstream node using the ``rte_node_enqueue*()`` functions.
|
||||
|
||||
Context memory:
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
It is the memory allocated by the library to store the node-specific context
information. This memory is used by the ``process()``, ``init()``, and
``fini()`` callbacks.
|
||||
|
||||
init():
|
||||
^^^^^^^
|
||||
|
||||
The callback function is invoked by ``rte_graph_create()`` when
a node gets attached to a graph.
|
||||
|
||||
fini():
|
||||
^^^^^^^
|
||||
|
||||
The callback function is invoked by ``rte_graph_destroy()`` when a
node gets detached from a graph.
|
||||
|
||||
Node name:
|
||||
^^^^^^^^^^
|
||||
|
||||
It is the name of the node. When a node registers with the graph library, the
library assigns it an ID of type ``rte_node_t``. Either the ID or the name can
be used to look up the node. ``rte_node_from_name()`` and
``rte_node_id_to_name()`` are the node lookup functions.
|
||||
|
||||
nb_edges:
|
||||
^^^^^^^^^
|
||||
|
||||
The number of downstream nodes connected to this node. The ``next_nodes[]``
array stores the downstream node objects. The ``rte_node_edge_update()`` and
``rte_node_edge_shrink()`` functions shall be used to update the
``next_nodes[]`` objects. Consumers of the node APIs are free to update the
``next_nodes[]`` objects until ``rte_graph_create()`` is invoked.
|
||||
|
||||
next_nodes[]:
^^^^^^^^^^^^^

The dynamic array that stores the downstream nodes connected to this node. A
downstream node should not be the current node itself or a source node.
|
||||
|
||||
Source node:
|
||||
^^^^^^^^^^^^
|
||||
|
||||
Source nodes are static nodes created using ``RTE_NODE_REGISTER`` by passing
``flags`` as ``RTE_NODE_SOURCE_F``.
While performing the graph walk, the ``process()`` function of all the source
nodes is called first, so that these nodes can be used as input nodes for a
graph.
|
||||
|
||||
Node creation and registration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
* Node implementer creates the node by implementing ops and attributes of
  ``struct rte_node_register``.

* The library registers the node by invoking ``RTE_NODE_REGISTER`` on library
  load using the constructor scheme. The constructor scheme is used here to
  support multi-process.
|
||||
|
||||
Link the Nodes to create the graph topology
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
.. _figure_link_the_nodes:
|
||||
|
||||
.. figure:: img/link_the_nodes.*
|
||||
|
||||
The :numref:`figure_link_the_nodes` diagram shows a graph topology after
|
||||
linking the N nodes.
|
||||
|
||||
Once nodes are available to the program, the application or node public API
functions can link them together to create a complex packet processing graph.

There are several strategies to link the nodes.
|
||||
|
||||
Method (a):
|
||||
^^^^^^^^^^^
|
||||
Provide the ``next_nodes[]`` at the node registration time. See ``struct rte_node_register::nb_edges``.
|
||||
This is a use case to address the static node scheme where one knows upfront the
|
||||
``next_nodes[]`` of the node.
|
||||
|
||||
Method (b):
|
||||
^^^^^^^^^^^
|
||||
Use ``rte_node_edge_get()``, ``rte_node_edge_update()``, ``rte_node_edge_shrink()``
to update the ``next_nodes[]`` links of the node at runtime, but before the graph is created.
|
||||
|
||||
Method (c):
|
||||
^^^^^^^^^^^
|
||||
Use ``rte_node_clone()`` to clone an already existing node created using
``RTE_NODE_REGISTER``. When ``rte_node_clone()`` is invoked, the library clones
all the attributes of the node and creates a new one. The name of the cloned
node shall be ``"parent_node_name-user_provided_name"``.

This method enables the use case of Rx and Tx nodes, where several of those
nodes need to be cloned based on the number of CPUs available in the system.
The cloned nodes are identical except for the ``"context memory"``.
The context memory holds the port and queue pair information in the case of Rx
and Tx ethdev nodes.
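
The naming rule for cloned nodes can be illustrated with a few lines of plain
C. This is only a sketch of the ``"parent_node_name-user_provided_name"``
convention stated above: ``clone_name()`` is a hypothetical helper, not a
library function, and the example names are chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a DPDK function): build the name of a cloned
 * node as "<parent>-<suffix>". Returns 0 on success and -1 if the result
 * does not fit in the buffer.
 */
static int
clone_name(char *buf, size_t len, const char *parent, const char *suffix)
{
	int n;

	assert(buf != NULL && parent != NULL && suffix != NULL);

	n = snprintf(buf, len, "%s-%s", parent, suffix);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For instance, a parent named ``ethdev_rx`` and a user-provided name ``0-0``
(port 0, queue 0) produce ``ethdev_rx-0-0``, which is the form used by the Rx
node names later in this guide.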
|
||||
|
||||
Create the graph object
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Now that the nodes are linked, it is time to create a graph by including
the required nodes. The application can provide a set of node patterns to
form a graph object. The ``fnmatch()`` API is used underneath for the pattern
matching to include the required nodes. After the graph is created, changes to
the nodes or the graph are not allowed.
|
||||
|
||||
The ``rte_graph_create()`` API shall be used to create the graph.
|
||||
|
||||
Example of a graph object creation:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
    {"ethdev_rx-0-0", "ip4*", "ethdev_tx-*"}
|
||||
|
||||
In the above example, a graph object is created with the ethdev Rx
node of port 0 and queue 0, all nodes in the system whose names match ``ip4*``,
and the ethdev Tx nodes of all ports.
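
How such patterns select node names can be tried out with the POSIX
``fnmatch()`` function directly. The sketch below models only the name
selection step. ``node_selected()`` is a hypothetical helper, and the node
names used with it are taken from the inbuilt nodes described in this guide.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the node name matches at least one of
 * the shell-style patterns, 0 otherwise.
 */
static int
node_selected(const char *name, const char *const *patterns,
	      size_t nb_patterns)
{
	size_t i;

	assert(name != NULL && patterns != NULL);

	for (i = 0; i < nb_patterns; i++) {
		/* fnmatch() returns 0 when the name matches the pattern. */
		if (fnmatch(patterns[i], name, 0) == 0)
			return 1;
	}
	return 0;
}
```

With the three patterns from the example, ``ethdev_rx-0-0``, ``ip4_lookup``,
``ip4_rewrite``, and ``ethdev_tx-1`` are selected, while ``ethdev_rx-0-1``
and ``pkt_drop`` are not matched by any pattern.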
|
||||
|
||||
Multicore graph processing
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
In the current graph library implementation, the ``rte_graph_walk()`` and
``rte_node_enqueue*`` fast path API functions are designed to work on a single
core for better performance. The fast path API works on a graph object, so the
multi-core graph processing strategy is to create one graph object per worker.
|
||||
|
||||
In fast path
|
||||
~~~~~~~~~~~~
|
||||
Typical fast-path code looks like below, where the application
gets the fast-path graph object using ``rte_graph_lookup()``
on the worker thread and runs ``rte_graph_walk()`` in a tight loop.
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
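    /* Look up the fast-path graph object created for this worker. */
    struct rte_graph *graph = rte_graph_lookup("worker0");

    /* Walk the graph until the application asks the worker to stop. */
    while (!done) {
        rte_graph_walk(graph);
    }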
|
||||
|
||||
Context update when graph walk in action
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
The fast-path object for the node is ``struct rte_node``.
|
||||
|
||||
In the slow path, or while the graph walk is in progress, the user may need to
update the context of a node and therefore needs access to the
``struct rte_node *`` memory.

The ``rte_graph_foreach_node()``, ``rte_graph_node_get()``, and
``rte_graph_node_get_by_name()`` APIs can be used to get the
``struct rte_node *``. The ``rte_graph_foreach_node()`` iterator works on the
``struct rte_graph *`` fast-path graph object, while the others work on the graph ID or name.
|
||||
|
||||
Get the node statistics using graph cluster
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
The user may need to know the aggregate stats of a node across
multiple graph objects, especially when each graph object is bound
to a worker thread.

The library provides a graph cluster object for statistics. The
``rte_graph_cluster_stats_create()`` API shall be used to create a
graph cluster with multiple graph objects, and ``rte_graph_cluster_stats_get()``
to get the aggregate node statistics.

An example statistics output from ``rte_graph_cluster_stats_get()``:
|
||||
|
||||
.. code-block:: diff
|
||||
|
||||
    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
    |Node     |calls      |objs         |realloc_count  |objs/call  |objs/sec(10E6) |cycles/call|
    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
    |node0    |12977424   |3322220544   |5              |256.000    |3047.151872    |20.0000    |
    |node1    |12977653   |3322279168   |0              |256.000    |3047.210496    |17.0000    |
    |node2    |12977696   |3322290176   |0              |256.000    |3047.221504    |17.0000    |
    |node3    |12977734   |3322299904   |0              |256.000    |3047.231232    |17.0000    |
    |node4    |12977784   |3322312704   |1              |256.000    |3047.243776    |17.0000    |
    |node5    |12977825   |3322323200   |0              |256.000    |3047.254528    |17.0000    |
    +---------+-----------+-------------+---------------+-----------+---------------+-----------+
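
The idea of aggregating a node's counters across several per-worker graphs can
be modelled in plain C as follows. This is a sketch under stated assumptions,
not the library's implementation: ``struct node_stats``, ``aggregate()``, and
``objs_per_call()`` are hypothetical names, and ``objs/call`` is assumed here
to be the plain ratio of the summed counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-graph counters of one node (not a DPDK type). */
struct node_stats {
	uint64_t calls; /* number of process() invocations */
	uint64_t objs;  /* number of objects processed */
};

/* Sum the counters of the same node over all graphs in the cluster. */
static struct node_stats
aggregate(const struct node_stats *per_graph, size_t nb_graphs)
{
	struct node_stats total = {0, 0};
	size_t i;

	assert(per_graph != NULL || nb_graphs == 0);

	for (i = 0; i < nb_graphs; i++) {
		total.calls += per_graph[i].calls;
		total.objs += per_graph[i].objs;
	}
	return total;
}

/* Average number of objects handled per call; 0 when never called. */
static double
objs_per_call(const struct node_stats *s)
{
	return s->calls == 0 ? 0.0 : (double)s->objs / (double)s->calls;
}
```

As a consistency check against the ``node0`` row above, 3322220544 objects
over 12977424 calls give exactly 256 objects per call, which is a full burst
on every call.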
|
||||
|
||||
Node writing guidelines
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``process()`` function of a node is the fast-path function and needs
to be written carefully to achieve maximum performance.
|
||||
|
||||
Broadly speaking, there are two different types of nodes.
|
||||
|
||||
Static nodes
|
||||
~~~~~~~~~~~~
|
||||
The first kind of nodes are those that have a fixed ``next_nodes[]`` for the
complete burst (like ethdev_rx, ethdev_tx); they are simple to write.
The ``process()`` function can move the object burst to the next node either
using ``rte_node_next_stream_move()`` or using ``rte_node_next_stream_get()``
and ``rte_node_next_stream_put()``.
|
||||
|
||||
Intermediate nodes
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
The second kind of nodes are ``intermediate nodes``, which decide on a
per-packet basis which ``next_nodes[]`` entry each packet is sent to. In these nodes,

* Firstly, there has to be the best possible packet processing logic.

* Secondly, each packet needs to be queued to its next node.

This can be done using the ``rte_node_enqueue_[x1|x2|x4]()`` APIs if the
packets go to a single next node, or ``rte_node_enqueue_next()``, which takes
an array of next nodes.

In a scenario where multiple intermediate nodes are present but, most of the
time, each node uses the same next node for all its packets, the cost of moving
every pointer from the current node's stream to the next node's stream can be
avoided. This is called a home run, and ``rte_node_next_stream_move()`` can be
used to move the stream from the current node to the next node with the least
number of cycles. Since the copy can be avoided only when all the packets are
destined to the same next node, the node implementation should also handle the
worst case, where every packet could be going to a different next node.
|
||||
|
||||
Example of intermediate node implementation with home run:
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
1. Start with the speculation that next_node = node->ctx.
   This could be the next_node the application used in the previous function call of this node.

2. Get the next_node stream array with the required space using
   ``rte_node_next_stream_get(next_node, space)``.

3. While n_left_from > 0 (i.e. packets are left to be sent), prefetch the next
   pkt_set and process the current pkt_set to find their next node.

4. If all the next nodes of the current pkt_set match the speculated next node,
   just count them as successfully speculated (``last_spec``) so far and
   continue the loop without actually moving them to the next node. Otherwise,
   if there is a mismatch, copy all the pkt_set pointers that were counted in
   ``last_spec`` and move the current pkt_set to their respective next nodes
   using ``rte_node_enqueue_x1()``. Also, one of the next nodes can be made the
   speculated next_node if it is more probable. Finally, reset ``last_spec`` to
   zero.

5. If n_left_from != 0, go to step 3 to process the remaining packets.

6. If last_spec == nb_objs, all the objects passed were successfully speculated
   to a single next node, so the current stream can be moved to the next node
   using ``rte_node_next_stream_move(node, next_node)``.
   This is the ``home run``, where the memcpy of buffer pointers to the next node is avoided.

7. Update ``node->ctx`` with the more probable next node.
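
The steps above can be sketched as a self-contained C model. It does not use
the graph library: ``struct stream``, ``struct model_node``, ``enqueue_x1()``,
``classify_by_value()``, and ``model_process()`` are hypothetical stand-ins,
objects are handled one at a time instead of in prefetched sets, and the
"more probable" next node is assumed to be the edge that received the most
objects in the current call.

```c
#include <assert.h>
#include <stdint.h>

#define NB_EDGES 3 /* edges of the modelled node (illustrative) */
#define BURST    8 /* maximum objects per call (illustrative) */

struct stream {
	void *store[BURST]; /* backing storage for per-object enqueue */
	void **objs;        /* current stream: store, or a moved-in burst */
	uint16_t n;         /* number of objects in the stream */
};

struct model_node {
	uint16_t spec_next;           /* speculated edge, models node->ctx */
	struct stream next[NB_EDGES]; /* one pending stream per edge */
	unsigned int home_runs;       /* bursts forwarded without copying */
};

typedef uint16_t (*classify_fn)(void *obj);

/* Example classifier: each object points to the edge it should take. */
static uint16_t
classify_by_value(void *obj)
{
	return *(uint16_t *)obj;
}

/* Copy a single object pointer into the stream of the given edge. */
static void
enqueue_x1(struct model_node *node, uint16_t edge, void *obj)
{
	struct stream *s;

	assert(edge < NB_EDGES);
	s = &node->next[edge];
	assert(s->n < BURST);
	s->objs = s->store;
	s->objs[s->n++] = obj;
}

static uint16_t
model_process(struct model_node *node, void **objs, uint16_t nb_objs,
	      classify_fn classify)
{
	uint16_t count[NB_EDGES] = {0};
	uint16_t spec = node->spec_next;        /* step 1 */
	uint16_t last_spec = 0, from = 0, best = 0, i, j;

	assert(nb_objs <= BURST && spec < NB_EDGES);
	for (i = 0; i < NB_EDGES; i++)
		node->next[i].n = 0;

	for (i = 0; i < nb_objs; i++) {         /* steps 3 and 5 */
		uint16_t next = classify(objs[i]);

		assert(next < NB_EDGES);
		count[next]++;
		if (next == spec) {             /* step 4: speculation holds */
			last_spec++;
			continue;
		}
		/* Step 4: mismatch, copy the speculated run, then this one. */
		for (j = from; j < from + last_spec; j++)
			enqueue_x1(node, spec, objs[j]);
		enqueue_x1(node, next, objs[i]);
		from = i + 1;
		last_spec = 0;
	}

	if (nb_objs > 0 && last_spec == nb_objs) {
		/* Step 6: home run, hand over the burst without copying. */
		node->next[spec].objs = objs;
		node->next[spec].n = nb_objs;
		node->home_runs++;
	} else {
		/* Copy the trailing run that was still only counted. */
		for (j = from; j < from + last_spec; j++)
			enqueue_x1(node, spec, objs[j]);
	}

	/* Step 7: remember the most used edge for the next call. */
	for (i = 1; i < NB_EDGES; i++)
		if (count[i] > count[best])
			best = i;
	if (nb_objs > 0)
		node->spec_next = best;
	return nb_objs;
}
```

In this model, a first burst whose objects mostly go to edge 1 is copied
object by object because the initial speculation (edge 0) is wrong, and the
speculation is then updated to edge 1. A following burst that goes entirely to
edge 1 is handed over as a whole, which is the home run case.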
|
||||
|
||||
Graph object memory layout
|
||||
--------------------------
|
||||
.. _figure_graph_mem_layout:
|
||||
|
||||
.. figure:: img/graph_mem_layout.*
|
||||
|
||||
The :numref:`figure_graph_mem_layout` diagram shows ``rte_graph`` object memory
|
||||
layout. Understanding the memory layout helps to debug the graph library and
|
||||
improve the performance if needed.
|
||||
|
||||
The graph object consists of a header, a circular buffer to store the pending
streams when walking over the graph, and variable-length memory to store
the ``rte_node`` objects.

The ``graph_nodes_mem_create()`` function creates and populates this memory. Functions
such as ``rte_graph_walk()`` and ``rte_node_enqueue_*`` use this memory
to enable fast-path services.
|
||||
|
||||
Inbuilt Nodes
|
||||
-------------
|
||||
|
||||
DPDK provides a set of nodes for data processing. The following sections
describe them.
|
||||
|
||||
ethdev_rx
|
||||
~~~~~~~~~
|
||||
This node does ``rte_eth_rx_burst()`` into the stream buffer passed to it
(src node stream) and does ``rte_node_next_stream_move()`` only when
there are packets received. Each ``rte_node`` works on only one Rx port and
queue, which it gets from ``node->ctx``. For each (port X, rx_queue Y),
a ``rte_node`` is cloned from ethdev_rx_base_node as ``ethdev_rx-X-Y`` in
``rte_node_eth_config()``, along with updating ``node->ctx``.
Each graph needs to be associated with a unique ``rte_node`` for a (port, rx_queue).
|
||||
|
||||
ethdev_tx
|
||||
~~~~~~~~~
|
||||
This node does ``rte_eth_tx_burst()`` for a burst of objects received by it.
It sends the burst to a fixed Tx port and queue, taken from
``node->ctx``. For each (port X), this ``rte_node`` is cloned from
ethdev_tx_node_base as ``ethdev_tx-X`` in ``rte_node_eth_config()``,
along with updating ``node->ctx``.

Since each graph doesn't need more than one Txq per port, a Txq is assigned
based on the graph ID to each ``rte_node`` instance. Each graph needs to be associated
with a ``rte_node`` for each port.
|
||||
|
||||
pkt_drop
|
||||
~~~~~~~~
|
||||
This node frees all the objects passed to it considering them as
|
||||
``rte_mbufs`` that need to be freed.
|
||||
|
||||
ip4_lookup
|
||||
~~~~~~~~~~
|
||||
This node is an intermediate node that does an LPM lookup for the received
IPv4 packets, and the result determines each packet's next node.

On a successful LPM lookup, the result contains the ``next_node`` ID and the
``next-hop`` ID with which the packet needs to be further processed.

On LPM lookup failure, objects are redirected to the pkt_drop node.
``rte_node_ip4_route_add()`` is the control path API to add IPv4 routes.
To achieve a home run, the node uses ``rte_node_next_stream_move()`` as mentioned in the above
sections.
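
One way a single lookup result can carry both identifiers is to pack them
into one integer. The sketch below is an assumption made for illustration and
is not taken from the node's source: the bit layout (next-hop ID in the low
16 bits, ``next_node`` ID above it), the 24-bit result width, and the helper
names are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed width of a lookup result; chosen for illustration only. */
#define RESULT_BITS 24u

/* Pack a next_node ID and a next-hop ID into one lookup result. */
static uint32_t
pack_result(uint16_t next_node, uint16_t next_hop)
{
	/* The next_node ID must fit in the bits above the next-hop ID. */
	assert(next_node < (1u << (RESULT_BITS - 16)));
	return ((uint32_t)next_node << 16) | next_hop;
}

/* Extract the next_node ID from a packed result. */
static uint16_t
result_next_node(uint32_t result)
{
	return (uint16_t)(result >> 16);
}

/* Extract the next-hop ID from a packed result. */
static uint16_t
result_next_hop(uint32_t result)
{
	return (uint16_t)(result & 0xffffu);
}
```

With such a layout, the lookup node can read the next node for the enqueue
decision and pass the next-hop ID along with the packet for later processing.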
|
||||
|
||||
ip4_rewrite
|
||||
~~~~~~~~~~~
|
||||
This node gets packets from the ``ip4_lookup`` node, with the next-hop ID of each
packet embedded in ``node_mbuf_priv1(mbuf)->nh``. This ID is used
to determine the L2 header to be written to the packet before sending
the packet out to a particular ethdev_tx node.
``rte_node_ip4_rewrite_add()`` is the control path API to add next-hop info.
|
||||
|
||||
null
|
||||
~~~~
|
||||
This node ignores the set of objects passed to it and reports that all are
|
||||
processed.
|
||||
|
1078
doc/guides/prog_guide/img/anatomy_of_a_node.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 104 KiB |
702
doc/guides/prog_guide/img/graph_mem_layout.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 279 KiB |
3330
doc/guides/prog_guide/img/link_the_nodes.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 173 KiB |
@ -58,6 +58,7 @@ Programmer's Guide
|
||||
metrics_lib
|
||||
bpf_lib
|
||||
ipsec_lib
|
||||
graph_lib
|
||||
source_org
|
||||
dev_kit_build_system
|
||||
dev_kit_root_make_help
|
||||
|
@ -212,6 +212,30 @@ New Features
|
||||
* Added IPsec inbound load-distribution support for ipsec-secgw application
|
||||
using NIC load distribution feature(Flow Director).
|
||||
|
||||
* **Added rte_graph library.**
|
||||
|
||||
  Graph architecture abstracts the data processing functions as a ``node`` and
  ``links`` them together to create a complex ``graph`` to enable reusable/modular
  data processing functions. The graph library provides APIs for graph
  framework operations such as create, lookup, dump, and destroy on graphs, and for node
  operations such as clone, edge update, and edge shrink.
  The API also allows creating a stats cluster to monitor per-graph and per-node statistics.
|
||||
|
||||
* **Added rte_node library which consists of a set of packet processing nodes.**
|
||||
|
||||
  The rte_node library consists of nodes used by the rte_graph library. Each
  node performs a specific packet processing function based on application
  configuration. The following nodes are added:

  * Null node: Skeleton node that defines the general structure of a node.
  * Ethernet device node: Consists of Ethernet Rx/Tx nodes as well as Ethernet
    control APIs.
  * IPv4 lookup node: Consists of the IPv4 extract and LPM lookup node. Routes can
    be configured by the application through the ``rte_node_ip4_route_add`` function.
  * IPv4 rewrite node: Consists of IPv4 and Ethernet header rewrite functionality
    that can be configured through the ``rte_node_ip4_rewrite_add`` function.
  * Packet drop node: Frees the packets received to their respective mempool.
|
||||
|
||||
|
||||
Removed Items
|
||||
-------------
|
||||
|