.. SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

.. _kni:

Kernel NIC Interface
====================

.. note::

   KNI is deprecated and will be removed in a future release.
   See :doc:`../rel_notes/deprecation`.

   The :ref:`virtio_user_as_exception_path` alternative is the preferred way
   to interface with the Linux network stack,
   as it is an in-kernel solution and has similar performance expectations.

.. note::

   KNI is disabled by default in the DPDK build.
   To re-enable the library, remove 'kni' from the "disable_libs" meson option when configuring a build.

The DPDK Kernel NIC Interface (KNI) allows userspace applications to access the Linux* control plane.

KNI provides an interface with the kernel network stack
and allows management of DPDK ports using standard Linux net tools
such as ``ethtool``, ``iproute2`` and ``tcpdump``.

The main use case of KNI is to send exception packets to, and receive them from, the Linux network stack,
while the main datapath I/O bypasses the networking stack.

There are other alternatives to KNI, all of which are available in upstream Linux:

#. :ref:`virtio_user_as_exception_path`

#. :doc:`../nics/tap` as a wrapper to `Linux tun/tap
   <https://www.kernel.org/doc/Documentation/networking/tuntap.txt>`_

The benefit of using KNI over the alternatives is:

* Faster than existing Linux TUN/TAP interfaces
  (by eliminating system calls and ``copy_to_user()``/``copy_from_user()`` operations).

The disadvantages of KNI are:

* It is an out-of-tree Linux kernel module,
  which makes updating and distributing the driver more difficult.
  Most users end up building the KNI driver from source,
  which requires the packages and tools to build kernel modules.

* As it shares memory between userspace and kernel space,
  and the kernel part directly uses input provided by userspace, it is not safe.
  This makes it hard to upstream the module.

* Requires dedicated kernel cores.

* Only a subset of net device control commands is supported by KNI.

The components of an application using the DPDK Kernel NIC Interface are shown in :numref:`figure_kernel_nic_intf`.

.. _figure_kernel_nic_intf:

.. figure:: img/kernel_nic_intf.*

   Components of a DPDK KNI Application


The DPDK KNI Kernel Module
--------------------------

The KNI kernel loadable module ``rte_kni`` provides the kernel interface
for DPDK applications.

When the ``rte_kni`` module is loaded, it will create a device ``/dev/kni``
that is used by the DPDK KNI API functions to control and communicate with
the kernel module.

The ``rte_kni`` kernel module contains several optional parameters which
can be specified when the module is loaded to control its behavior:

.. code-block:: console

    # modinfo rte_kni.ko
    <snip>
    parm:           lo_mode: KNI loopback mode (default=lo_mode_none):
                    lo_mode_none        Kernel loopback disabled
                    lo_mode_fifo        Enable kernel loopback with fifo
                    lo_mode_fifo_skb    Enable kernel loopback with fifo and skb buffer
                     (charp)
    parm:           kthread_mode: Kernel thread mode (default=single):
                    single    Single kernel thread mode enabled.
                    multiple  Multiple kernel thread mode enabled.
                     (charp)
    parm:           carrier: Default carrier state for KNI interface (default=off):
                    off   Interfaces will be created with carrier state set to off.
                    on    Interfaces will be created with carrier state set to on.
                     (charp)
    parm:           enable_bifurcated: Enable request processing support for
                    bifurcated drivers, which means releasing rtnl_lock before calling
                    userspace callback and supporting async requests (default=off):
                    on    Enable request processing support for bifurcated drivers.
                     (charp)
    parm:           min_scheduling_interval: KNI thread min scheduling interval (default=100 microseconds)
                     (long)
    parm:           max_scheduling_interval: KNI thread max scheduling interval (default=200 microseconds)
                     (long)

Loading the ``rte_kni`` kernel module without any optional parameters is
the typical way a DPDK application gets packets into and out of the kernel
network stack. Without any parameters, only one kernel thread is created
for all KNI devices for packet reception on the kernel side, loopback mode is
disabled, and the default carrier state of KNI interfaces is set to *off*.

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko

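Each of the string parameters listed above maps to one of a few documented values, and an omitted parameter falls back to its default. The following minimal C sketch illustrates that mapping for ``lo_mode``, ``kthread_mode`` and ``carrier``. It is a userspace model written for this guide, not code from the ``rte_kni`` module. The enums and the ``parse_*()`` helpers are hypothetical, and returning -1 for an unrecognized value is an assumption of the sketch.

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical enums mirroring the documented parameter values. */
    enum kni_lo_mode { LO_MODE_NONE, LO_MODE_FIFO, LO_MODE_FIFO_SKB };
    enum kni_kthread_mode { KTHREAD_SINGLE, KTHREAD_MULTIPLE };
    enum kni_carrier { CARRIER_OFF, CARRIER_ON };

    /* NULL means "parameter not given": use the documented default. */
    static int parse_lo_mode(const char *s)
    {
        if (s == NULL || strcmp(s, "lo_mode_none") == 0)
            return LO_MODE_NONE;
        if (strcmp(s, "lo_mode_fifo") == 0)
            return LO_MODE_FIFO;
        if (strcmp(s, "lo_mode_fifo_skb") == 0)
            return LO_MODE_FIFO_SKB;
        return -1; /* assumption: unknown values are rejected */
    }

    static int parse_kthread_mode(const char *s)
    {
        if (s == NULL || strcmp(s, "single") == 0)
            return KTHREAD_SINGLE;
        if (strcmp(s, "multiple") == 0)
            return KTHREAD_MULTIPLE;
        return -1;
    }

    static int parse_carrier(const char *s)
    {
        if (s == NULL || strcmp(s, "off") == 0)
            return CARRIER_OFF;
        if (strcmp(s, "on") == 0)
            return CARRIER_ON;
        return -1;
    }

Loading the module with no parameters corresponds to calling each helper with ``NULL``, which yields loopback disabled, single kernel thread mode and carrier *off*.
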
.. _kni_loopback_mode:

Loopback Mode
~~~~~~~~~~~~~

For testing, the ``rte_kni`` kernel module can be loaded in loopback mode
by specifying the ``lo_mode`` parameter:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko lo_mode=lo_mode_fifo

The ``lo_mode_fifo`` loopback option will loop back ring enqueue/dequeue
operations in kernel space.

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko lo_mode=lo_mode_fifo_skb

The ``lo_mode_fifo_skb`` loopback option will loop back ring enqueue/dequeue
operations and sk buffer copies in kernel space.

If the ``lo_mode`` parameter is not specified, loopback mode is disabled.

.. _kni_kernel_thread_mode:

Kernel Thread Mode
~~~~~~~~~~~~~~~~~~

To provide performance flexibility, the ``rte_kni`` KNI kernel module
can be loaded with the ``kthread_mode`` parameter. The ``rte_kni`` kernel
module supports two options: "single kernel thread" mode and "multiple
kernel thread" mode.

Single kernel thread mode is enabled as follows:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko kthread_mode=single

This mode will create only one kernel thread for all KNI interfaces to
receive data on the kernel side. By default, this kernel thread is not
bound to any particular core, but the user can set the core affinity for
this kernel thread by setting the ``core_id`` and ``force_bind`` parameters
in ``struct rte_kni_conf`` when the first KNI interface is created.

For optimum performance, the kernel thread should be bound to a core
on the same socket as the DPDK lcores used in the application.

The KNI kernel module can also be configured to start a separate kernel
thread for each KNI interface created by the DPDK application. Multiple
kernel thread mode is enabled as follows:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko kthread_mode=multiple

This mode will create a separate kernel thread for each KNI interface to
receive data on the kernel side. The core affinity of each ``kni_thread``
kernel thread can be specified by setting the ``core_id`` and ``force_bind``
parameters in ``struct rte_kni_conf`` when each KNI interface is created.

Multiple kernel thread mode can provide scalable higher performance if
sufficient unused cores are available on the host system.

If the ``kthread_mode`` parameter is not specified, the "single kernel
thread" mode is used.

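The difference between the two modes comes down to how many kernel Rx threads exist and whose ``core_id``/``force_bind`` settings decide where they run. The sketch below models that rule in self-contained C. ``struct kni_conf_model`` is a hypothetical stand-in that holds only the two ``struct rte_kni_conf`` fields named above. ``kthreads_needed()`` and ``kthread_core()`` are illustrative helpers, not DPDK functions.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-in for the affinity fields of struct rte_kni_conf. */
    struct kni_conf_model {
        unsigned int core_id;
        bool force_bind;
    };

    enum kthread_mode { KTHREAD_SINGLE, KTHREAD_MULTIPLE };

    /* Number of kernel Rx threads serving n_ifaces KNI interfaces. */
    static size_t kthreads_needed(enum kthread_mode mode, size_t n_ifaces)
    {
        if (n_ifaces == 0)
            return 0; /* the thread is started with the first interface */
        return mode == KTHREAD_SINGLE ? 1 : n_ifaces;
    }

    /*
     * Core of the Rx thread that serves interface idx, or -1 if that
     * thread is left unbound. In single mode only the configuration of
     * the first interface created is taken into account.
     */
    static int kthread_core(enum kthread_mode mode,
                            const struct kni_conf_model *confs, size_t idx)
    {
        const struct kni_conf_model *c =
            (mode == KTHREAD_SINGLE) ? &confs[0] : &confs[idx];

        return c->force_bind ? (int)c->core_id : -1;
    }

For example, consider two interfaces whose configurations request cores 2 and 5. With ``kthread_mode=single``, one thread exists and it runs on core 2. With ``kthread_mode=multiple``, the second interface gets its own thread on core 5.
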
.. _kni_default_carrier_state:

Default Carrier State
~~~~~~~~~~~~~~~~~~~~~

The default carrier state of KNI interfaces created by the ``rte_kni``
kernel module is controlled via the ``carrier`` option when the module
is loaded.

If ``carrier=off`` is specified, the kernel module will leave the carrier
state of the interface *down* when the interface is management enabled.
The DPDK application can set the carrier state of the KNI interface using the
``rte_kni_update_link()`` function. This is useful for DPDK applications
which require that the carrier state of the KNI interface reflect the
actual link state of the corresponding physical NIC port.

If ``carrier=on`` is specified, the kernel module will automatically set
the carrier state of the interface to *up* when the interface is management
enabled. This is useful for DPDK applications which use the KNI interface as
a purely virtual interface that does not correspond to any physical hardware
and do not wish to explicitly set the carrier state of the interface with
``rte_kni_update_link()``. It is also useful for testing in loopback mode
where the NIC port may not be physically connected to anything.

To set the default carrier state to *on*:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko carrier=on

To set the default carrier state to *off*:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko carrier=off

If the ``carrier`` parameter is not specified, the default carrier state
of KNI interfaces will be set to *off*.

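With ``carrier=off``, an application that wants the KNI carrier to track the physical port typically polls the port's link status and calls ``rte_kni_update_link()`` when it changes. The sketch below models only the change detection, in plain C. The sampled link states are passed in as an array, and ``update_link_fn`` is a hypothetical callback type standing in for ``rte_kni_update_link()``. Issuing an update only on a transition, rather than on every poll, is a design choice of this sketch.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-in for rte_kni_update_link(kni, linkup). */
    typedef void (*update_link_fn)(void *ctx, unsigned int linkup);

    /*
     * Walk a series of sampled physical link states and push only the
     * transitions to the KNI interface. initial_carrier is 0 for
     * carrier=off and 1 for carrier=on. Returns the number of updates.
     */
    static size_t sync_carrier(const bool *samples, size_t n,
                               unsigned int initial_carrier,
                               update_link_fn update_link, void *ctx)
    {
        unsigned int carrier = initial_carrier;
        size_t updates = 0;

        for (size_t i = 0; i < n; i++) {
            unsigned int link = samples[i] ? 1 : 0;

            if (link != carrier) {
                update_link(ctx, link);
                carrier = link;
                updates++;
            }
        }
        return updates;
    }

    /* Example callback: remembers the last state pushed. */
    static void record_link(void *ctx, unsigned int linkup)
    {
        *(unsigned int *)ctx = linkup;
    }

Starting from carrier *off*, the samples down, up, up, down, up produce three updates, and the final carrier state is *up*.
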
.. _kni_bifurcated_device_support:

Bifurcated Device Support
~~~~~~~~~~~~~~~~~~~~~~~~~

User callbacks are executed while the kernel module holds the ``rtnl`` lock.
This causes a deadlock when a callback runs control commands on another Linux
kernel network interface.

Bifurcated devices have a kernel network driver part, and the
``enable_bifurcated`` parameter is used to prevent this deadlock for them.

To enable bifurcated device support:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko enable_bifurcated=on

Enabling bifurcated device support releases the ``rtnl`` lock before calling
the callback and reacquires it after the callback returns. It also enables
asynchronous requests, to support callbacks that require the ``rtnl`` lock to
work (interface down).

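The deadlock, and the workaround that ``enable_bifurcated`` applies, can be reproduced in userspace with an ordinary mutex. In the sketch below, a ``pthread_mutex_t`` stands in for the kernel's ``rtnl`` lock. ``handle_request()`` and ``callback_needing_rtnl()`` are hypothetical models of the module's request path and of a user callback, not their source. ``pthread_mutex_trylock()`` is used so that the would-be deadlock is reported instead of hanging the program.

.. code-block:: c

    #include <pthread.h>
    #include <stdbool.h>

    /* Stand-in for the kernel rtnl lock (default, non-recursive mutex). */
    static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Model of a user callback that runs a control command on another
     * netdev, which needs rtnl. Returns false where a real callback
     * would block forever because the lock is already held.
     */
    static bool callback_needing_rtnl(void)
    {
        if (pthread_mutex_trylock(&rtnl) != 0)
            return false; /* would deadlock */
        /* ... control command on another interface ... */
        pthread_mutex_unlock(&rtnl);
        return true;
    }

    /* Model of the request path; bifurcated mirrors enable_bifurcated=on. */
    static bool handle_request(bool bifurcated)
    {
        bool ok;

        pthread_mutex_lock(&rtnl);
        if (bifurcated)
            pthread_mutex_unlock(&rtnl); /* release before the callback */
        ok = callback_needing_rtnl();
        if (bifurcated)
            pthread_mutex_lock(&rtnl);   /* reacquire after the callback */
        pthread_mutex_unlock(&rtnl);
        return ok;
    }

``handle_request(false)`` reports the deadlock case, and ``handle_request(true)`` lets the callback take the lock.
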
KNI Kthread Scheduling
~~~~~~~~~~~~~~~~~~~~~~

The ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters
control the rescheduling interval of the KNI kthreads.

This might be useful for use cases that require improved
latency or performance for control plane traffic.

The implementation is backed by Linux High Precision Timers, and uses ``usleep_range``.
Hence, it will have the same granularity constraints as this Linux subsystem.

For Linux High Precision Timers, you can check the following resource: `Kernel Timers <http://www.kernel.org/doc/Documentation/timers/timers-howto.txt>`_

To set the ``min_scheduling_interval`` to a value of 100 microseconds:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko min_scheduling_interval=100

To set the ``max_scheduling_interval`` to a value of 200 microseconds:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko max_scheduling_interval=200

If the ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters are
not specified, the default interval limits will be set to *100* and *200* microseconds respectively.

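``usleep_range(min, max)`` lets the kernel wake the thread at any point in the window, so that nearby timers can be coalesced. Userspace has no direct equivalent. On Linux, a similar effect can be approximated by sleeping for the minimum interval with the thread's timer slack set to the width of the window via ``prctl(PR_SET_TIMERSLACK)``. The sketch below is such an analogue, useful for getting a feel for interval values. It is not the module's kthread code, and ``sleep_in_range_us()`` is a hypothetical name. It assumes ``min_us`` is strictly less than ``max_us``, because a slack of zero would reset the thread to its default slack.

.. code-block:: c

    #define _POSIX_C_SOURCE 200809L
    #include <errno.h>
    #include <sys/prctl.h>
    #include <time.h>

    /*
     * Sleep for at least min_us microseconds, allowing the kernel to
     * delay the wakeup by up to (max_us - min_us) microseconds of timer
     * slack. Returns 0 on success, -1 on bad arguments or on failure.
     */
    static int sleep_in_range_us(long min_us, long max_us)
    {
        struct timespec req, rem;

        if (min_us < 0 || max_us <= min_us)
            return -1;
        /* Timer slack is expressed in nanoseconds. */
        if (prctl(PR_SET_TIMERSLACK,
                  (unsigned long)(max_us - min_us) * 1000UL, 0UL, 0UL, 0UL) != 0)
            return -1;

        req.tv_sec = min_us / 1000000L;
        req.tv_nsec = (min_us % 1000000L) * 1000L;
        /* Resume the remaining time if a signal interrupts the sleep. */
        while (nanosleep(&req, &rem) != 0) {
            if (errno != EINTR)
                return -1;
            req = rem;
        }
        return 0;
    }
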
KNI Creation and Deletion
-------------------------

Before any KNI interfaces can be created, the ``rte_kni`` kernel module must
be loaded into the kernel and configured with the ``rte_kni_init()`` function.

The KNI interfaces are created by a DPDK application dynamically via the
``rte_kni_alloc()`` function.

The ``struct rte_kni_conf`` structure contains fields which allow the
user to specify the interface name, set the MTU size, set an explicit or
random MAC address and control the affinity of the kernel Rx thread(s)
(both single and multi-threaded modes).
By default, the KNI sample application gets the MTU from the matching device,
and in the case of the KNI PMD it is derived from the mbuf buffer length.

The ``struct rte_kni_ops`` structure contains pointers to functions to
handle requests from the ``rte_kni`` kernel module. These functions
allow DPDK applications to perform actions when the KNI interfaces are
manipulated by control commands or functions external to the application.

For example, the DPDK application may wish to enable/disable a physical
NIC port when a user enables/disables a KNI interface with ``ip link set
[up|down] dev <ifaceX>``. The DPDK application can register a callback for
``config_network_if`` which will be called when the interface management
state changes.

There are currently five callbacks for which the user can register
application functions:

``config_network_if``:

    Called when the management state of the KNI interface changes.
    For example, when the user runs ``ip link set [up|down] dev <ifaceX>``.

``change_mtu``:

    Called when the user changes the MTU size of the KNI
    interface. For example, when the user runs ``ip link set mtu <size>
    dev <ifaceX>``.

``config_mac_address``:

    Called when the user changes the MAC address of the KNI interface.
    For example, when the user runs ``ip link set address <MAC>
    dev <ifaceX>``. If the user sets this callback function to NULL,
    but sets the ``port_id`` field to a value other than -1, a default
    callback handler in the rte_kni library ``kni_config_mac_address()``
    will be called which calls ``rte_eth_dev_default_mac_addr_set()``
    on the specified ``port_id``.

``config_promiscusity``:

    Called when the user changes the promiscuity state of the KNI
    interface. For example, when the user runs ``ip link set promisc
    [on|off] dev <ifaceX>``. If the user sets this callback function to
    NULL, but sets the ``port_id`` field to a value other than -1, a default
    callback handler in the rte_kni library ``kni_config_promiscusity()``
    will be called which calls ``rte_eth_promiscuous_<enable|disable>()``
    on the specified ``port_id``.

``config_allmulticast``:

    Called when the user changes the allmulticast state of the KNI interface.
    For example, when the user runs ``ifconfig <ifaceX> [-]allmulti``. If the
    user sets this callback function to NULL, but sets the ``port_id`` field to
    a value other than -1, a default callback handler in the rte_kni library
    ``kni_config_allmulticast()`` will be called which calls
    ``rte_eth_allmulticast_<enable|disable>()`` on the specified ``port_id``.

In order to run these callbacks, the application must periodically call
the ``rte_kni_handle_request()`` function. Any user callback function
registered will be called directly from ``rte_kni_handle_request()`` so
care must be taken to prevent deadlock and to not block any DPDK fastpath
tasks. Typically DPDK applications which use these callbacks will need
to create a separate thread or secondary process to periodically call
``rte_kni_handle_request()``.

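The callback handling described above amounts to a dispatch from a request type to the matching ``struct rte_kni_ops`` member. For the MAC address, promiscuity and allmulticast requests, a library default is used when the application leaves the callback NULL but sets ``port_id``. The following self-contained sketch models that rule. The request enum, ``struct ops_model``, ``dispatch()`` and the result markers are hypothetical simplifications (one ``int`` argument per request, no ``rte_kni`` handle), not the library's actual types or signatures. Treating a request with no handler as "not handled" is an assumption of the sketch.

.. code-block:: c

    #include <stddef.h>

    /* Hypothetical request types, one per callback described above. */
    enum kni_req_model {
        REQ_CFG_NETWORK_IF,
        REQ_CHANGE_MTU,
        REQ_CHANGE_MAC_ADDR,
        REQ_CHANGE_PROMISC,
        REQ_CHANGE_ALLMULTI,
    };

    /* Simplified callback: port id plus one integer argument. */
    typedef int (*kni_cb)(int port_id, int arg);

    struct ops_model {
        int port_id; /* -1 means "no matching ethdev port" */
        kni_cb config_network_if;
        kni_cb change_mtu;
        kni_cb config_mac_address;
        kni_cb config_promiscusity;
        kni_cb config_allmulticast;
    };

    #define RESULT_DEFAULT_HANDLER 1000 /* marker from the stand-in default */
    #define RESULT_NOT_HANDLED     (-1) /* assumption of this sketch */

    /* Stand-in for library defaults such as kni_config_mac_address(). */
    static int default_handler(int port_id, int arg)
    {
        (void)port_id;
        (void)arg;
        return RESULT_DEFAULT_HANDLER;
    }

    /* Example application callback that accepts any request. */
    static int app_cb(int port_id, int arg)
    {
        (void)port_id;
        (void)arg;
        return 0;
    }

    static int dispatch(const struct ops_model *ops,
                        enum kni_req_model req, int arg)
    {
        kni_cb cb = NULL;
        int has_default = 0;

        switch (req) {
        case REQ_CFG_NETWORK_IF:  cb = ops->config_network_if; break;
        case REQ_CHANGE_MTU:      cb = ops->change_mtu; break;
        case REQ_CHANGE_MAC_ADDR: cb = ops->config_mac_address; has_default = 1; break;
        case REQ_CHANGE_PROMISC:  cb = ops->config_promiscusity; has_default = 1; break;
        case REQ_CHANGE_ALLMULTI: cb = ops->config_allmulticast; has_default = 1; break;
        }

        if (cb != NULL)
            return cb(ops->port_id, arg);
        if (has_default && ops->port_id != -1)
            return default_handler(ops->port_id, arg);
        return RESULT_NOT_HANDLED;
    }

In a real application, this dispatch happens inside ``rte_kni_handle_request()``. That is why the callbacks run on whichever thread polls that function.
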
The KNI interfaces can be deleted by a DPDK application with
``rte_kni_release()``. All KNI interfaces not explicitly deleted will be
deleted when the ``/dev/kni`` device is closed, either explicitly with
``rte_kni_close()`` or when the DPDK application is closed.

DPDK mbuf Flow
--------------

To minimize the amount of DPDK code running in kernel space, the mbuf mempool is managed in userspace only.
The kernel module will be aware of mbufs,
but all mbuf allocation and free operations will be handled by the DPDK application only.

:numref:`figure_pkt_flow_kni` shows a typical scenario with packets sent in both directions.

.. _figure_pkt_flow_kni:

.. figure:: img/pkt_flow_kni.*

   Packet Flow via mbufs in the DPDK KNI


Use Case: Ingress
-----------------

On the DPDK RX side, the mbuf is allocated by the PMD in the RX thread context.
This thread will enqueue the mbuf in the rx_q FIFO,
and the next pointers in the mbuf chain are converted to physical addresses.
The KNI thread polls the rx_q of all active KNI devices.
If an mbuf is dequeued, it is converted to a sk_buff and sent to the net stack via ``netif_rx()``.
The dequeued mbuf must be freed, so the same pointer is sent back in the free_q FIFO,
and the next pointers, if any, must be converted back to virtual addresses before being put in the free_q FIFO.

The RX thread, in the same main loop, polls this FIFO and frees the mbuf after dequeuing it.
The address conversion of the next pointer prevents chained mbufs
in different hugepage segments from causing a kernel crash.

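The next-pointer conversion can be pictured with a toy address model. In the sketch below, memory is described by a small hypothetical table of segments, each mapping a virtual range to a physical range with its own offset. ``struct seg`` is a minimal stand-in for ``struct rte_mbuf`` that keeps only the chain link. The real code derives addresses from DPDK's memory and mbuf metadata. This sketch only shows that each ``next`` link must be translated individually, because neighbouring segments of one chain can live in memory areas with different VA-to-PA offsets. The physical values are opaque numbers here and are never dereferenced.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Toy mbuf: only the chain link matters here. */
    struct seg {
        struct seg *next;
        char data[56];
    };

    /* Hypothetical memory segment: a VA range and the PA it maps to. */
    struct memseg_model {
        uintptr_t va;
        uintptr_t pa;
        size_t len;
    };

    static struct memseg_model segs[2];
    static size_t nsegs;

    /* Returns 0 if va is not inside any known segment (toy model). */
    static uintptr_t va2pa(const void *va)
    {
        uintptr_t a = (uintptr_t)va;

        for (size_t i = 0; i < nsegs; i++)
            if (a >= segs[i].va && a < segs[i].va + segs[i].len)
                return segs[i].pa + (a - segs[i].va);
        return 0;
    }

    static void *pa2va(uintptr_t pa)
    {
        for (size_t i = 0; i < nsegs; i++)
            if (pa >= segs[i].pa && pa < segs[i].pa + segs[i].len)
                return (void *)(segs[i].va + (pa - segs[i].pa));
        return NULL;
    }

    /* Before enqueueing on rx_q: store each next link as a physical address. */
    static void chain_next_to_pa(struct seg *head)
    {
        struct seg *cur = head;

        while (cur != NULL && cur->next != NULL) {
            struct seg *nxt = cur->next;          /* still virtual */
            cur->next = (struct seg *)va2pa(nxt); /* now physical */
            cur = nxt;
        }
    }

    /* Before putting on free_q: restore each next link to a virtual address. */
    static void chain_next_to_va(struct seg *head)
    {
        struct seg *cur = head;

        while (cur != NULL && cur->next != NULL) {
            cur->next = pa2va((uintptr_t)cur->next);
            cur = cur->next;
        }
    }
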
Use Case: Egress
----------------

For packet egress, the DPDK application must first enqueue several mbufs to create an mbuf cache on the kernel side.

The packet is received from the Linux net stack by calling the ``kni_net_tx()`` callback.
The mbuf is dequeued (without waiting, due to the cache) and filled with data from the sk_buff.
The sk_buff is then freed and the mbuf sent in the tx_q FIFO.

The DPDK TX thread dequeues the mbuf and sends it to the PMD via ``rte_eth_tx_burst()``.
It then puts the mbuf back in the cache.

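The FIFOs used in both directions (rx_q and free_q for ingress; tx_q and the mbuf cache, called alloc_q in the DPDK sources, for egress) are single-producer, single-consumer rings of mbuf pointers shared by the application and the kernel module. The sketch below is a simplified, single-threaded model of such a ring and of the kernel-side egress step. It takes a cached mbuf and hands it to tx_q, and treats the packet as dropped if the cache is empty or tx_q is full. ``struct fifo`` and its functions are hypothetical. A real shared ring also needs memory barriers because the two sides run concurrently, and this model omits them.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    #define FIFO_SIZE 8 /* must be a power of two; one slot stays empty */

    /* Single-producer, single-consumer ring of pointers. */
    struct fifo {
        unsigned int write; /* next slot to fill, advanced by the producer */
        unsigned int read;  /* next slot to drain, advanced by the consumer */
        void *buf[FIFO_SIZE];
    };

    static bool fifo_put(struct fifo *f, void *p)
    {
        unsigned int next = (f->write + 1) & (FIFO_SIZE - 1);

        if (next == f->read)
            return false; /* full */
        f->buf[f->write] = p;
        f->write = next;
        return true;
    }

    static void *fifo_get(struct fifo *f)
    {
        void *p;

        if (f->read == f->write)
            return NULL; /* empty */
        p = f->buf[f->read];
        f->read = (f->read + 1) & (FIFO_SIZE - 1);
        return p;
    }

    static unsigned int fifo_count(const struct fifo *f)
    {
        return (f->write - f->read) & (FIFO_SIZE - 1);
    }

    /* Kernel side of egress: take a cached mbuf, fill it, queue it on tx_q. */
    static bool kernel_tx_one(struct fifo *alloc_q, struct fifo *tx_q)
    {
        void *mbuf;

        /* Check both rings first so that no mbuf is lost when tx_q is full. */
        if (fifo_count(alloc_q) == 0 || fifo_count(tx_q) == FIFO_SIZE - 1)
            return false; /* packet treated as dropped in this sketch */
        mbuf = fifo_get(alloc_q); /* no waiting: the cache is pre-filled */
        /* ... copy the sk_buff payload into the mbuf here ... */
        return fifo_put(tx_q, mbuf);
    }
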
IOVA = VA: Support
------------------

KNI operates in the IOVA_VA scheme when

- ``LINUX_VERSION_CODE >= KERNEL_VERSION(4, 10, 0)`` and
- EAL option ``--iova-mode=va`` is passed or the bus IOVA scheme in DPDK is selected
  as RTE_IOVA_VA.

Due to IOVA to KVA address translations, there can be a performance impact
depending on the KNI use case. For mitigation, IOVA can be forced to PA via the EAL
``--iova-mode=pa`` option; the IOVA_DC bus IOMMU scheme can also
result in IOVA as PA.

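The two conditions above can be expressed as a small predicate. The sketch below reproduces the packing used by the kernel's ``KERNEL_VERSION()`` macro (major version from bit 16 upward, minor in bits 8 to 15, patch in the low byte) under the hypothetical name ``KVER()``, so that it compiles outside a kernel tree. ``kni_uses_iova_va()`` and the ``iova_mode_model`` enum are also hypothetical. The enum stands in for the IOVA mode the EAL finally selected, and the version argument stands in for the ``LINUX_VERSION_CODE`` of the kernel the module is built against.

.. code-block:: c

    #include <stdbool.h>

    /* Same packing as the kernel's KERNEL_VERSION(a, b, c). */
    #define KVER(a, b, c) \
        (((unsigned int)(a) << 16) + ((unsigned int)(b) << 8) + (unsigned int)(c))

    /* Hypothetical stand-in for the IOVA mode the EAL settled on. */
    enum iova_mode_model { IOVA_MODEL_PA, IOVA_MODEL_VA };

    /*
     * KNI uses IOVA as VA only when the kernel is 4.10.0 or newer and the
     * EAL selected VA (via --iova-mode=va or the bus IOVA scheme).
     */
    static bool kni_uses_iova_va(unsigned int linux_version_code,
                                 enum iova_mode_model mode)
    {
        return linux_version_code >= KVER(4, 10, 0) && mode == IOVA_MODEL_VA;
    }

Under this model, passing ``--iova-mode=pa`` always yields ``false``, which matches the mitigation described above.
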
Ethtool
-------

Ethtool is a Linux-specific tool with corresponding support in the kernel.
The current version of KNI provides minimal ethtool functionality
including querying version and link state. It does not support link
control, statistics, or dumping device registers.