..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

.. _kni:

Kernel NIC Interface
====================

The DPDK Kernel NIC Interface (KNI) allows userspace applications access to the Linux* control plane.

The benefits of using the DPDK KNI are:

* Faster than existing Linux TUN/TAP interfaces
  (by eliminating system calls and copy_to_user()/copy_from_user() operations).

* Allows management of DPDK ports using standard Linux net tools such as ethtool, ifconfig and tcpdump.

* Allows an interface with the kernel network stack.

The components of an application using the DPDK Kernel NIC Interface are shown in :numref:`figure_kernel_nic_intf`.

.. _figure_kernel_nic_intf:

.. figure:: img/kernel_nic_intf.*

   Components of a DPDK KNI Application

The DPDK KNI Kernel Module
--------------------------

The KNI kernel loadable module ``rte_kni`` provides the kernel interface
for DPDK applications.

When the ``rte_kni`` module is loaded, it will create a device ``/dev/kni``
that is used by the DPDK KNI API functions to control and communicate with
the kernel module.

The ``rte_kni`` kernel module contains several optional parameters which
can be specified when the module is loaded to control its behavior:

.. code-block:: console

    # modinfo rte_kni.ko
    <snip>
    parm:           lo_mode: KNI loopback mode (default=lo_mode_none):
                    lo_mode_none        Kernel loopback disabled
                    lo_mode_fifo        Enable kernel loopback with fifo
                    lo_mode_fifo_skb    Enable kernel loopback with fifo and skb buffer
                    (charp)
    parm:           kthread_mode: Kernel thread mode (default=single):
                    single    Single kernel thread mode enabled.
                    multiple  Multiple kernel thread mode enabled.
                    (charp)
    parm:           carrier: Default carrier state for KNI interface (default=off):
                    off   Interfaces will be created with carrier state set to off.
                    on    Interfaces will be created with carrier state set to on.
                    (charp)
    parm:           enable_bifurcated: Enable request processing support for
                    bifurcated drivers, which means releasing rtnl_lock before calling
                    userspace callback and supporting async requests (default=off):
                    on    Enable request processing support for bifurcated drivers.
                    (charp)
    parm:           min_scheduling_interval: KNI thread min scheduling interval (default=100 microseconds)
                    (long)
    parm:           max_scheduling_interval: KNI thread max scheduling interval (default=200 microseconds)
                    (long)

Loading the ``rte_kni`` kernel module without any optional parameters is
the typical way a DPDK application gets packets into and out of the kernel
network stack. Without any parameters, only one kernel thread is created
for all KNI devices for packet receiving on the kernel side, loopback mode
is disabled, and the default carrier state of KNI interfaces is set to *off*.

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko

.. _kni_loopback_mode:

Loopback Mode
~~~~~~~~~~~~~

For testing, the ``rte_kni`` kernel module can be loaded in loopback mode
by specifying the ``lo_mode`` parameter:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko lo_mode=lo_mode_fifo

The ``lo_mode_fifo`` loopback option will loop back ring enqueue/dequeue
operations in kernel space.

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko lo_mode=lo_mode_fifo_skb

The ``lo_mode_fifo_skb`` loopback option will loop back ring enqueue/dequeue
operations and sk buffer copies in kernel space.

If the ``lo_mode`` parameter is not specified, loopback mode is disabled.

.. _kni_kernel_thread_mode:

Kernel Thread Mode
~~~~~~~~~~~~~~~~~~

To provide flexibility of performance, the ``rte_kni`` KNI kernel module
can be loaded with the ``kthread_mode`` parameter. The ``rte_kni`` kernel
module supports two options: "single kernel thread" mode and "multiple
kernel thread" mode.

Single kernel thread mode is enabled as follows:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko kthread_mode=single

This mode will create only one kernel thread for all KNI interfaces to
receive data on the kernel side. By default, this kernel thread is not
bound to any particular core, but the user can set the core affinity for
this kernel thread by setting the ``core_id`` and ``force_bind`` parameters
in ``struct rte_kni_conf`` when the first KNI interface is created, as in
the sketch below.

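A minimal sketch of pinning the single kernel thread, assuming an
application-chosen core number (here ``4``) and an illustrative interface
name; only the affinity-related fields are shown being set:

.. code-block:: c

    #include <stdio.h>
    #include <string.h>
    #include <rte_kni.h>

    /* Hypothetical helper: fill in the kthread affinity fields of
     * struct rte_kni_conf before the first call to rte_kni_alloc(). */
    static void
    kni_conf_set_affinity(struct rte_kni_conf *conf)
    {
        memset(conf, 0, sizeof(*conf));
        snprintf(conf->name, RTE_KNI_NAMESIZE, "vEth0");
        conf->core_id = 4;     /* core for the kernel Rx thread */
        conf->force_bind = 1;  /* bind the kthread to core_id */
    }
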
For optimum performance, the kernel thread should be bound to a core on
the same socket as the DPDK lcores used in the application.

The KNI kernel module can also be configured to start a separate kernel
thread for each KNI interface created by the DPDK application. Multiple
kernel thread mode is enabled as follows:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko kthread_mode=multiple

This mode will create a separate kernel thread for each KNI interface to
receive data on the kernel side. The core affinity of each ``kni_thread``
kernel thread can be specified by setting the ``core_id`` and ``force_bind``
parameters in ``struct rte_kni_conf`` when each KNI interface is created.

Multiple kernel thread mode can provide scalable higher performance if
sufficient unused cores are available on the host system.

If the ``kthread_mode`` parameter is not specified, the "single kernel
thread" mode is used.

.. _kni_default_carrier_state:

Default Carrier State
~~~~~~~~~~~~~~~~~~~~~

The default carrier state of KNI interfaces created by the ``rte_kni``
kernel module is controlled via the ``carrier`` option when the module
is loaded.

If ``carrier=off`` is specified, the kernel module will leave the carrier
state of the interface *down* when the interface is management enabled.
The DPDK application can set the carrier state of the KNI interface using the
``rte_kni_update_link()`` function. This is useful for DPDK applications
which require that the carrier state of the KNI interface reflect the
actual link state of the corresponding physical NIC port.

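A brief sketch of propagating the physical link state to the KNI interface,
assuming the application already holds a ``struct rte_kni *`` handle and a
valid port id (the helper name is illustrative):

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_kni.h>

    /* Hypothetical helper: mirror the NIC link state onto the KNI device. */
    static void
    sync_carrier(struct rte_kni *kni, uint16_t port_id)
    {
        struct rte_eth_link link;

        if (rte_eth_link_get_nowait(port_id, &link) == 0)
            rte_kni_update_link(kni, link.link_status == RTE_ETH_LINK_UP);
    }
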
If ``carrier=on`` is specified, the kernel module will automatically set
the carrier state of the interface to *up* when the interface is management
enabled. This is useful for DPDK applications which use the KNI interface as
a purely virtual interface that does not correspond to any physical hardware
and do not wish to explicitly set the carrier state of the interface with
``rte_kni_update_link()``. It is also useful for testing in loopback mode
where the NIC port may not be physically connected to anything.

To set the default carrier state to *on*:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko carrier=on

To set the default carrier state to *off*:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko carrier=off

If the ``carrier`` parameter is not specified, the default carrier state
of KNI interfaces will be set to *off*.

.. _kni_bifurcated_device_support:

Bifurcated Device Support
~~~~~~~~~~~~~~~~~~~~~~~~~

User callbacks are executed while the kernel module holds the ``rtnl`` lock;
this causes a deadlock when callbacks run control commands on another Linux
kernel network interface.

Bifurcated devices have a kernel network driver part, and to prevent this
deadlock for them the ``enable_bifurcated`` parameter is used.

To enable bifurcated device support:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko enable_bifurcated=on

Enabling bifurcated device support releases the ``rtnl`` lock before calling
the callback and takes it back after the callback returns. It also enables
asynchronous requests, to support callbacks that require the rtnl lock to
work (for example, bringing the interface down).

KNI Kthread Scheduling
~~~~~~~~~~~~~~~~~~~~~~

The ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters
control the rescheduling interval of the KNI kthreads.

This is useful for use cases that require improved latency or throughput
for control plane traffic.

The implementation is backed by Linux High Precision Timers and uses
``usleep_range``. Hence, it has the same granularity constraints as this
Linux subsystem.

For more on Linux High Precision Timers, see `Kernel Timers <http://www.kernel.org/doc/Documentation/timers/timers-howto.txt>`_.

To set the ``min_scheduling_interval`` to a value of 100 microseconds:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko min_scheduling_interval=100

To set the ``max_scheduling_interval`` to a value of 200 microseconds:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko max_scheduling_interval=200

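Both limits can also be passed together when loading the module, for example
to trade some extra CPU usage for lower control plane latency:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko min_scheduling_interval=5 max_scheduling_interval=10
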
If the ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters are
not specified, the default interval limits will be set to *100* and *200*
microseconds, respectively.

KNI Creation and Deletion
-------------------------

Before any KNI interfaces can be created, the ``rte_kni`` kernel module must
be loaded into the kernel and configured with the ``rte_kni_init()`` function.

The KNI interfaces are created by a DPDK application dynamically via the
``rte_kni_alloc()`` function.

The ``struct rte_kni_conf`` structure contains fields which allow the
user to specify the interface name, set the MTU size, set an explicit or
random MAC address and control the affinity of the kernel Rx thread(s)
(both single and multi-threaded modes).
By default the KNI sample application gets the MTU from the matching device,
and in the case of the KNI PMD it is derived from the mbuf buffer length.

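A minimal creation sketch, assuming an already-initialized EAL and mbuf
mempool; the interface name, MTU, mbuf size and interface count below are
illustrative choices, not required values:

.. code-block:: c

    #include <stdio.h>
    #include <string.h>
    #include <rte_kni.h>
    #include <rte_mempool.h>

    #define MAX_KNI_IFACES 4

    static struct rte_kni *
    create_kni(struct rte_mempool *pktmbuf_pool)
    {
        struct rte_kni_conf conf;

        /* Allow up to MAX_KNI_IFACES interfaces to be created later. */
        if (rte_kni_init(MAX_KNI_IFACES) < 0)
            return NULL;

        memset(&conf, 0, sizeof(conf));
        snprintf(conf.name, RTE_KNI_NAMESIZE, "vEth0");
        conf.mbuf_size = 2048; /* must fit the mempool's data room */
        conf.mtu = 1500;

        /* NULL ops: no request callbacks are registered yet. */
        return rte_kni_alloc(pktmbuf_pool, &conf, NULL);
    }
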
The ``struct rte_kni_ops`` structure contains pointers to functions to
handle requests from the ``rte_kni`` kernel module. These functions
allow DPDK applications to perform actions when the KNI interfaces are
manipulated by control commands or functions external to the application.

For example, the DPDK application may wish to enable/disable a physical
NIC port when a user enables/disables a KNI interface with ``ip link set
[up|down] dev <ifaceX>``. The DPDK application can register a callback for
``config_network_if`` which will be called when the interface management
state changes.

There are currently five callbacks for which the user can register
application functions:

``config_network_if``:

    Called when the management state of the KNI interface changes.
    For example, when the user runs ``ip link set [up|down] dev <ifaceX>``.

``change_mtu``:

    Called when the user changes the MTU size of the KNI
    interface. For example, when the user runs ``ip link set mtu <size>
    dev <ifaceX>``.

``config_mac_address``:

    Called when the user changes the MAC address of the KNI interface.
    For example, when the user runs ``ip link set address <MAC>
    dev <ifaceX>``. If the user sets this callback function to NULL,
    but sets the ``port_id`` field to a value other than -1, a default
    callback handler in the rte_kni library ``kni_config_mac_address()``
    will be called which calls ``rte_eth_dev_default_mac_addr_set()``
    on the specified ``port_id``.

``config_promiscusity``:

    Called when the user changes the promiscuity state of the KNI
    interface. For example, when the user runs ``ip link set promisc
    [on|off] dev <ifaceX>``. If the user sets this callback function to
    NULL, but sets the ``port_id`` field to a value other than -1, a default
    callback handler in the rte_kni library ``kni_config_promiscusity()``
    will be called which calls ``rte_eth_promiscuous_<enable|disable>()``
    on the specified ``port_id``.

``config_allmulticast``:

    Called when the user changes the allmulticast state of the KNI interface.
    For example, when the user runs ``ifconfig <ifaceX> [-]allmulti``. If the
    user sets this callback function to NULL, but sets the ``port_id`` field to
    a value other than -1, a default callback handler in the rte_kni library
    ``kni_config_allmulticast()`` will be called which calls
    ``rte_eth_allmulticast_<enable|disable>()`` on the specified ``port_id``.

In order to run these callbacks, the application must periodically call
the ``rte_kni_handle_request()`` function. Any user callback function
registered will be called directly from ``rte_kni_handle_request()`` so
care must be taken to prevent deadlock and to not block any DPDK fastpath
tasks. Typically DPDK applications which use these callbacks will need
to create a separate thread or secondary process to periodically call
``rte_kni_handle_request()``.

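A compact sketch of registering a ``config_network_if`` callback and
servicing requests from a dedicated control thread; the callback body and
the ``app_quit`` flag are illustrative placeholders:

.. code-block:: c

    #include <stdio.h>
    #include <string.h>
    #include <rte_kni.h>
    #include <rte_mempool.h>

    static volatile int app_quit; /* set elsewhere on shutdown */

    /* Hypothetical handler: react to `ip link set [up|down] dev <ifaceX>`. */
    static int
    kni_config_network_if(uint16_t port_id, uint8_t if_up)
    {
        printf("port %u: interface %s\n", port_id, if_up ? "up" : "down");
        return 0; /* report success back to the kernel */
    }

    /* Run in a separate (non-fastpath) thread or secondary process. */
    static void
    control_loop(struct rte_kni *kni)
    {
        /* Poll for requests; callbacks run in this thread's context. */
        while (!app_quit)
            rte_kni_handle_request(kni);
    }

    /* At allocation time, pass the ops when creating the interface. */
    static struct rte_kni *
    create_with_ops(struct rte_mempool *pool, struct rte_kni_conf *conf,
                    uint16_t port_id)
    {
        struct rte_kni_ops ops;

        memset(&ops, 0, sizeof(ops));
        ops.port_id = port_id;
        ops.config_network_if = kni_config_network_if;
        return rte_kni_alloc(pool, conf, &ops);
    }
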
The KNI interfaces can be deleted by a DPDK application with
``rte_kni_release()``. All KNI interfaces not explicitly deleted will be
deleted when the ``/dev/kni`` device is closed, either explicitly with
``rte_kni_close()`` or when the DPDK application is closed.

DPDK mbuf Flow
--------------

To minimize the amount of DPDK code running in kernel space, the mbuf mempool is managed in userspace only.
The kernel module will be aware of mbufs,
but all mbuf allocation and free operations will be handled by the DPDK application only.

:numref:`figure_pkt_flow_kni` shows a typical scenario with packets sent in both directions.

.. _figure_pkt_flow_kni:

.. figure:: img/pkt_flow_kni.*

   Packet Flow via mbufs in the DPDK KNI

Use Case: Ingress
-----------------

On the DPDK RX side, the mbuf is allocated by the PMD in the RX thread context.
This thread will enqueue the mbuf in the rx_q FIFO,
and the next pointers in the mbuf chain will be converted to physical addresses.
The KNI thread will poll the rx_q of all active KNI devices.
If an mbuf is dequeued, it will be converted to a sk_buff and sent to the net stack via netif_rx().
The dequeued mbuf must be freed, so the same pointer is sent back in the free_q FIFO,
and the next pointers, if any, must be converted back to virtual addresses before being put in the free_q FIFO.

The RX thread, in the same main loop, polls this FIFO and frees the mbuf after dequeuing it.
The address conversion of the next pointer is to prevent chained mbufs
in different hugepage segments from causing a kernel crash.

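A rough sketch of the userspace side of this flow, assuming a single port
and queue (the burst size is an arbitrary choice):

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_kni.h>
    #include <rte_mbuf.h>

    #define BURST_SZ 32

    /* Forward packets received on a NIC port into the kernel via KNI. */
    static void
    ingress_burst(uint16_t port_id, struct rte_kni *kni)
    {
        struct rte_mbuf *pkts[BURST_SZ];
        unsigned int nb_rx, nb_tx, i;

        nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SZ);
        nb_tx = rte_kni_tx_burst(kni, pkts, nb_rx);

        /* Free anything the rx_q FIFO could not accept. */
        for (i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);
    }
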
Use Case: Egress
----------------

For packet egress the DPDK application must first enqueue several mbufs to create an mbuf cache on the kernel side.

The packet is received from the Linux net stack, by calling the kni_net_tx() callback.
The mbuf is dequeued (without waiting, thanks to the cache) and filled with data from the sk_buff.
The sk_buff is then freed and the mbuf is sent in the tx_q FIFO.

The DPDK TX thread dequeues the mbuf and sends it to the PMD via ``rte_eth_tx_burst()``.
It then puts the mbuf back in the cache.

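The matching userspace sketch for this direction, under the same
single-port, single-queue assumptions as above:

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_kni.h>
    #include <rte_mbuf.h>

    #define BURST_SZ 32

    /* Drain packets the kernel sent through KNI and transmit them on the NIC. */
    static void
    egress_burst(struct rte_kni *kni, uint16_t port_id)
    {
        struct rte_mbuf *pkts[BURST_SZ];
        unsigned int nb_rx, nb_tx, i;

        nb_rx = rte_kni_rx_burst(kni, pkts, BURST_SZ);
        nb_tx = rte_eth_tx_burst(port_id, 0, pkts, nb_rx);

        /* Free packets the NIC TX queue could not accept. */
        for (i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);
    }
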
IOVA = VA: Support
------------------

KNI operates in IOVA_VA scheme when

- LINUX_VERSION_CODE >= KERNEL_VERSION(4, 10, 0) and
- EAL option `iova-mode=va` is passed or the bus IOVA scheme in the DPDK is
  selected as RTE_IOVA_VA.

Due to IOVA-to-KVA address translations, there can be a performance impact
depending on the KNI use case. To mitigate this, IOVA can be forced to PA
via the EAL ``--iova-mode=pa`` option; the IOVA_DC bus iommu scheme can
also result in IOVA as PA.

Ethtool
-------

Ethtool is a Linux-specific tool with corresponding support in the kernel.
The current version of kni provides minimal ethtool functionality
including querying version and link state. It does not support link
control, statistics, or dumping device registers.