If I execute "l2fwd -c f -m 64 -n 3 -- -p 3", I get the following output, indicating that there are no socket 0 hugepages to allocate the mbuf and ring structures to?
I have set up a total of 1024 hugepages (that is, allocated 512 2 MB pages to each NUMA node).
The -m command line parameter does not guarantee that huge pages will be reserved on specific sockets. Therefore, allocated huge pages may not be on socket 0.
To request memory to be reserved on a specific socket, please use the --socket-mem command-line parameter instead of -m.
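For example, an invocation along the following lines (the per-socket amounts are illustrative) reserves 64 MB on socket 0 and nothing on socket 1:

.. code-block:: console

   ./l2fwd -c f -n 3 --socket-mem=64,0 -- -p 3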
I am running a 32-bit DPDK application on a NUMA system, and sometimes the application initializes fine but cannot allocate memory. Why is that happening?
32-bit applications have limitations in terms of how much virtual memory is available, hence the number of hugepages they are able to allocate is also limited (1 GB per page size).
If your system has a lot (>1 GB per page size) of hugepage memory, not all of it will be allocated.
Due to hugepages typically being allocated on a local NUMA node, the hugepage allocation the application gets during initialization depends on which
NUMA node it is running on (the EAL does not affinitize cores until much later in the initialization process).
To work around this, run the application under taskset, affinitizing it to the core that will become the master core, so that hugepages are reserved on the correct socket.
For example, if your EAL coremask is 0xff0, the master core will usually be the first core in the coremask (0x10); this is what you have to supply to taskset::
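
   taskset 0x10 ./l2fwd -c 0xff0 -n 2

The l2fwd arguments above are illustrative; substitute your own application and EAL options, keeping the taskset mask equal to the bit of the intended master core.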
Yes, each EAL has a configuration file that is located in the /config directory. Within each configuration file, you will find CONFIG_RTE_LOG_LEVEL=8.
You can change this to a lower value, such as 6, to reduce the printout of debug information. The following is a list of log levels that can be found in the rte_log.h file.
You must remove, then rebuild, the EAL directory for the change to take effect, as the configuration file is used to create the rte_config.h file in the EAL directory.
.. code-block:: c

   #define RTE_LOG_EMERG 1U /* System is unusable.               */
   #define RTE_LOG_ALERT 2U /* Action must be taken immediately. */
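The remaining levels follow the same pattern; the values below are taken from rte_log.h and may differ slightly between DPDK releases:

.. code-block:: c

   #define RTE_LOG_CRIT    3U /* Critical conditions.              */
   #define RTE_LOG_ERR     4U /* Error conditions.                 */
   #define RTE_LOG_WARNING 5U /* Warning conditions.               */
   #define RTE_LOG_NOTICE  6U /* Normal but significant condition. */
   #define RTE_LOG_INFO    7U /* Informational.                    */
   #define RTE_LOG_DEBUG   8U /* Debug-level messages.             */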
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received, in this case, all 16 packets.
The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because the cost of tail pointer updates to both the RX and TX queues
can be spread across 16 packets, effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency, because the first packet that was received must also wait for the other 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed, because the NIC will not transmit until the TX tail pointer has been updated,
which does not happen until all 16 packets have been processed for transmission.
To consistently achieve low latency even under heavy system load, the application developer should avoid processing packets in bunches.
The testpmd application can be configured from the command line to use a burst value of 1.
This allows a single packet to be processed at a time, providing lower latency, but with the added cost of lower throughput.
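For instance, a testpmd invocation along these lines (the core mask and other options are illustrative) sets the burst size to 1:

.. code-block:: console

   ./testpmd -c 0x3 -n 4 -- --burst=1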
The setup uses eight logical cores on each processor, with RSS configured to distribute the network load from two 10 GbE interfaces to the cores on each processor.
Without NUMA enabled, memory is allocated from both sockets, since memory is interleaved.
Therefore, consecutive 64B chunks alternate between the two memory domains.
The first 64B chunk is mapped to node 0, the second 64B chunk is mapped to node 1, the third to node 0, the fourth to node 1.
If you allocated 256B, you would get memory that looks like this:
.. code-block:: console

   256B buffer
   Offset 0x00 - Node 0
   Offset 0x40 - Node 1
   Offset 0x80 - Node 0
   Offset 0xc0 - Node 1
Therefore, packet buffers and descriptor rings are allocated from both memory domains, incurring QPI bandwidth for accesses to the remote domain and much higher latency.
For best performance with NUMA disabled, only one socket should be populated.
As the DPDK operates, it opens a lot of files and can reach the open files limit, which is set using the ulimit command or in the limits.conf file.
This is especially true when using a large number (>512) of 2 MB hugepages. Please increase the open file limit if your application is not able to open files.
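For example, the limit for the current shell can be raised before launching the application (the value shown is illustrative):

.. code-block:: console

   ulimit -n 2048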
In this case, this has to be done manually on the VM host, using the following command:
.. code-block:: console

   ip link set <interface> vf <VF function> mac <MAC address>
where <interface> is the interface providing the virtual functions (for example, eth0), <VF function> is the virtual function number (for example, 0), and <MAC address> is the desired MAC address.
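A concrete example, using eth0, virtual function 0, and a placeholder MAC address chosen purely for illustration:

.. code-block:: console

   ip link set eth0 vf 0 mac 00:11:22:33:44:55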
Currently, the table implementation is not thread safe, and it assumes that locking between threads and processes is handled by the user's application.
This is likely to be supported in future releases.
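Until then, a minimal sketch of application-level locking, assuming the table in question is an rte_hash table and that a single spinlock is acceptable for the workload (the function and variable names are illustrative):

.. code-block:: c

   #include <rte_spinlock.h>
   #include <rte_hash.h>

   static rte_spinlock_t tbl_lock = RTE_SPINLOCK_INITIALIZER;
   static struct rte_hash *tbl; /* created elsewhere with rte_hash_create() */

   /* Serialize table updates (and lookups, if readers can race with
    * writers) through the application's own lock. */
   static int
   locked_add_key(const void *key)
   {
       int ret;

       rte_spinlock_lock(&tbl_lock);
       ret = rte_hash_add_key(tbl, key); /* not thread safe on its own */
       rte_spinlock_unlock(&tbl_lock);
       return ret;
   }

An rte_rwlock could be used instead when lookups greatly outnumber insertions and deletions.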