When building ENA as compiled into the kernel, the driver would fail to
build. Resolve the problem by introducing the following changes:
1. Add missing `ena_rss.c` entry in `sys/conf/files`.
2. Prevent SYSCTL_ADD_INT from throwing an assert due to an extra
CTLTYPE_INT flag.
Fixes: 986e7b9227 ("ena: Move RSS logic into its own source files")
Fixes: 6d1ef2abd3 ("ena: Implement full RSS reconfiguration")
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
MFC after: 1 week
Some of the changes in this release:
* Hardware RSS hash key reconfiguration and indirection table
reconfiguration support.
* Full kernel RSS support.
* Extra statistic counters.
* Netmap support for ENAv3.
* Locking assertions.
* Extra log messages.
* Reset handling fixes.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Bind RX/TX queues and MSI-X vectors to matching CPUs based on the RSS
bucket entries.
Introduce sysctls for the following RSS functionality:
- rss.indir_table: indirection table mapping
- rss.indir_table_size: indirection table size
- rss.key: RSS hash key (if Toeplitz used)
Said sysctls are only available when compiled without `option RSS`, as
kernel-side RSS support currently doesn't offer RSS reconfiguration.
Migrate the hash algorithm from CRC32 to Toeplitz and change the initial
hash value to 0x0 in order to match the standard Toeplitz implementation.
Provide helpers for hash key inversion required for HW operations.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Provide the following sysctl statistics in order to stay aligned with
the Linux driver:
* rx_ring.csum_good
* tx_ring.unmask_interrupt_num
Also rename the 'bad_csum' statistic name to 'csum_bad' for alignment.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
In order to use `ena_global_lock` in sysctl context, it must be kept
outside the driver instance's software context, as sysctls can be called
before attach and after detach, leading to lock use before sx_init and
after sx_destroy otherwise.
Solve this issue by turning `ena_global_lock` into a file scope
variable, shared between all instances of the driver and associated
sysctl context, and in turn initialized/destroyed in dedicated
SYSINIT/SYSUNINIT functions.
As a side effect, this change also fixes existing race in the reset
routine, when simultaneously accessing sysctl exposed properties.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
If LLQ is being used, `ena_tx_ctx.meta_valid` must stay enabled. This
fixes netmap support on latest generation ENA HW and aligns it with the
core driver behavior.
As netmap doesn't support any csum offloads, the
`adapter->disable_meta_caching` value can be simply passed to the HW.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Delegate RSS related functionality into separate .c/.h files in
preparation for the full RSS support.
While at it, reorder functions and remove prototypes for ones with
internal linkage.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
ENA silently assumed that ena_up, ena_down and ena_start_xmit routines
should be called within locked context. Driver's logic heavily assumes
on concurrent access to those routines, so for safety and better
documentation about this assumption, the locking assertions were added
to the above functions.
The assertion was added only for the main steps (skipping the helper
functions) which can be called from multiple places including the kernel
and the driver itself.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Stay aligned with the Linux driver by adding the following logs:
* inform the user about retrying queue creation
* warn on non-empty ena_tx_buffer.mbuf prior to ena_tx_map_mbuf
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Check for ENA_FLAG_TRIGGER_RESET inside a locked context in order to
avoid potential race conditions with ena_destroy_device. This aligns the
reset task logic with the Linux driver.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
All ena_com_prepare_tx errors other than ENA_COM_NO_MEM are fatal and
require device reset.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
In case of Low-latency Queue, one small enough descriptor can be pushed
directly to the ENA hw, thus saving one fragment. Check for this
condition before performing collapse.
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Some of the changes in this release:
* Large LLQ headers,
* Bug/stability fixes,
* Change of the README/Documentation.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Update the driver in order not to break its compilation
and make use of the new ENA logging system
Migrate platform code to the new logging system provided by ena_com
layer.
Make ENA_INFO the new default log level.
Remove all explicit use of `device_printf`, all new logs requiring one
of the log macros to be used.
IO queue related attributes are registered statically at driver attach
with the rest of the ENA specific sysctl nodes. However, the number of
queues can be changed at runtime via the `ena_sysctl_io_queues_nb`
request, leading to a potential exposure of attributes for non-existing
queues.
Introduce a new `ena_sysctl_update_queue_node_nb` function, which
updates the sysctl nodes after the number of queues is altered.
This happens by either registering or unregistering node specific oids,
based on a delta between the previous and current queue count.
NOTE: All unregistered oids must be registered again before the driver
detach, e.g. by another call to this function.
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Calling free on a NULL pointer is valid, as appropriate check is already
done internally:
/* free(NULL, ...) does nothing */
if (addr == NULL)
return;
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Default LLQ (Low-latency queue) maximum header size is 96 bytes and can
be too small for some types of packets - like IPv6 packets with multiple
extension. This can be fixed, by using large LLQ headers.
If the device supports larger LLQ headers, the user can activate this
feature by setting sysctl tunable 'hw.ena.force_large_llq_header' to '1'
in the /boot/loader.conf file.
In case the device isn't supporting this feature, the default value (96B)
will be used.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
According to man style(9), only C-style comments should be used.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
In the new ENA-based instances like c6gn, the vector table moved to a
new PCIe bar - BAR1. Previously it was always located on the BAR0, so
the resources were already allocated together with the registers.
As the FreeBSD isn't doing any resource allocation behind the scenes,
the driver is responsible to allocate them explicitly, before other
parts of the OS (like the PCI code allocating MSIx) will be able to
access them.
To determine dynamically BAR on which the MSIx vector table is present
the pci_msix_table_bar() is being used and the new BAR is allocated if
needed.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc
MFC after: 3 days
Some of the PCI ID were described as ENA with LLQ support - it's not
fully accurate and because of that, their names were changed.
Instead of LLQ, use RSERV0 for the description of those devices.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27119
The new HAL allows the driver to read extra ENI stats. Exact meaning of
each of them can be found in base/ena_defs/ena_admin_defs.h file and
structure ena_admin_eni_stats.
Those stats are being updated inside of the timer service, which is
executed every second.
ENI metrics are turned off by default. They can be enabled, using the
sysctl node: dev.ena.X.eni_metrics.update_delay
0 value in this node means that the update is turned off. Other values
determine how many seconds must pass, before ENI metrics will be
updated.
They can be acquired, using sysctl:
sysctl dev.ena.X.eni_metrics
Where X stands for the interface number.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27118
Refering to guide: https://wiki.freebsd.org/SPDX the SPDX tag should not
replace the standard license text, however it should be added over the
standard license text to make the automation easier.
Because of that, the old license was kept, but the SPDX tag was added
on top of every ENA driver file.
Submited by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27117
For the first descriptor in a chain the data may start at an offset.
It is optional feature of some devices, so the driver must ack that
it supports it.
The data pointer of the mbuf is simply shifted by the given value.
Submitted by: Maciej Bielski <mba@semihalf.com>
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27116
* Use the new API of ena_trace_*
* Fix typo syndrom --> syndrome
* Remove validation of the Rx req ID (already performed in the ena-com)
* Remove usage of deprecated ENA_ASSERT macro
Submitted by: Ido Segev <idose@amazon.com>
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27115
The latest generation hardware requires IO CQ (completion queue)
descriptors memory to be aligned to a 4K. It needs that feature for
the best performance.
Allocating unaligned descriptors will have a big performance impact as
the packet processing in a HW won't be optimized properly. For that
purpose adjust ena_dma_alloc() to support it.
It's a critical fix, especially for the arm64 EC2 instances.
Submitted by: Ido Segev <idose@amazon.com>
Obtained from: Amazon, Inc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27114
Networking is broken if the driver configures its (virtual) hardware to
use a hash algorithm (or a key) different from the one that the network
stack (software RSS) uses. This can be seen with connections initiated
from the host. The PCB will be placed into the hash table based on the
hash value calculated by the software. The hardware-calculated hash
value in reponse packets will be different, so the PCB won't be found.
Tested with a kernel compiled with 'options RSS' on an instance with ena
driver.
Reviewed by: mw, adrian
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D24733
Driver version upgrade is connected with support for the new device
fetures, like Tx drops reporting or disabling meta caching.
Moreover, the driver configuration from the sysctl was reworked to
provide safer and better flow for configuring:
* number of IO queues (new feature),
* drbr size on Tx,
* Rx queue size.
Moreover, a lot of minor bug fixes and improvements were added.
Copyright date in the license of the modified files in this release was
updated to 2020.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
There is no guarantee from bus_dmamap_load_mbuf_sg() for matching
mbuf chain segments to dma physical segments.
This patch ensure correctly mapping to LLQ header and DMA segments.
Submitted by: Ido Segev <idose@amazon.com>
Obtained from: Amazon, Inc.
There is ena_free_all_io_rings_resources() called twice on device
detach:
ena_detach():
ena_destroy_device():
/* First call */
ena_free_all_io_rings_resources()
/* Second call */
ena_free_all_io_rings_resources()
The double-free causes panic() on kldunload, for example.
As the ena_destroy_device() is also called by ena_reset_task() it is
better to stay unchanged. Thus, remove the "Second call" of the function.
Submitted by: Maciej Bielski <mba@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Determined by a flag passed from the device. No metadata is set within
ena_tx_csum when caching is disabled.
Submitted by: Maciej Bielski <mba@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
If requested size of IO queues is not supported try to decrease it until
finding the highest value that can be satisfied.
Submitted by: Maciej Bielski <mba@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
By default, in ena_attach() the driver attempts to acquire
ena_adapter::max_num_io_queues MSI-X vectors for the purpose of IO
queues, however this is not guaranteed. The number of vectors acquired
depends also on system resources availability.
Regardless of that, enable the number of effectively used IO queues to
be further limited through the sysctl node.
Example: Assumming that there are 8 IO queues configured by default, the
command
$ sysctl dev.ena.0.io_queues_nb=4
will reduce the number of available IO queues to 4. Similarly, the value
can be also increased up to maximum supported value. A value higher than
maximum supported number of IO queues is ignored. Zero is ignored too.
Submitted by: Maciej Bielski <mba@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Make the ena_adapter::num_io_queues a number of effectively used IO
queues. While the ena_adapter::max_num_io_queues is an upper-bound
specified by the HW, the ena_adapter::num_io_queues may be lower than
that, depending on runtime system resources availability.
On reset, there are called ena_destroy_device() and then
ena_restore_device(). The latter calls, in turn, ena_enable_msix(),
which will attempt to re-acquire ena_adapter::max_num_io_queues of
MSIX vectors again.
Thus, the value of ena_adapter::num_io_queues may be different before
and after reset. For this reason, free the IO rings structures (drbr,
counters) in ena_destroy_device() and allocate again in
ena_restore_device().
Submitted by: Maciej Bielski <mba@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
This method has been aligned with the way how the Rx queue size is being
updated - so it's now done synchronously instead of resetting the
device.
Moreover, the input parameter is now being validated if it's a power of
2. Without this, it can cause kernel panic.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
This patch reworks how the Rx queue size is being reconfigured and how
the information from the device is being processed.
Reconfiguration of the queues and reset of the device in order to make
the changes alive isn't the best approach. It can be done synchronously
and it will let to pass information if the reconfiguration was
successful to the user. It now is done in the ena_update_queue_size()
function.
To avoid reallocation of the ring buffer, statistic counters and the
reinitialization of the mutexes when only new size has to be assigned,
the io queues initialization function has been split into 2 stages:
basic, which is just copying appropriate fields and the advanced, which
allocates and inits more advanced structures for the IO rings.
Moreover, now the max allowed Rx and Tx ring size is being kept
statically in the adapter and the size of the variables holding those
values has been changed to uint32_t everywhere.
Information about IO queues size is now being logged in the up routine
instead of the attach.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Recent changes to the epoch requires driver to notify that they knows
epoch in order to prevent input packet function to enter epoch each
time the packet is received.
ENA is using NET_TASK for handling Rx, so it's entering epoch
automatically whenever this task is being executed.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
If the conditional check for ENA_FLAG_DEV_UP is negated, the body of the
function can have smaller indentation and it makes the code cleaner.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
As functions which are declared in the header files are intended to be
the interface and are going to be used by other files, it's better to
include argument names in the definition, so the caller won't have to
check the .c file in order to check their meaning and order.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Currently, the driver had 2 global locks - one was sx lock used for
up/down synchronization and the second one was mutex, which was used
for link configuration and timer service callout.
It is better to have single lock for that. We cannot use mutex, as it
can sleep and cause witness errors in up/down configuration, so sx lock
seems to be the only choice.
Callout cannot use sx lock, but the timer service is MP safe, so we just
need to avoid race between ena_down() and ena_detach(). It can be
avoided by acquiring sx lock.
Simple macros were added that are encapsulating implementation of the
lock and makes the code cleaner.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
As the reset triggering is no longer a simple macro that was just
setting appropriate flag, the new function for triggering reset was
added. It improves code readability a lot, as we are avoiding additional
indentation.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
The function ena_enable_msix_and_set_admin_interrupts takes two
arguments while the second is not used and so can be spared. This is a
static function, only ena.c is affected.
Submitted by: Maciej Bielski <mba@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Tx drops statistics are fetched from HW every ena_keepalive_wd() call
and are observable using one of the commands:
* sysctl dev.ena.0.hw_stats.tx_drops
* netstat -I ena0 -d
Submitted by: Maciej Bielski <mba@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
* Removed adaptive interrupt moderation (not suported on FreeBSD).
* Use ena_com_free_q_entries instead of ena_com_free_desc.
* Don't use ENA_MEM_FREE outside of the ena_com.
* Don't use barriers before calling doorbells as it's already done in
the HAL.
* Add function that generates random RSS key, common for all driver's
interfaces.
* Change admin stats sysctls to U64.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Sometimes, especially when there is not much memory in the system left,
allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
and it is not guaranteed that it'll succeed. In that situation, the
fallback will work, but if the refill needs to take a place for a lot of
descriptors at once, the time spent in m_getjcl looking for memory can
cause system unresponsiveness due to high priority of the Rx task. This
can also lead to driver reset, because Tx cleanup routine is being
blocked and timer service could detect that Tx packets aren't cleaned
up. The reset routine can further create another unresponsiveness - Rx
rings are being refilled there, so m_getjcl will again burn the CPU.
This was causing NVMe driver timeouts and resets, because network driver
is having higher priority.
Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
more space than 9KB will be needed.
However, 9KB jumbo clusters can still cause issues, so by default the
page size mbuf cluster will be used for the Rx descriptors. This can have a
small (~2%) impact on the throughput of the device, so to restore
original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to
"1" in "/boot/loader.conf" file.
As a part of this patch (important fix), the version of the driver
was updated to v2.1.2.
Submitted by: cperciva
Reviewed by: Michal Krawczyk <mk@semihalf.com>
Reviewed by: Ido Segev <idose@amazon.com>
Reviewed by: Guy Tzalik <gtzalik@amazon.com>
MFC after: 3 days
PR: 225791, 234838, 235856, 236989, 243531
Differential Revision: https://reviews.freebsd.org/D24546
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
Driver working in LLQ mode in some cases can send only few last segments
of the mbuf using DMA engine, and the rest of them are sent to the
device using direct PCI transaction. To map the only necessary data, two DMA
maps were used. That solution was very rough and was causing a bug - if
both maps were used (head_map and seg_map), there was a race in between
two flows on two queues and the device was receiving corrupted
data which could be further received on the other host if the Tx cksum
offload was enabled.
As it's ok to map whole mbuf and then send to the device only needed
segments, the design was simplified to use only single DMA map.
The driver version was updated to v2.1.1 as it's important bug fix.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.