Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests. If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.
In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows to use NVMe disk on non-DMA coherent arm64 system.
MFC after: 3 weeks
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.
Differential Revision: https://reviews.freebsd.org/D20999
Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on. This will simplify code and make it easier to split
submission and completion queues in the future.
Signed-off-by: John Meneghini <johnm@netapp.com>
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestions.
- Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
The NVMe standard (1.4) states
>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.
However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.
MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.
Retried commands can indicate a performance degredation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands an interrupts.
The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.
Reviewed by: rpokala
Sponsored by: Netflix
Differential Version: https://reviews.freebsd.org/D20988
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.
Reviewed by: jhb and markj (both prior versions)
Differential Revision: https://reviews.freebsd.org/D20478
retries.
When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.
Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
Bump __FreeBSD_version to 1200081 due to API change
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.
Prior to this change, when running 'make buildworld -j72' on a 18-core
(72-thread) POWER9, I would see controller resets often. With this change, I
don't see these resets messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix. Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.
Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570
dma_tag_payload should not be destroyed before payload_dma_map, and seems
it should be used there instead of dma_tag to match creation.
Sponsored by: iXsystems, Inc.
On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.
This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.
PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.
NVMe is now working on powerpc64 host.
Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.
Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninialized data.
Sponsored by: Netflix
Submission and completion queue memory need to use a
separate DMA tag for mappings than payload buffers,
to ensure mappings remain contiguous even with DMAR
enabled.
Submitted by: kib
MFC after: 1 week
Sponsored by: Intel
max transfer size. This guards against rogue commands coming in from
userspace.
Also add KASSERTS for the virtual address and unmapped bio cases, if the
transfer size exceeds the controller's max transfer size.
Sponsored by: Intel
MFC after: 3 days
Also allow admin commands to transfer up to this maximum I/O size, rather
than the artificial limit previously imposed. The larger I/O size is very
beneficial for upcoming firmware download support. This has the added
benefit of simplifying the code since both admin and I/O commands now use
the same maximum I/O size.
Sponsored by: Intel
MFC after: 3 days
NULL. This simplifies decisions around if/how requests are routed through
busdma. It also paves the way for supporting unmapped bios.
Sponsored by: Intel
start or reset. Also add a notifier for NVMe consumers for controller fail
conditions and plumb this notifier for nvd(4) to destroy the associated
GEOM disks when a failure occurs.
This requires a bit of work to cover the races when a consumer is sending
I/O requests to a controller that is transitioning to the failed state. To
help cover this condition, add a task to defer completion of I/Os submitted
to a failed controller, so that the consumer will still always receive its
completions in a different context than the submission.
Sponsored by: Intel
Reviewed by: carl