Commit Graph

74 Commits

Author SHA1 Message Date
Michal Meloun
0a4b14e8cc Properly synchronize completion DMA buffers.
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows to use NVMe disk on non-DMA coherent arm64 system.

MFC after:	3 weeks
2019-12-15 14:28:38 +00:00
Warner Losh
7588c6cc36 Move to using bool instead of boolean_t
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999
2019-12-13 18:35:48 +00:00
Warner Losh
43393e8b2c trackers always know what qpair they are on
Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on.  This will simplify code and make it easier to split
submission and completion queues in the future.

Signed-off-by: John Meneghini <johnm@netapp.com>
2019-12-06 22:12:39 +00:00
Alexander Motin
1eab19cbec Make nvme(4) driver some more NUMA aware.
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
 - Allocate most of queue pair memory from the domain it is bound to.
 - Bind callouts to the same CPUs as queue pair to avoid migrations.
 - Do not assign queue pairs to each SMT thread.  It just wasted
resources and increased lock congestions.
 - Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
 - If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-23 17:53:47 +00:00
Warner Losh
f93b7f954e Support doorbell strides != 0.
The NVMe standard (1.4) states

>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ...  For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.

However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.

MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
2019-09-04 20:08:36 +00:00
Alexander Motin
71a2818142 Improve NVMe hot unplug handling.
If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back.  If there is a detach call in such state,
just stop all activity and free resources.  If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-08-21 20:17:30 +00:00
Alexander Motin
a6d222eb68 Add more random bits from NVMe 1.4.
MFC after:	2 weeks
2019-08-03 02:36:35 +00:00
Alexander Motin
90dfa8f0ac Add more new fields and values from NVMe 1.4.
MFC after:	2 weeks
2019-08-02 03:43:24 +00:00
Warner Losh
5e83c2ffaa Keep track of the number of commands that exhaust their retry limit.
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.
2019-07-19 18:39:24 +00:00
Warner Losh
c37fc318c4 Keep track of the number of retried commands.
Retried commands can indicate a performance degredation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands an interrupts.
2019-07-19 18:39:18 +00:00
Warner Losh
c75bdc044d Provide new tunable hw.nvme.verbose_cmd_dump
The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.

Reviewed by: rpokala
Sponsored by: Netflix
Differential Version: https://reviews.freebsd.org/D20988
2019-07-18 21:58:51 +00:00
Warner Losh
d0aaeffdb4 Since a fatal trap can happen at aribtrary times, don't panic when the
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.

Reviewed by: jhb and markj (both prior versions)
Differential Revision:  https://reviews.freebsd.org/D20478
2019-06-01 15:37:44 +00:00
Warner Losh
2ffd6fce5b Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
2019-03-09 01:18:16 +00:00
Warner Losh
95108cadbc Add ABORTED_BY_REQUEST to the list of things we look at DNR bit and tell why to comment (code already does this) 2019-03-03 03:36:33 +00:00
Warner Losh
45d7e233a5 Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
2019-02-27 22:16:59 +00:00
Warner Losh
d706306d49 Remove #ifdef code to support FreeBSD versions that haven't been
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.
2019-02-27 22:05:01 +00:00
Alexander Motin
a646135771 Add descriptions to NVMe interrupts.
MFC after:	1 month
2018-12-26 23:41:52 +00:00
Chuck Tuffli
9544e6dcf1 Make NVMe compatible with the original API
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).

This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.

This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.

Bump __FreeBSD_version to 1200081 due to API change

Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
2018-08-22 04:29:24 +00:00
Justin Hibbits
2e0090af65 nvme(4): Add bus_dmamap_sync() at the end of the request path
Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.

Prior to this change, when running 'make buildworld -j72' on a 18-core
(72-thread) POWER9, I would see controller resets often.  With this change, I
don't see these resets messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix.  Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.

Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570
2018-08-03 20:04:06 +00:00
Alexander Motin
c6c70c0746 Fix use-after-free in nvme_qpair_destroy().
dma_tag_payload should not be destroyed before payload_dma_map, and seems
it should be used there instead of dma_tag to match creation.

Sponsored by:	iXsystems, Inc.
2018-04-30 21:28:10 +00:00
Warner Losh
d85d964829 Try polling the qpairs on timeout.
On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.

This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.

PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
2018-03-16 05:23:48 +00:00
Alexander Motin
6b1a96b16b Add new opcodes and statuses from NVMe 1.3a.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-03-11 06:30:09 +00:00
Wojciech Macek
0d787e9b35 NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
Pedro F. Giffuni
718cf2ccb9 sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 14:52:40 +00:00
Warner Losh
519772814d Add CAM/NVMe support for CAM_DATA_SG
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
2017-08-29 15:29:57 +00:00
Warner Losh
824073fbd6 Avoid dereferencing unintialized elements in the error path.
Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninialized data.

Sponsored by: Netflix
2017-03-07 23:06:41 +00:00
Scott Long
a965389b5a Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safery belts and error handling in related codepaths.

Reviewed by:	jimharris
Obtained from:	Netflix
Differential Revision:	D8453
2016-11-08 00:24:49 +00:00
Jim Harris
e5af5854ff nvme: do not pre-allocate MSI-X IRQ resources
The issue referenced here was resolved by other changes
in recent commits, so this code is no longer needed.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:11:31 +00:00
Jim Harris
3345ed9a55 nvme: use BUS_SPACE_MAXSIZE for bus_dma_tag_create maxsize parameter
This fixes i386 PAE build fallout from r281281.

Reported by:	bz
MFC after:	1 week
2015-04-09 00:37:55 +00:00
Jim Harris
36b0e4ee1f nvme: remove CHATHAM related code
Chatham was an internal NVMe prototype board used for
early driver development.

MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:52:06 +00:00
Jim Harris
a6e3096392 nvme: create separate DMA tag for non-payload DMA buffers
Submission and completion queue memory need to use a
separate DMA tag for mappings than payload buffers,
to ensure mappings remain contiguous even with DMAR
enabled.

Submitted by:	kib
MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:49:45 +00:00
Jim Harris
f42ca756b9 nvme: Allocate all MSI resources up front so that we can fall back to
INTx if necessary.

Sponsored by:	Intel
MFC after:	3 days
2014-03-18 18:10:35 +00:00
Jim Harris
1416ef361e nvme: NVMe specification dictates 4-byte alignment for PRPs (not 8).
Sponsored by:	Intel
MFC after:	3 days
2014-03-17 22:37:17 +00:00
Jim Harris
e9efbc134f Update copyright dates.
MFC after:	3 days
2013-07-09 21:22:17 +00:00
Jim Harris
bbd412dd05 Remove remaining uio-related code.
The nvme_physio() function was removed quite a while ago, which was the
only user of this uio-related code.

Sponsored by:	Intel
MFC after:	3 days
2013-06-26 23:37:11 +00:00
Jim Harris
7b68ae1e5e Fail any passthrough command whose transfer size exceeds the controller's
max transfer size.  This guards against rogue commands coming in from
userspace.

Also add KASSERTS for the virtual address and unmapped bio cases, if the
transfer size exceeds the controller's max transfer size.

Sponsored by:	Intel
MFC after:	3 days
2013-06-26 23:32:45 +00:00
Jim Harris
8d09e3c400 Use MAXPHYS to specify the maximum I/O size for nvme(4).
Also allow admin commands to transfer up to this maximum I/O size, rather
than the artificial limit previously imposed.  The larger I/O size is very
beneficial for upcoming firmware download support.  This has the added
benefit of simplifying the code since both admin and I/O commands now use
the same maximum I/O size.

Sponsored by:	Intel
MFC after:	3 days
2013-06-26 23:27:17 +00:00
Jim Harris
ca269f32ef Move the busdma mapping functions to nvme_qpair.c.
This removes nvme_uio.c completely.

Sponsored by:	Intel
2013-04-12 17:48:45 +00:00
Jim Harris
e2b9900498 Do not panic when a busdma mapping operation fails.
Instead, print an error message and fail the associated command with
DATA_TRANSFER_ERROR NVMe completion status.

Sponsored by:	Intel
2013-04-12 17:34:49 +00:00
Jim Harris
5fdf9c3c8e Add unmapped bio support to nvme(4) and nvd(4).
Sponsored by:	Intel
2013-04-01 16:23:34 +00:00
Jim Harris
1e526bc478 Add "type" to nvme_request, signifying if its payload is a VADDR, UIO, or
NULL. This simplifies decisions around if/how requests are routed through
busdma.  It also paves the way for supporting unmapped bios.

Sponsored by:	Intel
2013-03-29 20:34:28 +00:00
Jim Harris
bdd1fd402c Fix printf format issue on i386.
Reported by:	bz
2013-03-27 00:37:00 +00:00
Jim Harris
547d523eb8 Clean up debug prints.
1) Consistently use device_printf.
2) Make dump_completion and dump_command into something more
    human-readable.

Sponsored by:	Intel
Reviewed by:	carl
2013-03-26 22:17:10 +00:00
Jim Harris
237d2019e5 Change a number of malloc(9) calls to use M_WAITOK instead of
M_NOWAIT.

Sponsored by:	Intel
Suggested by:	carl
Reviewed by:	carl
2013-03-26 22:11:34 +00:00
Jim Harris
43a3725688 Abort and do not retry any outstanding admin commands left over after
a controller reset.

Sponsored by:	Intel
Reviewed by:	carl
2013-03-26 22:06:05 +00:00
Jim Harris
232e2edb6c Add the ability to internally mark a controller as failed, if it is unable to
start or reset.  Also add a notifier for NVMe consumers for controller fail
conditions and plumb this notifier for nvd(4) to destroy the associated
GEOM disks when a failure occurs.

This requires a bit of work to cover the races when a consumer is sending
I/O requests to a controller that is transitioning to the failed state.  To
help cover this condition, add a task to defer completion of I/Os submitted
to a failed controller, so that the consumer will still always receive its
completions in a different context than the submission.

Sponsored by:	Intel
Reviewed by:	carl
2013-03-26 21:58:38 +00:00
Jim Harris
3d7eb41c1b Just disable the controller instead of deleting IO queues during detach.
This is just as effective, and removes the need for a bunch of admin commands
to a controller that's going to be disabled shortly anyways.

Sponsored by:	Intel
Reviewed by:	carl
2013-03-26 21:48:41 +00:00
Jim Harris
cb5b7c1304 Cap the number of retry attempts to a configurable number. This ensures
that if a specific I/O repeatedly times out, we don't retry it indefinitely.

The default number of retries will be 4, but is adjusted using hw.nvme.retry_count.

Sponsored by:	Intel
Reviewed by:	carl
2013-03-26 21:14:51 +00:00
Jim Harris
cf81529ce3 Create struct nvme_status.
NVMe error log entries include status, so breaking this out into
its own data structure allows it to be included in both the
nvme_completion data structure as well as error log entry data
structures.

While here, expose nvme_completion_is_error(), and change all of
the places that were explicitly looking at sc/sct bits to use this
macro instead.

Sponsored by:	Intel
Reviewed by:	carl
2013-03-26 21:00:18 +00:00
Jim Harris
f37c22a3bd Make nvme_ctrlr_reset a nop if a reset is already in progress.
This protects against cases where a controller crashes with multiple
I/O outstanding, each timing out and requesting controller resets
simultaneously.

While here, remove a debugging printf from a previous commit, and add
more logging around I/O that need to be resubmitted after a controller
reset.

Sponsored by:	Intel
Reviewed by:	carl
2013-03-26 20:56:58 +00:00