85 Commits

Author SHA1 Message Date
Warner Losh
9600aa31aa nvme: use NVME_GONE rather than hard-coded 0xffffffff
Make it clearer that the value 0xffffffff is being used to detect that the
device is gone. We use it in other places in the driver for other meanings.
2021-02-08 13:08:48 -07:00
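A read from a device that is no longer present on the bus typically returns all ones, which is what the named constant stands for. The following is a minimal sketch of the idea, not the driver's code: the value of `NVME_GONE` is assumed from the commit subject, and `device_is_gone()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Name taken from the commit subject; the exact definition is an assumption. */
#define NVME_GONE 0xffffffffu

/*
 * Hypothetical helper: report whether a 32-bit register read looks like
 * it came from a device that has been removed (all bits set).
 */
static bool
device_is_gone(uint32_t reg_value)
{
	return (reg_value == NVME_GONE);
}
```

Using the named constant at the comparison site keeps "device is gone" distinct from other uses of the same bit pattern.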
Warner Losh
082905cad1 nvme: Remove a wmb() that's not necessary.
bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can
be DMA'd immediately after it returns. The details differ, but this
mirrors atomic thread release semantics, at least for the buffers
synced.

For non-x86 platforms, bus_dmamap_sync() has the right syncing and
fences. So in the past, wmb() had been omitted for them.

For x86 platforms, the memory ordering is already strong enough to
ensure DMA to the device sees the current contents. As such, we don't
need the wmb() here. It translates to an sfence which is only needed
for writes to regions that have the write combining attribute set or
when some exotic opcodes are used. The nvme driver does neither of
these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we
can be assured any optimizer won't reorder the bus_dmamap_sync and the
bus_space_write operations. The wmb() was a vestige of the pre-busdma
version initially committed to the tree.

Reviewed by: kib@, gallatin@, chuck@, mav@
Differential Revision: https://reviews.freebsd.org/D27448
2020-12-04 21:34:48 +00:00
Michal Meloun
8f9d5a8dbf NVME: Multiple busdma related fixes.
- In nvme_qpair_process_completions(), do the DMA sync before the completion
  buffer is used.
- In nvme_qpair_submit_tracker(), don't do an explicit wmb() on arm and arm64
  either. bus_dmamap_sync() on these architectures is sufficient to ensure
  that all CPU stores are visible to external (including DMA) observers.
- Allocate the completion buffer as BUS_DMA_COHERENT. On non-DMA-coherent
  systems, buffers continuously owned (and accessed) by DMA must be allocated
  with this flag. Note that the BUS_DMA_COHERENT flag is a no-op on
  DMA-coherent systems (or coherent buses in mixed systems).

MFC after:	4 weeks
Reviewed by:	mav, imp
Differential Revision: https://reviews.freebsd.org/D27446
2020-12-02 16:54:24 +00:00
Chuck Tuffli
8d08cdc721 nvme: Fix typo in definition
Change occurrences of "selt test" to "self test" in the NVMe header
file.

Reviewed by:	imp, mav
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D27439
2020-12-02 15:59:08 +00:00
Alexander Motin
ac90f70d1e Increase nvme(4) maximum transfer size from 1MB to 2MB.
With a 4KB page size, 2MB is the maximum we can address with a one-page PRP.
Going further would require chaining, which would add some more complexity.

On the other hand, to reduce memory consumption, allocate the PRP memory
respecting the maximum transfer size reported in the controller identify data.
Many NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs.  The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.

Reviewed by:	imp
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-11-29 00:20:31 +00:00
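The 2MB figure in the commit above follows from simple arithmetic: each PRP entry is a 64-bit address, so one 4KB page holds 512 entries, and each entry points at one 4KB page of data. A small sketch of that calculation, with `prp_single_page_max_xfer()` as a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Each PRP entry is a 64-bit physical address. */
#define PRP_ENTRY_SIZE sizeof(uint64_t)

/*
 * Hypothetical helper: bytes addressable through a single page of PRP
 * entries, where every entry points at one page of payload.
 */
static size_t
prp_single_page_max_xfer(size_t page_size)
{
	size_t entries = page_size / PRP_ENTRY_SIZE;

	return (entries * page_size);
}
```

With `page_size` set to 4096 this gives 512 entries times 4096 bytes. Anything larger would need a second, chained PRP page.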
Mateusz Guzik
d87b31e159 nvme: clean up empty lines in .c and .h files 2020-09-01 22:03:10 +00:00
Mark Johnston
96ad26eefb Remove free_domain() and uma_zfree_domain().
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937
2020-08-04 13:58:36 +00:00
Alexander Motin
ead7e10308 Make polled request timeout less invasive.
Instead of panicking after one second of polling, let the normal timeout
handler activate, reset the controller and abort the outstanding
requests.  If all of that does not happen within 10 seconds, then something
in the driver is likely badly stuck and panic is the only way out.

In particular, this fixes device hot unplug during execution of those
polled commands, allowing a clean device detach instead of a panic.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-18 19:16:03 +00:00
Alexander Motin
550d5d64fe Fix admin qpair leak if detached during initial reset.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-17 17:51:40 +00:00
David Bright
4053f8ac4d Fix various Coverity-detected errors in nvme driver
This fixes several Coverity-detected errors in the nvme driver.

CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980

Reviewed by:	imp@, vangyzen@
MFC after:	5 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D24532
2020-05-02 20:47:58 +00:00
Ed Maste
aeb665b538 remove extraneous double ;s in sys/ 2020-03-30 16:04:25 +00:00
Michal Meloun
0a4b14e8cc Properly synchronize completion DMA buffers.
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows NVMe disks to be used on non-DMA-coherent arm64 systems.

MFC after:	3 weeks
2019-12-15 14:28:38 +00:00
Warner Losh
7588c6cc36 Move to using bool instead of boolean_t
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed as int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999
2019-12-13 18:35:48 +00:00
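One of the subtle differences mentioned in the commit above is that conversion to `bool` normalizes any nonzero value to 1, while an int-based boolean keeps the original value. The sketch below illustrates this under the assumption that `boolean_t` is a typedef for `int`; the `int_boolean` typedef and both helper functions are hypothetical stand-ins.

```c
#include <stdbool.h>

/* Stand-in for an int-based boolean type (assumed to match boolean_t). */
typedef int int_boolean;

/* Storing a masked flag in an int-based boolean keeps the raw bit value. */
static int
flag_as_int_boolean(int flags)
{
	int_boolean b = flags & 0x100;

	return (b);
}

/* Storing the same expression in a bool normalizes it to 0 or 1. */
static int
flag_as_bool(int flags)
{
	bool b = flags & 0x100;

	return (b);
}
```

Code that compares the result against `TRUE` or passes it to an `int` parameter can observe the difference, which is why a few `TRUE` and `FALSE` uses were preserved.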
Warner Losh
43393e8b2c trackers always know what qpair they are on
Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on.  This will simplify code and make it easier to split
submission and completion queues in the future.

Signed-off-by: John Meneghini <johnm@netapp.com>
2019-12-06 22:12:39 +00:00
Alexander Motin
1eab19cbec Make nvme(4) driver some more NUMA aware.
 - For each queue pair, precalculate the CPU and domain it is bound to.
   If queue pairs are not per-CPU, then use the domain of the device.
 - Allocate most of the queue pair memory from the domain it is bound to.
 - Bind callouts to the same CPUs as the queue pair to avoid migrations.
 - Do not assign queue pairs to each SMT thread.  It just wasted
   resources and increased lock contention.
 - Remove the fixed multiplier of CPUs per queue pair; spread them evenly.
   This allows more queue pairs to be used in some hardware configurations.
 - If a queue pair serves multiple CPUs, bind different NVMe devices to
   different CPUs.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-23 17:53:47 +00:00
Warner Losh
f93b7f954e Support doorbell strides != 0.
The NVMe standard (1.4) states

>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ...  For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.

However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.

MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
2019-09-04 20:08:36 +00:00
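The NVMe specification places the submission queue y tail doorbell at 0x1000 + (2y) * (4 << CAP.DSTRD) and the completion queue y head doorbell at 0x1000 + (2y + 1) * (4 << CAP.DSTRD). The sketch below computes both offsets once so they can be stored per queue pair and simply loaded on the fast path. The struct and function names are hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/* Offset of the first doorbell register, per the NVMe specification. */
#define DOORBELL_BASE 0x1000u

/* Hypothetical per-queue-pair cache of the two doorbell offsets. */
struct qpair_doorbells {
	uint32_t sq_tdbl_off;	/* submission queue tail doorbell */
	uint32_t cq_hdbl_off;	/* completion queue head doorbell */
};

/*
 * Compute the doorbell offsets for queue 'qid' given the doorbell stride
 * field 'dstrd' from the controller capabilities register.
 */
static struct qpair_doorbells
doorbell_offsets(uint32_t qid, uint32_t dstrd)
{
	uint32_t stride = 4u << dstrd;
	struct qpair_doorbells db;

	db.sq_tdbl_off = DOORBELL_BASE + (2 * qid) * stride;
	db.cq_hdbl_off = DOORBELL_BASE + (2 * qid + 1) * stride;
	return (db);
}
```

With a stride field of 0 the doorbells are 4 bytes apart; with a stride field of 1 they are 8 bytes apart, matching the hardware described in the commit message.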
Alexander Motin
71a2818142 Improve NVMe hot unplug handling.
If the device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back.  If there is a detach call in such a state,
just stop all activity and free resources.  If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it triggers a controller reset, do not wait for the impossible
and report failure quickly.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-08-21 20:17:30 +00:00
Alexander Motin
a6d222eb68 Add more random bits from NVMe 1.4.
MFC after:	2 weeks
2019-08-03 02:36:35 +00:00
Alexander Motin
90dfa8f0ac Add more new fields and values from NVMe 1.4.
MFC after:	2 weeks
2019-08-02 03:43:24 +00:00
Warner Losh
5e83c2ffaa Keep track of the number of commands that exhaust their retry limit.
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps gauge the magnitude of the problem.
2019-07-19 18:39:24 +00:00
Warner Losh
c37fc318c4 Keep track of the number of retried commands.
Retried commands can indicate a performance degradation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like the
number of commands and interrupts.
2019-07-19 18:39:18 +00:00
Warner Losh
c75bdc044d Provide new tunable hw.nvme.verbose_cmd_dump
The nvme driver dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.

Reviewed by: rpokala
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20988
2019-07-18 21:58:51 +00:00
Warner Losh
d0aaeffdb4 Since a fatal trap can happen at arbitrary times, don't panic when the
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.

Reviewed by: jhb and markj (both prior versions)
Differential Revision:  https://reviews.freebsd.org/D20478
2019-06-01 15:37:44 +00:00
Warner Losh
2ffd6fce5b Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
2019-03-09 01:18:16 +00:00
Warner Losh
95108cadbc Add ABORTED_BY_REQUEST to the list of statuses for which we look at the DNR bit, and explain why in the comment (the code already does this) 2019-03-03 03:36:33 +00:00
Warner Losh
45d7e233a5 Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), so remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
2019-02-27 22:16:59 +00:00
Warner Losh
d706306d49 Remove #ifdef code to support FreeBSD versions that haven't been
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.
2019-02-27 22:05:01 +00:00
Alexander Motin
a646135771 Add descriptions to NVMe interrupts.
MFC after:	1 month
2018-12-26 23:41:52 +00:00
Chuck Tuffli
9544e6dcf1 Make NVMe compatible with the original API
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).

This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.

This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.

Bump __FreeBSD_version to 1200081 due to API change

Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
2018-08-22 04:29:24 +00:00
Justin Hibbits
2e0090af65 nvme(4): Add bus_dmamap_sync() at the end of the request path
Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.

Prior to this change, when running 'make buildworld -j72' on an 18-core
(72-thread) POWER9, I would see controller resets often.  With this change, I
don't see these reset messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix.  Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.

Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570
2018-08-03 20:04:06 +00:00
Alexander Motin
c6c70c0746 Fix use-after-free in nvme_qpair_destroy().
dma_tag_payload should not be destroyed before payload_dma_map, and it seems
it should be used there instead of dma_tag to match creation.

Sponsored by:	iXsystems, Inc.
2018-04-30 21:28:10 +00:00
Warner Losh
d85d964829 Try polling the qpairs on timeout.
On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.

This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.

PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
2018-03-16 05:23:48 +00:00
Alexander Motin
6b1a96b16b Add new opcodes and statuses from NVMe 1.3a.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-03-11 06:30:09 +00:00
Wojciech Macek
0d787e9b35 NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
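The shift-and-mask approach described above can be illustrated with the 16-bit completion status word, whose layout in the NVMe specification is: phase tag in bit 0, status code in bits 1 to 8, status code type in bits 9 to 11, and Do Not Retry in bit 15. This is a sketch only; the macro and function names are illustrative and are not claimed to match the driver's headers.

```c
#include <stdint.h>

/* Illustrative shift/mask pairs for fields of the 16-bit status word. */
#define STATUS_P_SHIFT		0
#define STATUS_P_MASK		0x1u
#define STATUS_SC_SHIFT		1
#define STATUS_SC_MASK		0xffu
#define STATUS_SCT_SHIFT	9
#define STATUS_SCT_MASK		0x7u
#define STATUS_DNR_SHIFT	15
#define STATUS_DNR_MASK		0x1u

/* Extract one field from a status word already in host byte order. */
#define STATUS_GET(status, field) \
	(((status) >> STATUS_##field##_SHIFT) & STATUS_##field##_MASK)

/*
 * Decode a little-endian 16-bit value from raw bytes.  The result is the
 * same on big and little endian hosts, unlike a C bit-field overlay.
 */
static uint16_t
le16_decode(const uint8_t *p)
{
	return ((uint16_t)(p[0] | (p[1] << 8)));
}
```

Because the fields are extracted arithmetically from a value with a defined byte order, the compiler's bit-field layout rules never come into play.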
Pedro F. Giffuni
718cf2ccb9 sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
2017-11-27 14:52:40 +00:00
Warner Losh
519772814d Add CAM/NVMe support for CAM_DATA_SG
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
2017-08-29 15:29:57 +00:00
Warner Losh
824073fbd6 Avoid dereferencing uninitialized elements in the error path.
Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninitialized data.

Sponsored by: Netflix
2017-03-07 23:06:41 +00:00
Scott Long
a965389b5a Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safety belts and error handling in related codepaths.

Reviewed by:	jimharris
Obtained from:	Netflix
Differential Revision:	D8453
2016-11-08 00:24:49 +00:00
Jim Harris
e5af5854ff nvme: do not pre-allocate MSI-X IRQ resources
The issue referenced here was resolved by other changes
in recent commits, so this code is no longer needed.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:11:31 +00:00
Jim Harris
3345ed9a55 nvme: use BUS_SPACE_MAXSIZE for bus_dma_tag_create maxsize parameter
This fixes i386 PAE build fallout from r281281.

Reported by:	bz
MFC after:	1 week
2015-04-09 00:37:55 +00:00
Jim Harris
36b0e4ee1f nvme: remove CHATHAM related code
Chatham was an internal NVMe prototype board used for
early driver development.

MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:52:06 +00:00
Jim Harris
a6e3096392 nvme: create separate DMA tag for non-payload DMA buffers
Submission and completion queue memory needs to use a
different DMA tag for its mappings than payload buffers,
to ensure mappings remain contiguous even with DMAR
enabled.

Submitted by:	kib
MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:49:45 +00:00
Jim Harris
f42ca756b9 nvme: Allocate all MSI resources up front so that we can fall back to
INTx if necessary.

Sponsored by:	Intel
MFC after:	3 days
2014-03-18 18:10:35 +00:00
Jim Harris
1416ef361e nvme: NVMe specification dictates 4-byte alignment for PRPs (not 8).
Sponsored by:	Intel
MFC after:	3 days
2014-03-17 22:37:17 +00:00
Jim Harris
e9efbc134f Update copyright dates.
MFC after:	3 days
2013-07-09 21:22:17 +00:00
Jim Harris
bbd412dd05 Remove remaining uio-related code.
The nvme_physio() function was removed quite a while ago, which was the
only user of this uio-related code.

Sponsored by:	Intel
MFC after:	3 days
2013-06-26 23:37:11 +00:00
Jim Harris
7b68ae1e5e Fail any passthrough command whose transfer size exceeds the controller's
max transfer size.  This guards against rogue commands coming in from
userspace.

Also add KASSERTS for the virtual address and unmapped bio cases, if the
transfer size exceeds the controller's max transfer size.

Sponsored by:	Intel
MFC after:	3 days
2013-06-26 23:32:45 +00:00
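The guard described in the commit above amounts to a bounds check before a passthrough request is turned into a command. A minimal sketch follows; the function name, the parameter names, and the choice of `EIO` as the error value are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical check: fail a passthrough request whose transfer length
 * exceeds the controller's maximum transfer size.  Returns 0 if the
 * request may proceed, or an errno value (EIO assumed here) otherwise.
 */
static int
check_passthrough_len(uint32_t len, uint32_t max_xfer_size)
{
	if (len > max_xfer_size)
		return (EIO);
	return (0);
}
```

Rejecting the request up front means an oversized length from userspace never reaches the code that builds PRP entries.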
Jim Harris
8d09e3c400 Use MAXPHYS to specify the maximum I/O size for nvme(4).
Also allow admin commands to transfer up to this maximum I/O size, rather
than the artificial limit previously imposed.  The larger I/O size is very
beneficial for upcoming firmware download support.  This has the added
benefit of simplifying the code since both admin and I/O commands now use
the same maximum I/O size.

Sponsored by:	Intel
MFC after:	3 days
2013-06-26 23:27:17 +00:00
Jim Harris
ca269f32ef Move the busdma mapping functions to nvme_qpair.c.
This removes nvme_uio.c completely.

Sponsored by:	Intel
2013-04-12 17:48:45 +00:00
Jim Harris
e2b9900498 Do not panic when a busdma mapping operation fails.
Instead, print an error message and fail the associated command with
DATA_TRANSFER_ERROR NVMe completion status.

Sponsored by:	Intel
2013-04-12 17:34:49 +00:00