Commit Graph

111 Commits

Author SHA1 Message Date
David Bright
4053f8ac4d Fix various Coverity-detected errors in nvme driver
This fixes several Coverity-detected errors in the nvme driver.

CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980

Reviewed by:	imp@, vangyzen@
MFC after:	5 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D24532
2020-05-02 20:47:58 +00:00
Warner Losh
4e6a434b6b Make sure that we get the sbuf resources we need.
Since we're calling sbuf_new with NOWAIT, make sure it can allocate a
buffer to use. Don't print anything if we can't get it.

Noticed by: rpokala
2020-04-30 00:43:11 +00:00
Warner Losh
244b805397 Generate a devctl event for interesting events
When we reset the controller, and when the controller tells us about a
critical warning, send an event.
2020-04-30 00:27:19 +00:00
Alexander Motin
b2cdfb72f4 Fix copy-paste bug in HMB free code.
MFC after:	2 weeks
X-MFC-with:	r356474
2020-01-08 18:26:23 +00:00
Alexander Motin
6de4e458fa Minor adjustments to r356474 and r356480.
Reported by:	jkim, imp
MFC after:	2 weeks
X-MFC-with:	r356474
2020-01-07 23:29:54 +00:00
Alexander Motin
1c7dd40e58 Increate HMB limit from 1% to 5%.
SSD capacity in laptops is growing faster then RAM size, so my original
guess seems too low on second thought.  Hopefully nobody will build large
array of those crappy SSDs.

MFC after:	2 weeks
X-MFC-with:	356474
2020-01-07 23:10:38 +00:00
Alexander Motin
67abaee9fc Add Host Memory Buffer support to nvme(4).
This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about
1MB per 1GB on the devices I have) for its metadata cache, significantly
improving random I/O performance.  Device reports minimal and preferable
size of the buffer.  The code limits it to 1% of physical RAM by default.
If the buffer can not be allocated or below minimal size, the device will
just have to work without it.

MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
2020-01-07 21:17:11 +00:00
Warner Losh
7588c6cc36 Move to using bool instead of boolean_t
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999
2019-12-13 18:35:48 +00:00
Warner Losh
66e5985084 Move reset to the interrutp processing stage
This trims the boot time a bit more for AWS and other platforms that have nvme
drives. There's no reason too do this inline. This has been in my tree a while,
but IIRC I talked to Jim Harris about this at one of our face to face meetings.

MFC After: 2 weeks
2019-12-11 22:51:02 +00:00
Alexander Motin
1eab19cbec Make nvme(4) driver some more NUMA aware.
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
 - Allocate most of queue pair memory from the domain it is bound to.
 - Bind callouts to the same CPUs as queue pair to avoid migrations.
 - Do not assign queue pairs to each SMT thread.  It just wasted
resources and increased lock congestions.
 - Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
 - If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-23 17:53:47 +00:00
Warner Losh
f93b7f954e Support doorbell strides != 0.
The NVMe standard (1.4) states

>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ...  For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.

However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.

MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
2019-09-04 20:08:36 +00:00
Warner Losh
4d5475613e Implement nvme suspend / resume for pci attachment
When we suspend, we need to properly shutdown the NVME controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutodown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.

On resume, we have to reset the card twice, for reasons described in
the attach funcion. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.

Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.

Rename nvme_ctrlr_destory_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions.  It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.

Split parts of the hardware reset function up so that I can
do part of the reset in suspsend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.

Finally, fix a couple of spelling errors in comments related to
this.

Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493
2019-09-03 15:26:11 +00:00
Warner Losh
ab0681aac9 In all the places that we use the polled for completion interface, except crash
dump support code, move the while loop into an inline function. These aren't
done in the fast path, so if the compiler choses to not inline, any performance
hit is tiny.
2019-09-02 17:11:27 +00:00
Warner Losh
8e61280bd9 When we have errors resetting the device before we allocate the
queues, don't try to tear them down in the ctrlr_destroy
path. Otherwise, we dereference queue structures that are NULL and we
trap.

This fix is incomplete: we leak IRQ and MSI resources when this
happens. That's preferable to a crash but still should be fixed.
2019-08-22 21:56:11 +00:00
Warner Losh
f182f928db Separate the pci attachment from the rest of nvme
Nvme drives can be attached in a number of different ways. Separate out the PCI
attachment so that we can have other attachment types, like ahci and various
types of NVMeoF.

Submitted by: cognet@
2019-08-21 22:17:55 +00:00
Alexander Motin
71a2818142 Improve NVMe hot unplug handling.
If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back.  If there is a detach call in such state,
just stop all activity and free resources.  If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-08-21 20:17:30 +00:00
Alexander Motin
a6d222eb68 Add more random bits from NVMe 1.4.
MFC after:	2 weeks
2019-08-03 02:36:35 +00:00
Alexander Motin
6c99d1325e Decode few more NVMe log pages.
In particular: Changed Namespace List, Commands Supported and Effects,
Reservation Notification, Sanitize Status.

Add few new arguments to `nvmecontrol log` subcommand.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-08-02 20:16:21 +00:00
Alexander Motin
a7bf63be69 Add IOCTL to translate nvdX into nvmeY and NSID.
While very useful by itself, it also makes `nvmecontrol` not depend on
hardcoded device names parsing, that in its turn makes simple to take
nvdX (and potentially any other) device names as arguments.

Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them
interchangeable for management purposes.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-08-01 21:44:07 +00:00
Warner Losh
08a607e0f3 Widen the type for to.
The timeout field in the CAPS register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed 0 timeout time (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1
overflowed. Widen the type to be uint32_t like its source register to avoid the
issue.

Reported by: bapt@
2019-07-25 20:26:21 +00:00
Warner Losh
62d2cf1847 Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers.
These macros make places where we extract these easier to read. The shift and
mask stuff is also a bit tedious and error prone. Start with the CAP_LO and
CAP_HI registers since their scope is somewhat constrained. This is style
chagne only, no functional changes.

Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20979
2019-07-18 15:41:10 +00:00
Warner Losh
dc9df3a59d Assume that the timeout value from the capacity is 1-based
Neither the 1.3 or 1.4 standards say this number is 1's based, but adding 1
costs little and copes with those NVMe drives that report '0' in this field
cheaply. This is consistent with what the Linux driver does as well.
2019-07-16 22:55:30 +00:00
Warner Losh
9835d216d8 rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs
Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to
match init/uninit scenarios.

Signed-off-by: John Meneghini <johnm@netapp.com>
Submitted by: Michael Hordijk <hordijk@netapp.com>
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D19781
2019-05-08 20:18:11 +00:00
Warner Losh
2ffd6fce5b Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
2019-03-09 01:18:16 +00:00
Warner Losh
45d7e233a5 Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
2019-02-27 22:16:59 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Warner Losh
91182bcfb6 Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli
(and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3.

Sponsored by: Netflix
2018-12-07 21:58:08 +00:00
Chuck Tuffli
9544e6dcf1 Make NVMe compatible with the original API
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).

This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.

This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.

Bump __FreeBSD_version to 1200081 due to API change

Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
2018-08-22 04:29:24 +00:00
Alexander Motin
f439e3a4ff Refactor NVMe CAM integration.
- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
 - Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
 - Add initial support for namespace change async events.  So far only
in CAM mode, but it allows run-time namespace arrival and departure.
 - Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.

Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.

Reviewed by:	imp
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2018-05-25 03:34:33 +00:00
Alexander Motin
c252f63740 Fix LOR between controller and queue locks.
Admin pass-through requests took controller lock before the queue lock,
but in case of request submission to a failed controller controller lock
was taken after the queue lock.  Fix that by reducing the lock scopes and
switching to mtx_pool locks to track pass-through request completion.

Sponsored by:	iXsystems, Inc.
2018-05-02 20:13:03 +00:00
Alexander Motin
e134ecdcfc Improve nvme(4) attach/detach sequences.
This change allows clean device detach on attach failures and driver unload,
while previous code tried to talk to already shut down controller, or even
accessed resources failed to allocate.

Sponsored by:	iXsystems, Inc.
2018-04-30 23:05:57 +00:00
Warner Losh
5d7fd8f726 Fix error messages in cut and pasted code.
Also, fix an unnecessary deref to get ctrlr.

Noticed by: rpokala@
Sponsored by: Netflix
2018-03-14 23:28:28 +00:00
Warner Losh
8b1e6ebe0e When tearing down a queue pair, also delete the queue entries.
The NVME standard has required in section 7.2.6, since at least 1.1,
that a clean shutdown is signalled by deleting the subission and the
completion queues before setting the shutdown bit in CC. The 1.0
standard, apparently, did not and many of the early Intel cards didn't
care. Some newer cards care, at least one whose beta firmware can
scramble the card on an unclean shutdown. Linux has done this for some
time. To make it possible to move forward with an evaluation of this
pre-release card with wonky firmware, delete the queues on the card
when we delete the qpair structures.

Sponsored by: Netflix
2018-03-14 23:01:18 +00:00
Wojciech Macek
0d787e9b35 NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
Warner Losh
29077eb456 Use atomic load and stores to ensure that the compiler doesn't
optimize away these loops. Change boolean to int to match what atomic
API supplies. Remove wmb() since the atomic_store_rel() on status.done
ensure the prior writes to status. It also fixes the fact that there
wasn't a rmb() before reading done. This should also be more efficient
since wmb() is fairly heavy weight.

Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053
2018-01-29 00:00:52 +00:00
Warner Losh
989c7f0b7c Although we only have one quirk at the moment, guard against the day
we have more than one by checking the actual quirk bit before delaying
the reset.

Noticed by: rpokala@
2017-12-18 20:11:21 +00:00
Warner Losh
ce1ec9c178 When we're disabling the nvme device, some drives have a controller
bug that requires 'hands off' for a period of time (2.3s) before we
check the RDY bit. Sicne this is a very odd quirk for a very limited
selection of drives, do this as a quirk. This prevented a successful
reset of the card when the card wedged.

Also, make sure that we comply with the advice from section 3.1.5 of
the 1.3 spec says that transitioning CC.EN from 0 to 1 when CSTS.RDY
is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has
undefined results". Short circuit when EN == RDY == desired state.

Finally, fail the reset if the disable fails. This will lead to a
failed device, which is what we want. (note: nda device needs
work for coping with a failed device).

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13389
2017-12-18 18:38:00 +00:00
Pedro F. Giffuni
718cf2ccb9 sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 14:52:40 +00:00
Warner Losh
bb1c7be429 Create general polling function for the nvme controller. Use it when
we're doing the various pin-based interrupt modes. Adjust
nvme_ctrlr_intx_handler to use nvme_ctrlr_poll.

Sponsored by: Netflix
Suggested by: scottl@
2017-10-15 16:18:08 +00:00
Warner Losh
5fff95cc1d Fix queue depth for nda.
1/4 of the number of queues times queue entries is too limiting. It
works up to about 4k IOPS / 3.0GB/s for hardware that can do
4.4k/3.2GB/s with nvd. 3/4 works better, though it highlights issues
in the fairness of nda's choice of TRIM vs READ. That will be fixed
separately.
2017-09-20 21:42:25 +00:00
Warner Losh
c02565f9fa Set the max transactions for NVMe drives better.
Provided a better estimate for the number of transactions that can be
pending at one time. This will be number of queues * number of
trackers / 4, as suggested by Jim Harris. This gives a better estimate
of the number of transactions that CAM should queue before applying
back pressure. This should be revisted when we have real multi-queue
support in CAM and the upper layers of the I/O stack.

Sponsored by: Netflix
2017-08-28 23:54:20 +00:00
Warner Losh
696c950297 NVME Namespace ID is 32-bits, so widen interface to reflect that.
Sponsored by: Netflix
2017-08-25 21:38:38 +00:00
Warner Losh
824073fbd6 Avoid dereferencing unintialized elements in the error path.
Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninialized data.

Sponsored by: Netflix
2017-03-07 23:06:41 +00:00
Warner Losh
a8a18dd590 Make multi-namespace nvme drives more robust.
Fix assumptions about name spaces in NVME driver. First, it assumes
cdata.nn is the number of configured devices. However, it is the
number of supported name spaces. Second, it assumes that there will
never be more than 16 name spaces supported, but a certain drive I'm
testing reports 1024. It assumes that name spaces are a tightly packed
namespace, but the standard seems to indicate otherwise. Finally, it
assumes that an error would be generated when quearying an
unconfigured namespace. Instead, it succeeds but the identify data is
all zeros.

Fix these by limiting the number of name spaces we probe to 16. Remove
aborting when we find one in error. When the size of the name space is
zero, ignore it.

This is admittedly a bandaide. The long term fix will be to
participate in the enumeration and name space change protocols
definfed in the NVNe standard.

Sponsored by: Netflix
2017-03-07 21:47:54 +00:00
Warner Losh
a3a6c48d66 Ensure that the passthrough request will fit in MAXPHYS bytes after it
has been rounded to full pages. This avoids a panic in
vm_fault_quick_hold_pages due to this off-by-one error passing one
page too many into vmapbuf.
2017-02-02 23:04:06 +00:00
Scott Long
a965389b5a Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safery belts and error handling in related codepaths.

Reviewed by:	jimharris
Obtained from:	Netflix
Differential Revision:	D8453
2016-11-08 00:24:49 +00:00
Warner Losh
f24c011beb Commit the bits of nda that were missed. This should fix the build.
Approved by: re@
2016-06-10 06:04:53 +00:00
Jim Harris
361e1fb408 nvme: fix intx handler to not dereference ioq during initialization
This was a regression from r293328, which deferred allocation
of the controller's ioq array until after interrupts are enabled
during boot.

PR:		207432
Reported and tested by: Andy Carrel <wac@google.com>
MFC after:	3 days
Sponsored by:	Intel
2016-02-24 00:01:10 +00:00
Justin Hibbits
43cd61606b Replace several bus_alloc_resource() calls using default arguments with bus_alloc_resource_any()
Since these calls only use default arguments, bus_alloc_resource_any() is the
right call.

Differential Revision: https://reviews.freebsd.org/D5306
2016-02-19 03:37:56 +00:00
Jim Harris
7b036d7790 nvme: avoid duplicate SET_NUM_QUEUES commands
nvme(4) issues a SET_NUM_QUEUES command during device
initialization to ensure enough I/O queues exists for each
of the MSI-X vectors we have allocated.  The SET_NUM_QUEUES
command is then issued again during nvme_ctrlr_start(), to
ensure that is properly set after any controller reset.

At least one NVMe drive exists which fails this second
SET_NUM_QUEUES command during device initialization.  So
change nvme_ctrlr_start() to only issue its SET_NUM_QUEUES
command when it is coming out of a reset - avoiding the
duplicate SET_NUM_QUEUES during device initialization.

Reported by:	gallatin
MFC after:	3 days
Sponsored by:	Intel
2016-02-11 17:32:41 +00:00