95 Commits

Author SHA1 Message Date
Alexander Motin
a6d222eb68 Add more random bits from NVMe 1.4.
MFC after:	2 weeks
2019-08-03 02:36:35 +00:00
Alexander Motin
6c99d1325e Decode a few more NVMe log pages.
In particular: Changed Namespace List, Commands Supported and Effects,
Reservation Notification, Sanitize Status.

Add a few new arguments to the `nvmecontrol log` subcommand.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-08-02 20:16:21 +00:00
Alexander Motin
a7bf63be69 Add IOCTL to translate nvdX into nvmeY and NSID.
While very useful by itself, it also frees `nvmecontrol` from parsing
hardcoded device names, which in turn makes it simple to accept
nvdX (and potentially any other) device names as arguments.

Also added an IOCTL bypass from nvdX to the respective nvmeYnsZ, making
them interchangeable for management purposes.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-08-01 21:44:07 +00:00
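
As a rough illustration of how such a translation IOCTL can be used from
userland, here is a minimal sketch; the NVME_GET_NSID request name and the
struct nvme_get_nsid layout are assumptions inferred from the description
above, not a verbatim copy of the interface this commit adds.

    /*
     * Sketch: map an nvdX node to its owning nvmeY controller and NSID.
     * NVME_GET_NSID and struct nvme_get_nsid are assumed names; consult
     * <dev/nvme/nvme.h> for the actual interface.
     */
    #include <sys/ioctl.h>
    #include <dev/nvme/nvme.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        struct nvme_get_nsid gnsid;
        int fd;

        fd = open(argc > 1 ? argv[1] : "/dev/nvd0", O_RDONLY);
        if (fd < 0)
            err(1, "open");
        if (ioctl(fd, NVME_GET_NSID, &gnsid) < 0)
            err(1, "NVME_GET_NSID");
        /* gnsid.cdev names the controller device, gnsid.nsid the namespace. */
        printf("%s, nsid %u\n", gnsid.cdev, gnsid.nsid);
        close(fd);
        return (0);
    }
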
Warner Losh
08a607e0f3 Widen the type of the timeout field `to`.
The timeout field in the CAP register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed a timeout of 0 (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1 overflowed
the 8-bit value. Widen the type to uint32_t, like its source register, to avoid
the issue.

Reported by: bapt@
2019-07-25 20:26:21 +00:00
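
A self-contained illustration of the overflow being described; the variable
names are illustrative, not the driver's.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint8_t  to8  = 0xff;   /* CAP.TO as reported, in 500ms units */
        uint32_t to32 = 0xff;

        /* Adding 1 wraps an 8-bit value to 0 ... */
        printf("uint8_t:  %u\n", (unsigned)(uint8_t)(to8 + 1));
        /* ... but is fine in a 32-bit one: 256 ticks = 128 seconds. */
        printf("uint32_t: %u\n", to32 + 1);
        return (0);
    }
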
Warner Losh
62d2cf1847 Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers.
These macros make the places where we extract these fields easier to read. The
shift-and-mask arithmetic is also a bit tedious and error prone. Start with the
CAP_LO and CAP_HI registers since their scope is somewhat constrained. This is a
style change only; no functional changes.

Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20979
2019-07-18 15:41:10 +00:00
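
A hedged sketch of the shift-and-mask pattern such macros wrap, using the
CAP.TO field (bits 31:24 of the low dword, in 500ms units) as an example; the
actual macro names in <dev/nvme/nvme.h> may be spelled differently.

    #include <stdint.h>

    /* Illustrative extractor for CAP.TO, bits 31:24 of CAP_LO. */
    #define EXAMPLE_CAP_LO_TO_SHIFT    24
    #define EXAMPLE_CAP_LO_TO_MASK     0xffu
    #define EXAMPLE_CAP_LO_TO(x) \
        (((x) >> EXAMPLE_CAP_LO_TO_SHIFT) & EXAMPLE_CAP_LO_TO_MASK)

    static uint32_t
    ready_timeout_ticks(uint32_t cap_lo)
    {
        /* Replaces open-coded "(cap_lo >> 24) & 0xff" at each call site. */
        return (EXAMPLE_CAP_LO_TO(cap_lo) + 1);
    }
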
Warner Losh
dc9df3a59d Assume that the timeout value from the capabilities register is 1-based
Neither the 1.3 nor the 1.4 standard says this number is 1-based, but adding 1
costs little and cheaply copes with those NVMe drives that report '0' in this
field. This is consistent with what the Linux driver does as well.
2019-07-16 22:55:30 +00:00
Warner Losh
9835d216d8 rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs
Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to
match init/uninit scenarios.

Signed-off-by: John Meneghini <johnm@netapp.com>
Submitted by: Michael Hordijk <hordijk@netapp.com>
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D19781
2019-05-08 20:18:11 +00:00
Warner Losh
2ffd6fce5b Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we were going to retry,
which imparted no useful information. Stop printing them unless the
retry count is exhausted. Clarify the code for the case where we don't
retry, and remove a useless argument from a routine that was always
called with it as 'true'. All the other debug output is still printed
(including multiple reset messages if we have multiple timeouts before
the taskqueue runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
2019-03-09 01:18:16 +00:00
Warner Losh
45d7e233a5 Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels; however, all supported versions of FreeBSD
have unmapped I/O (as do several that have gone EOL), so remove it. It's
unlikely the driver would work on the older kernels at this point anyway.
2019-02-27 22:16:59 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting for how many
  pbufs we are going to have.
  In the various subsystems that are going to utilize pbufs, create private
  zones via a call to pbuf_zsecond_create(). The latter calls
  uma_zsecond_create() and sets a limit on the created zone. After startup,
  preallocate pbufs according to the requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed with respect to modern hardware.

This change removes a tight bottleneck from the sendfile(2) operation, which
uses pbufs in the vnode pager. Other pagers would also benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
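
A hedged sketch of how a subsystem might adopt a private pbuf zone under this
scheme; pbuf_zsecond_create() is named in the message above, but its exact
signature and the surrounding calls are assumptions, so treat this as an
illustration rather than copy-paste kernel code.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/buf.h>
    #include <vm/uma.h>
    #include <vm/vm.h>
    #include <vm/vm_pager.h>

    static uma_zone_t mysubsys_pbuf_zone;

    static void
    mysubsys_init(void)
    {
        /* Private zone, capped at half of the global nswbuf budget. */
        mysubsys_pbuf_zone = pbuf_zsecond_create("mysubsyspbuf", nswbuf / 2);
    }

    static void
    mysubsys_io(void)
    {
        struct buf *bp;

        bp = uma_zalloc(mysubsys_pbuf_zone, M_WAITOK);
        /* ... set up and issue the transfer using bp ... */
        uma_zfree(mysubsys_pbuf_zone, bp);
    }
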
Warner Losh
91182bcfb6 Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli
(and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3.

Sponsored by: Netflix
2018-12-07 21:58:08 +00:00
Chuck Tuffli
9544e6dcf1 Make NVMe compatible with the original API
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).

This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.

This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.

Bump __FreeBSD_version to 1200081 due to API change

Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
2018-08-22 04:29:24 +00:00
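
An illustrative sketch of the approach: expressing NVMe command dword 0 as
bytes keeps the layout identical on little- and big-endian hosts, while
sub-byte fields are reached with shift-and-mask helpers. The struct and macro
names here are illustrative, not the driver's.

    #include <stdint.h>

    /* First dword of an NVMe submission queue entry, spelled out as bytes. */
    struct example_nvme_cdw0 {
        uint8_t  opc;        /* byte 0: opcode */
        uint8_t  fuse_psdt;  /* byte 1: FUSE in bits 1:0, PSDT in bits 7:6 */
        uint16_t cid;        /* bytes 2-3: command id (htole16() when building) */
    } __attribute__((__packed__));

    /* Sub-byte fields still need shift-and-mask accessors. */
    #define EXAMPLE_CMD_GET_FUSE(b)    ((b) & 0x3)
    #define EXAMPLE_CMD_GET_PSDT(b)    (((b) >> 6) & 0x3)
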
Alexander Motin
f439e3a4ff Refactor NVMe CAM integration.
 - Remove a layering violation where the NVMe SIM code accessed CAM-internal
device structures to set pointers to controller and namespace data.
Instead, make the NVMe XPT probe fetch the data directly from the hardware.
 - Clean up the NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
 - Add initial support for namespace change async events.  So far only
in CAM mode, but it allows run-time namespace arrival and departure.
 - Add a missing nvme_notify_fail_consumers() call on controller detach.
Together with the previous changes this allows NVMe device detach/unplug.

Non-CAM mode still requires a lot of love to stay on par, but at least
the CAM mode code should no longer stand in the way so much, becoming
much more self-sufficient.

Reviewed by:	imp
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2018-05-25 03:34:33 +00:00
Alexander Motin
c252f63740 Fix LOR between controller and queue locks.
Admin pass-through requests took the controller lock before the queue lock,
but when a request was submitted to a failed controller the controller lock
was taken after the queue lock.  Fix that by reducing the lock scopes and
switching to mtx_pool locks to track pass-through request completion.

Sponsored by:	iXsystems, Inc.
2018-05-02 20:13:03 +00:00
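
A rough sketch of the mtx_pool pattern mentioned above, as one might use it to
sleep until a pass-through request completes; the helper is hypothetical and
only illustrates mtx_pool_find(9) usage, under the assumption that the
completion side sets the flag and does a matching wakeup() on the same address.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /*
     * Hash the status object to one of the shared pool mutexes for the
     * duration of the sleep/wakeup handshake instead of holding the
     * controller lock across submission and completion.
     */
    static void
    example_wait_for_pt(int *donep)
    {
        struct mtx *mtx;

        mtx = mtx_pool_find(mtxpool_sleep, donep);
        mtx_lock(mtx);
        while (*donep == 0)
            msleep(donep, mtx, 0, "nvmept", 0);
        mtx_unlock(mtx);
    }
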
Alexander Motin
e134ecdcfc Improve nvme(4) attach/detach sequences.
This change allows clean device detach on attach failures and driver unload,
while the previous code tried to talk to an already shut down controller, or
even accessed resources that had failed to allocate.

Sponsored by:	iXsystems, Inc.
2018-04-30 23:05:57 +00:00
Warner Losh
5d7fd8f726 Fix error messages in cut-and-pasted code.
Also, fix an unnecessary dereference to get ctrlr.

Noticed by: rpokala@
Sponsored by: Netflix
2018-03-14 23:28:28 +00:00
Warner Losh
8b1e6ebe0e When tearing down a queue pair, also delete the queue entries.
The NVMe standard has required, in section 7.2.6 since at least 1.1, that
a clean shutdown is signalled by deleting the submission and the
completion queues before setting the shutdown bit in CC. The 1.0
standard, apparently, did not, and many of the early Intel cards didn't
care. Some newer cards do care, including at least one whose beta firmware
can scramble the card on an unclean shutdown. Linux has done this for some
time. To make it possible to move forward with an evaluation of this
pre-release card with wonky firmware, delete the queues on the card
when we delete the qpair structures.

Sponsored by: Netflix
2018-03-14 23:01:18 +00:00
Wojciech Macek
0d787e9b35 NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
Warner Losh
29077eb456 Use atomic loads and stores to ensure that the compiler doesn't
optimize away these loops. Change the boolean to an int to match what the
atomic API supplies. Remove wmb(), since the atomic_store_rel() on status.done
ensures the prior writes to status are visible. This also fixes the fact that
there wasn't an rmb() before reading done, and should be more efficient
since wmb() is fairly heavy weight.

Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053
2018-01-29 00:00:52 +00:00
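
A hedged sketch of the resulting pattern: a release store publishes the
completion data before done becomes non-zero, and an acquire load on the
polling side pairs with it, so neither wmb() nor rmb() is needed. The
structure and function names are illustrative, not the driver's.

    #include <sys/types.h>
    #include <machine/atomic.h>

    struct example_poll_status {
        uint32_t cdw0;
        int      done;   /* int rather than bool, to match the atomic(9) API */
    };

    /* Completion side: fill in results, then publish with release semantics. */
    static void
    example_complete(struct example_poll_status *st, uint32_t cdw0)
    {
        st->cdw0 = cdw0;
        atomic_store_rel_int(&st->done, 1);
    }

    /* Polling side: the acquire load cannot be optimized out of the loop and
     * orders later reads of the status fields after done is seen non-zero. */
    static void
    example_poll(struct example_poll_status *st)
    {
        while (atomic_load_acq_int(&st->done) == 0)
            ;   /* the real driver pauses/delays here */
    }
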
Warner Losh
989c7f0b7c Although we only have one quirk at the moment, guard against the day
we have more than one by checking the actual quirk bit before delaying
the reset.

Noticed by: rpokala@
2017-12-18 20:11:21 +00:00
Warner Losh
ce1ec9c178 When we're disabling the nvme device, some drives have a controller
bug that requires 'hands off' for a period of time (2.3s) before we
check the RDY bit. Since this is a very odd requirement for a very limited
selection of drives, do this as a quirk. Without the delay, a wedged
card could not be reset successfully.

Also, make sure that we comply with the advice from section 3.1.5 of
the 1.3 spec, which says that transitioning CC.EN from 0 to 1 when CSTS.RDY
is 1, or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0, "has
undefined results". Short circuit when EN == RDY == the desired state.

Finally, fail the reset if the disable fails. This will lead to a
failed device, which is what we want. (note: nda device needs
work for coping with a failed device).

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13389
2017-12-18 18:38:00 +00:00
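
A small sketch of the short circuit described above; CC.EN and CSTS.RDY are
both bit 0 of their respective registers, and the names below are illustrative.

    #include <stdint.h>

    #define EXAMPLE_CC_EN      0x1u   /* CC.EN is bit 0 */
    #define EXAMPLE_CSTS_RDY   0x1u   /* CSTS.RDY is bit 0 */

    /*
     * Returns non-zero when the controller is already in the desired state,
     * so the EN transition (undefined per NVMe 1.3 section 3.1.5 if RDY
     * disagrees) can be skipped entirely.
     */
    static int
    example_enable_is_noop(uint32_t cc, uint32_t csts, int want_enabled)
    {
        int en  = (cc & EXAMPLE_CC_EN) != 0;
        int rdy = (csts & EXAMPLE_CSTS_RDY) != 0;

        return (en == want_enabled && rdy == want_enabled);
    }
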
Pedro F. Giffuni
718cf2ccb9 sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use the BSD 2-Clause license; however, the tool I
was using misidentified many licenses, so this was a mostly manual, error-prone
task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
2017-11-27 14:52:40 +00:00
Warner Losh
bb1c7be429 Create a general polling function for the nvme controller. Use it
for the various pin-based interrupt modes. Adjust
nvme_ctrlr_intx_handler to use nvme_ctrlr_poll.

Sponsored by: Netflix
Suggested by: scottl@
2017-10-15 16:18:08 +00:00
Warner Losh
5fff95cc1d Fix queue depth for nda.
1/4 of the number of queues times queue entries is too limiting. It
works up to about 4k IOPS / 3.0GB/s for hardware that can do
4.4k/3.2GB/s with nvd. 3/4 works better, though it highlights issues
in the fairness of nda's choice of TRIM vs READ. That will be fixed
separately.
2017-09-20 21:42:25 +00:00
Warner Losh
c02565f9fa Set the max transactions for NVMe drives better.
Provided a better estimate for the number of transactions that can be
pending at one time. This will be number of queues * number of
trackers / 4, as suggested by Jim Harris. This gives a better estimate
of the number of transactions that CAM should queue before applying
back pressure. This should be revisited when we have real multi-queue
support in CAM and the upper layers of the I/O stack.

Sponsored by: Netflix
2017-08-28 23:54:20 +00:00
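
The estimate above reduces to a one-liner; a sketch, with illustrative names.

    #include <stdint.h>

    /* Transactions CAM should queue before applying back pressure. */
    static uint32_t
    example_max_trans(uint32_t num_io_queues, uint32_t trackers_per_queue)
    {
        return (num_io_queues * trackers_per_queue / 4);
    }
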
Warner Losh
696c950297 NVMe Namespace ID is 32 bits, so widen the interface to reflect that.
Sponsored by: Netflix
2017-08-25 21:38:38 +00:00
Warner Losh
824073fbd6 Avoid dereferencing uninitialized elements in the error path.
Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic by dereferencing uninitialized data.

Sponsored by: Netflix
2017-03-07 23:06:41 +00:00
Warner Losh
a8a18dd590 Make multi-namespace nvme drives more robust.
Fix assumptions about namespaces in the NVMe driver. First, it assumes
cdata.nn is the number of configured namespaces; however, it is the
number of supported namespaces. Second, it assumes that there will
never be more than 16 namespaces supported, but a certain drive I'm
testing reports 1024. Third, it assumes that namespaces are tightly
packed, but the standard seems to indicate otherwise. Finally, it
assumes that an error would be generated when querying an
unconfigured namespace. Instead, the query succeeds but the identify
data is all zeros.

Fix these by limiting the number of namespaces we probe to 16. Stop
aborting when we find one in error. When the size of a namespace is
zero, ignore it.

This is admittedly a band-aid. The long term fix will be to
participate in the enumeration and namespace change protocols
defined in the NVMe standard.

Sponsored by: Netflix
2017-03-07 21:47:54 +00:00
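
A self-contained sketch of the band-aid policy: clamp how many namespaces are
probed and treat all-zero identify data (nsze == 0) as "unconfigured, skip it".
Function and macro names are illustrative.

    #include <stdint.h>
    #include <sys/param.h>   /* MIN() */

    #define EXAMPLE_MAX_NAMESPACES 16   /* probe no more than this many */

    /* cdata.nn counts *supported* namespaces, not configured ones. */
    static uint32_t
    example_namespaces_to_probe(uint32_t cdata_nn)
    {
        return (MIN(cdata_nn, (uint32_t)EXAMPLE_MAX_NAMESPACES));
    }

    /* An unconfigured namespace identifies as all zeros; just skip it. */
    static int
    example_ns_is_configured(uint64_t nsze)
    {
        return (nsze != 0);
    }
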
Warner Losh
a3a6c48d66 Ensure that the passthrough request will fit in MAXPHYS bytes after it
has been rounded to full pages. This avoids a panic in
vm_fault_quick_hold_pages due to an off-by-one error that passed one
page too many into vmapbuf.
2017-02-02 23:04:06 +00:00
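
A hedged sketch of the check being described: count the pages the user buffer
spans once rounded out to page boundaries, and refuse the pass-through request
if that exceeds what MAXPHYS worth of pages can hold. Names are illustrative
and the real driver's test may be expressed differently.

    #include <sys/param.h>   /* PAGE_SIZE, MAXPHYS (kernel), howmany() */
    #include <sys/types.h>

    /*
     * "off" is the buffer's starting offset within its first page; rounding
     * to full pages can add one extra page compared to len / PAGE_SIZE.
     */
    static int
    example_pt_buffer_fits(size_t len, size_t off)
    {
        return (howmany(off + len, PAGE_SIZE) <= MAXPHYS / PAGE_SIZE);
    }
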
Scott Long
a965389b5a Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safery belts and error handling in related codepaths.

Reviewed by:	jimharris
Obtained from:	Netflix
Differential Revision:	D8453
2016-11-08 00:24:49 +00:00
Warner Losh
f24c011beb Commit the bits of nda that were missed. This should fix the build.
Approved by: re@
2016-06-10 06:04:53 +00:00
Jim Harris
361e1fb408 nvme: fix intx handler to not dereference ioq during initialization
This was a regression from r293328, which deferred allocation
of the controller's ioq array until after interrupts are enabled
during boot.

PR:		207432
Reported and tested by: Andy Carrel <wac@google.com>
MFC after:	3 days
Sponsored by:	Intel
2016-02-24 00:01:10 +00:00
Justin Hibbits
43cd61606b Replace several bus_alloc_resource() calls using default arguments with bus_alloc_resource_any()
Since these calls only use default arguments, bus_alloc_resource_any() is the
right call.

Differential Revision: https://reviews.freebsd.org/D5306
2016-02-19 03:37:56 +00:00
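
For reference, the two calls are equivalent when only the default range is
wanted; a sketch of a typical BAR allocation (the surrounding function is
illustrative).

    #include <sys/param.h>
    #include <sys/bus.h>
    #include <sys/rman.h>
    #include <machine/bus.h>
    #include <machine/resource.h>

    static struct resource *
    example_alloc_bar(device_t dev, int *rid)
    {
        /*
         * Before: bus_alloc_resource(dev, SYS_RES_MEMORY, rid,
         *     0ul, ~0ul, 1, RF_ACTIVE) -- start/end/count are the defaults.
         * After: say the same thing directly.
         */
        return (bus_alloc_resource_any(dev, SYS_RES_MEMORY, rid, RF_ACTIVE));
    }
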
Jim Harris
7b036d7790 nvme: avoid duplicate SET_NUM_QUEUES commands
nvme(4) issues a SET_NUM_QUEUES command during device
initialization to ensure enough I/O queues exist for each
of the MSI-X vectors we have allocated.  The SET_NUM_QUEUES
command is then issued again during nvme_ctrlr_start(), to
ensure that it is properly set after any controller reset.

At least one NVMe drive exists which fails this second
SET_NUM_QUEUES command during device initialization.  So
change nvme_ctrlr_start() to only issue its SET_NUM_QUEUES
command when it is coming out of a reset - avoiding the
duplicate SET_NUM_QUEUES during device initialization.

Reported by:	gallatin
MFC after:	3 days
Sponsored by:	Intel
2016-02-11 17:32:41 +00:00
Jim Harris
9c6b5d40eb nvme: replace NVME_CEILING macro with howmany()
Suggested by:	rpokala
MFC after:	3 days
2016-01-07 20:35:26 +00:00
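
For reference, howmany() from <sys/param.h> is the standard ceiling-division
idiom, (((x) + ((y) - 1)) / (y)); a tiny example:

    #include <sys/param.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* e.g. pages needed for 1000 sixteen-byte completion entries */
        printf("%lu\n", (unsigned long)howmany(1000 * 16, 4096));  /* 4 */
        return (0);
    }
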
Jim Harris
50dea2da12 nvme: add hw.nvme.min_cpus_per_ioq tunable
Due to FreeBSD system-wide limits on number of MSI-X vectors
(https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321),
it may be desirable to allocate fewer than the maximum number
of vectors for an NVMe device, in order to save vectors for
other devices (usually Ethernet) that can take better
advantage of them and may be probed after NVMe.

This tunable is expressed in terms of minimum number of CPUs
per I/O queue instead of max number of queues per controller,
to allow for a more even distribution of CPUs per queue.  This
avoids cases where some number of CPUs have a dedicated queue,
but other CPUs need to share queues.  Ideally the PR referenced
above will eventually be fixed and the mechanism implemented
here will become obsolete anyway.

While here, fix a bug in the CPUs per I/O queue calculation to
properly account for the admin queue's MSI-X vector.

Reviewed by:	gallatin
MFC after:	3 days
Sponsored by:	Intel
2016-01-07 20:32:04 +00:00
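
A hedged sketch of the sizing policy described above, including the
admin-queue vector correction; the names are illustrative.

    #include <stdint.h>
    #include <sys/param.h>   /* MIN(), MAX() */

    /*
     * One MSI-X vector is consumed by the admin queue, so only
     * (vectors - 1) remain for I/O queues; also honor the
     * hw.nvme.min_cpus_per_ioq tunable.
     */
    static uint32_t
    example_num_io_queues(uint32_t ncpus, uint32_t msix_vectors,
        uint32_t min_cpus_per_ioq)
    {
        uint32_t n;

        n = MIN(ncpus / min_cpus_per_ioq, msix_vectors - 1);
        return (MAX(n, 1u));
    }
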
Jim Harris
2b647da7a0 nvme: do not revert to a single I/O queue when per-CPU queues not possible
Previously nvme(4) would revert to a single I/O queue if it could not
allocate enough interrupt vectors or NVMe submission/completion queues
to have one I/O queue per core.  This patch determines how to utilize a
smaller number of available interrupt vectors, and assigns (as closely
as possible) an equal number of cores to each associated I/O queue.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:18:32 +00:00
Jim Harris
d400f790b1 nvme: break out interrupt setup code into a separate function
MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:12:42 +00:00
Jim Harris
e5af5854ff nvme: do not pre-allocate MSI-X IRQ resources
The issue referenced here was resolved by other changes
in recent commits, so this code is no longer needed.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:11:31 +00:00
Jim Harris
c75ad8ce5a nvme: remove per_cpu_io_queues from struct nvme_controller
Instead just use num_io_queues to make this determination.

This prepares for some future changes enabling use of multiple
queues when we do not have enough queues or MSI-X vectors
for one queue per CPU.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:09:56 +00:00
Jim Harris
d85f84abb8 nvme: simplify some of the nested ifs in interrupt setup code
This prepares for some follow-up commits which do more work in
this area.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:08:04 +00:00
Jeff Roberson
fade8dd714 Refactor unmapped buffer address handling.
 - Use pointer assignment rather than a combination of pointers and
   flags to switch buffers between unmapped and mapped.  This eliminates
   multiple flags and generally simplifies the logic.
 - Eliminate b_saveaddr since it is only used with pager bufs which have
   their b_data re-initialized on each allocation.
 - Gather up some convenience routines in the buffer cache for
   manipulating buf space and buf malloc space.
 - Add an inline, buf_mapped(), to standardize checks around unmapped
   buffers.

In collaboration with: mlaier
Reviewed by:	kib
Tested by:	pho (many small revisions ago)
Sponsored by:	EMC / Isilon Storage Division
2015-07-23 19:13:41 +00:00
Jim Harris
cbdec09c1c nvme: ensure csts.rdy bit is cleared before returning from nvme_ctrlr_disable
PR:		200458
MFC after:	3 days
Sponsored by:	Intel
2015-07-23 15:50:39 +00:00
Jim Harris
de9a58f4ee nvme: properly handle case where pci_alloc_msix does not alloc all vectors
Reported by: Sean Kelly <smkelly@smkelly.org>
MFC after:	3 days
Sponsored by:	Intel
2015-07-23 15:35:08 +00:00
Jim Harris
36b0e4ee1f nvme: remove CHATHAM related code
Chatham was an internal NVMe prototype board used for
early driver development.

MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:52:06 +00:00
Jim Harris
e5ce537999 nvme: fall back to a smaller MSI-X vector allocation if necessary
Previously, if per-CPU MSI-X vectors could not be allocated,
nvme(4) would fall back to INTx with a single I/O queue pair.
This change will still fall back to a single I/O queue pair, but
allocate MSI-X vectors instead of reverting to INTx.

MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:46:18 +00:00
Jim Harris
f42ca756b9 nvme: Allocate all MSI resources up front so that we can fall back to
INTx if necessary.

Sponsored by:	Intel
MFC after:	3 days
2014-03-18 18:10:35 +00:00
Jim Harris
496a27520d nvme: Close a hole where nvd(4) would not be notified of all nvme(4)
instances when modules are loaded during boot.

Sponsored by:	Intel
MFC after:	3 days
2014-03-18 18:09:08 +00:00
Jim Harris
2b26030cbc nvme: Remove the software progress marker SET_FEATURE command during
controller initialization.

The spec says OS drivers should send this command after controller
initialization completes successfully, but other NVMe OS drivers are
not sending this command.  This change will therefore reduce differences
between the FreeBSD and other OS drivers.

Sponsored by:	Intel
MFC after:	3 days
2014-03-17 22:36:04 +00:00
Jim Harris
448cffc859 For IDENTIFY passthrough commands to Chatham prototype controllers, copy
the spoofed identify data into the user buffer rather than issuing the
command to the controller, since Chatham IDENTIFY data is always spoofed.

While here, fix a bug in the spoofed data for Chatham submission and
completion queue entry sizes.

Sponsored by:	Intel
MFC after:	3 days
2014-01-06 23:51:26 +00:00