201 Commits

Author SHA1 Message Date
jhibbits
da001c8d4b nvme(4): Add bus_dmamap_sync() at the end of the request path
Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.

Prior to this change, when running 'make buildworld -j72' on an 18-core
(72-thread) POWER9, I would see controller resets often.  With this change, I
don't see these reset messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix.  Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.

Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570
2018-08-03 20:04:06 +00:00
mav
a8d82e59ae Refactor NVMe CAM integration.
- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
 - Clean up NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
 - Add initial support for namespace change async events.  So far only
in CAM mode, but it allows run-time namespace arrival and departure.
 - Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.

Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.

Reviewed by:	imp
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2018-05-25 03:34:33 +00:00
imp
899bd2ec13 Remove the 'All Rights Reserved' clause from some of the stuff I've
done for Netflix, since I'm in the neighborhood.
2018-05-09 20:32:23 +00:00
mav
8bccbb2ef4 Fix LOR between controller and queue locks.
Admin pass-through requests took controller lock before the queue lock,
but when a request was submitted to a failed controller, the controller lock
was taken after the queue lock.  Fix that by reducing the lock scopes and
switching to mtx_pool locks to track pass-through request completion.

Sponsored by:	iXsystems, Inc.
2018-05-02 20:13:03 +00:00
mav
d07ba3d42e Improve nvme(4) attach/detach sequences.
This change allows clean device detach on attach failures and driver unload,
while previous code tried to talk to an already shut down controller, or even
accessed resources that had failed to allocate.

Sponsored by:	iXsystems, Inc.
2018-04-30 23:05:57 +00:00
mav
01577577b3 Fix use-after-free in nvme_qpair_destroy().
dma_tag_payload should not be destroyed before payload_dma_map, and it seems
it should be used there instead of dma_tag to match creation.

Sponsored by:	iXsystems, Inc.
2018-04-30 21:28:10 +00:00
mav
6da2a12d85 Set si_drv1 for nvmeXnsY in a new race-free way.
r332897 switched to new KPI, but didn't use its main benefit.

Sponsored by:	iXsystems, Inc.
2018-04-30 19:21:20 +00:00
imp
c0908e1f7b Migrate to make_dev_s interface to populate /dev/nvmeX entries
Submitted by: Michael Hordijk
Differential Revision: https://reviews.freebsd.org/D15162
2018-04-23 22:30:17 +00:00
imp
fd0c9e33d4 Reword comment to remove awkward constructs, including an "it's" that
shouldn't have been there at all (it wasn't a typo for its, rather a
left-over from an older revision of the comment).

Noticed by: many
2018-04-19 16:05:48 +00:00
imp
ef05fd0de0 Intel drives have an optimal alignment for I/O. While they honor I/Os
that cross this boundary, they perform better when this isn't the
case. Intel uses the 3rd byte in the vendor specific area for
this. The DC P3500 was previously listed without any explanation. Add
the DC P3520 and DC P4500 to the list.

There won't be any other drives needing this quirk. Intel has
standardized a field in the namespace data in 1.3 (noiob).  A future
patch will use that if it exists, with fallback to this method.

Submitted by: Keith Busch
Reviewed by: jimharris@
2018-04-19 15:39:20 +00:00
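The boundary condition described in ef05fd0de0 can be sketched as a small check. This is only an illustration: the function name and the `boundary` parameter are hypothetical, and the driver's actual handling of the Intel vendor-specific byte is not shown in this log.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if an I/O of `count` blocks starting at `lba` spans a
 * multiple of `boundary` blocks. `boundary` is a hypothetical per-drive
 * value; zero means no known boundary.
 */
static bool
ex_crosses_boundary(uint64_t lba, uint64_t count, uint64_t boundary)
{
	if (boundary == 0 || count == 0)
		return (false);
	/* Compare the boundary-sized chunk of the first and last block. */
	return (lba / boundary != (lba + count - 1) / boundary);
}
```

An I/O for which this returns true could be split at the boundary before submission, which is one possible way to keep each request within a single aligned chunk.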
imp
8d536b9f2e Starting LBA is a 64bit number, so use htole64 instead of htole32. The
latter casts the LBA to a 32-bit number before assigning it to the 64
bit structure entity. This works fine on the first 2TB of TRIMs, but
fails badly beyond that due to truncation.

Also, add an assert to make sure we don't send too many DSM TRIM
entries in one request.

Sponsored by: Netflix
2018-03-20 03:37:14 +00:00
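The truncation fixed in 8d536b9f2e can be reproduced in userland. The sketch below does not use htole32/htole64; it uses hand-written little-endian helpers (all names hypothetical) so it builds anywhere, and it assumes 512-byte sectors, where 2^32 sectors corresponds to the 2TB limit mentioned above.

```c
#include <stdint.h>

/* Store a 64-bit value into buf in little-endian byte order. */
static void
ex_store_le64(uint8_t buf[8], uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t)(v >> (8 * i));
}

/* Read a little-endian 64-bit value back from buf. */
static uint64_t
ex_load_le64(const uint8_t buf[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)buf[i] << (8 * i);
	return (v);
}

/* Buggy path: the LBA is narrowed to 32 bits before it is stored. */
static uint64_t
ex_encode_lba_buggy(uint64_t lba)
{
	uint8_t buf[8];

	ex_store_le64(buf, (uint32_t)lba);
	return (ex_load_le64(buf));
}

/* Fixed path: all 64 bits of the LBA are stored. */
static uint64_t
ex_encode_lba_fixed(uint64_t lba)
{
	uint8_t buf[8];

	ex_store_le64(buf, lba);
	return (ex_load_le64(buf));
}
```

Any LBA below 2^32 survives both paths, which is why the bug only shows up on TRIMs past the first 2TB of the device.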
imp
0ac2e39d57 Try polling the qpairs on timeout.
On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.

This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.

PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
2018-03-16 05:23:48 +00:00
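The decision order described in 0ac2e39d57 can be written as a pure function. This is a sketch of the logic as the commit message states it, not the driver code; the enum and function names are hypothetical.

```c
#include <stdbool.h>

enum ex_timeout_action {
	EX_COMPLETE_PENDING,	/* polling found completed I/O */
	EX_ABORT_COMMAND,	/* send an abort for the timed-out command */
	EX_RESET_CONTROLLER	/* last resort */
};

/*
 * Decide what to do when a request times out.
 * poll_found_io is only meaningful when fatal_status is false, because
 * the queue is polled only if no fatal status was indicated.
 */
static enum ex_timeout_action
ex_on_timeout(bool fatal_status, bool poll_found_io, bool aborts_enabled)
{
	if (!fatal_status && poll_found_io)
		return (EX_COMPLETE_PENDING);
	if (!fatal_status && aborts_enabled)
		return (EX_ABORT_COMMAND);
	return (EX_RESET_CONTROLLER);
}
```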
imp
354a82d881 Fix error messages in cut and pasted code.
Also, fix an unnecessary deref to get ctrlr.

Noticed by: rpokala@
Sponsored by: Netflix
2018-03-14 23:28:28 +00:00
imp
35688b25ae When tearing down a queue pair, also delete the queue entries.
The NVME standard has required in section 7.2.6, since at least 1.1,
that a clean shutdown is signalled by deleting the submission and the
completion queues before setting the shutdown bit in CC. The 1.0
standard, apparently, did not and many of the early Intel cards didn't
care. Some newer cards care, at least one whose beta firmware can
scramble the card on an unclean shutdown. Linux has done this for some
time. To make it possible to move forward with an evaluation of this
pre-release card with wonky firmware, delete the queues on the card
when we delete the qpair structures.

Sponsored by: Netflix
2018-03-14 23:01:18 +00:00
imp
a8dedf4fe6 Don't make the namespace devices eternal.
We'll need to delete namespaces soon, so go ahead and stop making
these devices eternal. It doesn't help much, and will be getting in
the way soon.

Sponsored by: Netflix
2018-03-14 23:01:04 +00:00
imp
1ddb7299f0 Implement trim collapsing in nda
When multiple trims are in the queue, collapse them as much as
possible. At present, this usually results in only a few trims being
collapsed together, but more work on that will make it possible to do
hundreds (up to some configurable max).

Sponsored by: Netflix
2018-03-14 16:44:50 +00:00
mav
a7ab51623b Print fuses and fna fields in identify data.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-03-12 16:31:25 +00:00
mav
73b7aa323b Add new opcodes and statuses from NVMe 1.3a.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-03-11 06:30:09 +00:00
mav
559bce3bae Add new identify data structures fields from NVMe 1.3a.
Some of them are already supported by existing hardware, so reporting
them in `nvmecontrol identify` can be useful.
2018-03-11 05:09:02 +00:00
kevans
ba4f462a09 nvme: Unbreak LE builds after r329824
The parameter 'p' is unused if _BYTE_ORDER == _LITTLE_ENDIAN. Add in a
(void)p to fix the build.
2018-02-22 16:16:49 +00:00
wma
2858f9ff6e NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
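The shift-and-mask approach from 2858f9ff6e can be illustrated with a small example. The macro names below are hypothetical and are not the driver's; the field positions follow the NVMe CSTS register layout (RDY in bit 0, CFS in bit 1, SHST in bits 3:2). Bitfield layout is implementation-defined in C, whereas shifts and masks on a value already converted to host byte order behave the same on little- and big-endian hosts.

```c
#include <stdint.h>

/* Hypothetical field descriptors for a CSTS-like register. */
#define EX_CSTS_RDY_SHIFT	0
#define EX_CSTS_RDY_MASK	0x1u
#define EX_CSTS_CFS_SHIFT	1
#define EX_CSTS_CFS_MASK	0x1u
#define EX_CSTS_SHST_SHIFT	2
#define EX_CSTS_SHST_MASK	0x3u

/* Extract a field from a host-order register value. */
#define EX_GET(reg, field)						\
	(((reg) >> EX_##field##_SHIFT) & EX_##field##_MASK)

/* Return reg with the field replaced by val. */
#define EX_SET(reg, field, val)						\
	(((reg) & ~((uint32_t)EX_##field##_MASK << EX_##field##_SHIFT)) | \
	 (((uint32_t)(val) & EX_##field##_MASK) << EX_##field##_SHIFT))

/* Convert 4 little-endian bytes from the device to a host-order value. */
static uint32_t
ex_le32dec(const uint8_t b[4])
{
	return ((uint32_t)b[0] | (uint32_t)b[1] << 8 |
	    (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24);
}
```

The byte-order conversion happens once, on the whole register or structure member, and every field access after that is plain integer arithmetic.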
imp
56151d954c Backout r329818, r329816 and r329815.
These aren't the commits I thought I was testing prior to
commit. Revert until I can sort out what happened and fix it.
2018-02-22 11:18:33 +00:00
imp
25e0879e2e Combine BIO_DELETE requests for nda devices
Now that we're queueing BIO_DELETE requests in the CAM I/O scheduler,
it makes sense to try to combine as many as possible into a single
request to send down to hardware. Hopefully, lots of larger requests
like this are better than lots of individual transactions.

Note for future: need to limit based on total size of the trim
request. Should also collapse adjacent ranges where possible to
increase the size of the max payload.

Sponsored by: Netflix
2018-02-22 05:44:00 +00:00
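The "collapse adjacent ranges" idea noted for the future in 25e0879e2e could look roughly like the following. This is an assumption-laden sketch, not the nda code: the structure and function names are hypothetical, and the input is assumed to be sorted by starting LBA with no overlapping ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one trim range. */
struct ex_range {
	uint64_t lba;	/* first block of the range */
	uint64_t count;	/* number of blocks */
};

/*
 * Merge ranges that touch end-to-start, in place.
 * Input must be sorted by lba and non-overlapping.
 * Returns the number of ranges left after merging.
 */
static size_t
ex_collapse(struct ex_range *r, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return (0);
	for (size_t i = 1; i < n; i++) {
		if (r[out].lba + r[out].count == r[i].lba)
			r[out].count += r[i].count;	/* extend current range */
		else
			r[++out] = r[i];		/* start a new range */
	}
	return (out + 1);
}
```

Fewer, larger ranges leave more room in a single request, which is the stated motivation for collapsing.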
imp
07d4627b59 Use atomic load and stores to ensure that the compiler doesn't
optimize away these loops. Change boolean to int to match what atomic
API supplies. Remove wmb() since the atomic_store_rel() on status.done
orders the prior writes to status before it. It also fixes the fact that there
wasn't a rmb() before reading done. This should also be more efficient
since wmb() is fairly heavy weight.

Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053
2018-01-29 00:00:52 +00:00
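The release/acquire pairing described in 07d4627b59 can be sketched with C11 atomics. Note the substitution: this uses <stdatomic.h>, not the kernel's atomic(9) API, and the structure and function names are hypothetical. The release store on `done` publishes the earlier write to `result`, and the acquire load in the polling loop both prevents the compiler from hoisting the read out of the loop and orders the later read of `result`.

```c
#include <stdatomic.h>
#include <stdint.h>

struct ex_completion_status {
	uint32_t	result;	/* payload written by the completer */
	atomic_int	done;	/* int rather than bool, to suit the atomic API */
};

/* Completer side: write the payload, then publish it with a release store. */
static void
ex_complete(struct ex_completion_status *st, uint32_t result)
{
	st->result = result;
	atomic_store_explicit(&st->done, 1, memory_order_release);
}

/* Waiter side: spin with acquire loads until done is set, then read payload. */
static uint32_t
ex_wait(struct ex_completion_status *st)
{
	while (atomic_load_explicit(&st->done, memory_order_acquire) == 0)
		;	/* spin */
	return (st->result);
}

/* Single-threaded demonstration of the handshake. */
static uint32_t
ex_roundtrip(uint32_t value)
{
	struct ex_completion_status st = { .result = 0 };

	atomic_init(&st.done, 0);
	ex_complete(&st, value);
	return (ex_wait(&st));
}
```

In real use ex_complete() and ex_wait() would run in different contexts; the single-threaded ex_roundtrip() exists only to exercise the two halves.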
pfg
ced875130d Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
imp
5af8ebb2f8 Move setting of CAM_SIM_QUEUED to before we actually submit it to the
hardware. Setting it after is racy, and we can lose the race on a
heavily loaded system.

Reviewed by: scottl@, gallatin@
Sponsored by: Netflix
2018-01-17 17:08:26 +00:00
pfg
86c1e7ab7b dev: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these is likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the surrounding code could handle NULL values.
2018-01-13 22:30:30 +00:00
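The kind of overflow check that motivates 86c1e7ab7b can be shown with a userland analogue. This is not mallocarray(9) itself: the function names are hypothetical, and returning NULL on overflow is a choice made for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Return nonzero if nmemb * size would not fit in a size_t. */
static int
ex_mul_overflows(size_t nmemb, size_t size)
{
	return (size != 0 && nmemb > SIZE_MAX / size);
}

/* Allocate an array, refusing requests whose total size overflows. */
static void *
ex_alloc_array(size_t nmemb, size_t size)
{
	if (ex_mul_overflows(nmemb, size))
		return (NULL);
	return (malloc(nmemb * size));
}
```

With a bare `malloc(nmemb * size)`, an overflowing product silently wraps and yields an allocation that is smaller than the caller expects.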
imp
7cb94763be Return domain, bus, slot, and function for the transport settings in
PATH_INQ requests for nvme.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13546
2017-12-20 19:13:55 +00:00
imp
89f962e844 Although we only have one quirk at the moment, guard against the day
we have more than one by checking the actual quirk bit before delaying
the reset.

Noticed by: rpokala@
2017-12-18 20:11:21 +00:00
imp
f03c0527dd When we're disabling the nvme device, some drives have a controller
bug that requires 'hands off' for a period of time (2.3s) before we
check the RDY bit. Since this is a very odd quirk for a very limited
selection of drives, do this as a quirk. This prevented a successful
reset of the card when the card wedged.

Also, make sure that we comply with the advice from section 3.1.5 of
the 1.3 spec, which says that transitioning CC.EN from 0 to 1 when CSTS.RDY
is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has
undefined results". Short circuit when EN == RDY == desired state.

Finally, fail the reset if the disable fails. This will lead to a
failed device, which is what we want. (note: nda device needs
work for coping with a failed device).

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13389
2017-12-18 18:38:00 +00:00
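The EN/RDY rules quoted in f03c0527dd can be expressed as a small planning function. This is a sketch of the rule only, with hypothetical names; it does not model the 2.3s quirk delay or any timeout handling in the driver.

```c
#include <stdbool.h>

enum ex_en_action {
	EX_EN_NOTHING,	/* EN == RDY == desired: short circuit */
	EX_EN_WAIT_RDY,	/* EN != RDY: a previous transition is still in flight */
	EX_EN_WRITE	/* EN == RDY != desired: safe to flip CC.EN */
};

/*
 * Decide what to do given the current CC.EN and CSTS.RDY bits and the
 * desired enable state. Flipping EN while EN != RDY is the case the
 * spec describes as having undefined results, so wait instead.
 */
static enum ex_en_action
ex_plan_enable(bool en, bool rdy, bool desired)
{
	if (en == desired && rdy == desired)
		return (EX_EN_NOTHING);
	if (en != rdy)
		return (EX_EN_WAIT_RDY);
	return (EX_EN_WRITE);
}
```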
pfg
1537078d8f sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
2017-11-27 14:52:40 +00:00
imp
7fd8c5bc07 Inline pcie_link_{status,caps} where needed. Remove them as they
aren't really needed and I don't want to document them.

Suggested by: jhb@
Sponsored by: Netflix
2017-11-15 02:24:47 +00:00
imp
c00b8f3c13 Provide link speed data in XPT_GET_TRAN_SETTINGS. Provide full version
information for that and XPT_PATH_INQ. Provide macros to encode/decode
major/minor versions.  Read the link speed and lane count to compute
the base_transfer_speed for XPT_PATH_INQ.

Sponsored by: Netflix
2017-11-14 05:05:16 +00:00
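Version encode/decode macros of the kind mentioned in c00b8f3c13 might look like this. The macro names are hypothetical; the layout assumed here is that of the NVMe version register, with the major version in bits 31:16 and the minor version in bits 15:8.

```c
#include <stdint.h>

/* Pack a major/minor pair into a version-register-style value. */
#define EX_NVME_REV(major, minor)					\
	(((uint32_t)(major) << 16) | ((uint32_t)(minor) << 8))

/* Unpack the major and minor parts again. */
#define EX_NVME_MAJOR(rev)	(((uint32_t)(rev) >> 16) & 0xffffu)
#define EX_NVME_MINOR(rev)	(((uint32_t)(rev) >> 8) & 0xffu)
```

Because the major number sits in the higher bits, two packed values can be compared directly with `<` or `>=` to test for a minimum version.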
imp
d5eb569d3d Closer examination shows that nvme and CAM both normally zero-fill
allocations (for req and ccb, which ultimately contain the
nvme_cmd). As such, we can micro-optimize these routines. Add a
comment to this effect, and bzero the ccb used to make the requests
for the nda dump routine so it more closely matches a ccb allocated
with xpt_get_ccb().

Sponsored by: Netflix
2017-10-15 23:53:55 +00:00
imp
20da7e767b Use nvme_ctrlr_poll instead of nvme_ctrlr_intx_handler since it is
more general and doesn't try to access registers that may be undefined
when the card is in MSIX mode.

This change, along with r324630, r324631, r324632, makes nda crash
dumps work again. Previously, they only worked on CPU 0 when the stack
garbage was just so.

Sponsored by: Netflix
Suggested by: scottl@ (who provided earlier version of the patch)
2017-10-15 16:19:09 +00:00
imp
6feeaa6177 Create general polling function for the nvme controller. Use it when
we're doing the various pin-based interrupt modes. Adjust
nvme_ctrlr_intx_handler to use nvme_ctrlr_poll.

Sponsored by: Netflix
Suggested by: scottl@
2017-10-15 16:18:08 +00:00
imp
1edcf8dd07 Explicitly set reserved fields and 'fuse' to 0. This prevents us from
accidentally sending bogus values in these fields, which some drives
may reject with an error or worse (undefined behavior).

This is especially needed for the ndadump routine which allocates the
cmd from stack garbage....

Sponsored by: Netflix
2017-10-15 16:17:59 +00:00
imp
ff7911a913 Tweak performance of nda completions
Use xpt_done_direct in preference to xpt_done when completing a
successful I/O. Continue to use xpt_done when there's an error, or for
completion of the submission of a CCB. This eliminates a context
switch to the cam_doneq thread.

Sponsored by: Netflix
Suggested by: scottl@
2017-09-28 01:27:00 +00:00
imp
ec76f67a17 Fix queue depth for nda.
1/4 of the number of queues times queue entries is too limiting. It
works up to about 4k IOPS / 3.0GB/s for hardware that can do
4.4k/3.2GB/s with nvd. 3/4 works better, though it highlights issues
in the fairness of nda's choice of TRIM vs READ. That will be fixed
separately.
2017-09-20 21:42:25 +00:00
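The queue depth estimate discussed in ec76f67a17 is simple arithmetic: a fraction of the number of queues times the entries per queue. The helper below is hypothetical, and the 8 queues and 128 entries used in the note after it are made-up illustration values, not figures from the commit.

```c
#include <assert.h>

/*
 * Estimate the number of transactions to allow in flight as
 * queues * entries * num / den, e.g. num/den = 1/4 or 3/4.
 */
static unsigned int
ex_max_transactions(unsigned int queues, unsigned int entries,
    unsigned int num, unsigned int den)
{
	assert(den != 0);
	return (queues * entries * num / den);
}
```

With 8 queues of 128 entries, a 1/4 fraction gives 256 and a 3/4 fraction gives 768.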
kib
d9a260e5d3 The nvme module should explicitly declare dependency on the cam.
If both nvme and cam are compiled as modules, nvme cannot be kldloaded
otherwise.

Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-31 14:21:32 +00:00
imp
20c4a6f953 Fix a few overlooked spots where the code uses 16-bit NSIDs. Chuck
Tuffli had submitted a more thorough patch that I was unaware of when
I did my work and this brings in the bits I missed from that patch.

PR: 220267
Submitted by: Chuck Tuffli
2017-08-29 15:46:34 +00:00
imp
6267655c2e Add CAM/NVMe support for CAM_DATA_SG
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
2017-08-29 15:29:57 +00:00
imp
f1cb0bb9b1 Add new compile-time option NVME_USE_NVD that sets the default value
of the runtime hw.nvme.use_nvd tunable. We still default to nvd unless
otherwise requested.

Sponsored by: Netflix
2017-08-28 23:54:25 +00:00
imp
5d815f473d Set the max transactions for NVMe drives better.
Provided a better estimate for the number of transactions that can be
pending at one time. This will be number of queues * number of
trackers / 4, as suggested by Jim Harris. This gives a better estimate
of the number of transactions that CAM should queue before applying
back pressure. This should be revisited when we have real multi-queue
support in CAM and the upper layers of the I/O stack.

Sponsored by: Netflix
2017-08-28 23:54:20 +00:00
imp
64912a07d3 Fill in reserved areas from NVMe spec in the IDENTIFY structure
(struct nvme_controller_data) as defined in the NVM Express
specification, revision 1.3.

Sponsored by: Netflix
2017-08-25 21:38:43 +00:00
imp
9ce3042b0f NVME Namespace ID is 32-bits, so widen interface to reflect that.
Sponsored by: Netflix
2017-08-25 21:38:38 +00:00
imp
2fd4069941 Add feature codes from NVMe 1.3 specification:
o Autonomous Power State Transition
o Host Memory Buffer
o Timestamp
o Keep Alive Timer
o Host Controlled Thermal Management
o Non-Operational Power State Config

Also note that feature codes 0x78-0x7f are reserved for the NVMe
Management Interface.

Sponsored by: Netflix
2017-08-25 21:38:29 +00:00
imp
1c9cc9fd2e Use _Static_assert
These files are compiled in userland too, so we can't use sys/systm.h
and rely on CTASSERT. Switch to using _Static_assert instead.

MFC After: 3 days
Sponsored by: Netflix
2017-08-25 04:33:06 +00:00
imp
81eb7962f0 Sanity check sizes
Add compile time sanity checks to make sure that packed structures are
the proper size, typically as defined in the NVMe standard.
2017-08-25 04:05:53 +00:00
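The compile-time size checks from 81eb7962f0 and 1c9cc9fd2e follow a pattern like the one below. The structure is a hypothetical stand-in modeled on a 16-byte NVMe completion entry, not the driver's definition; the kernel headers also mark such structures packed, which is omitted here because this layout has no padding.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-byte completion queue entry. */
struct ex_nvme_completion {
	uint32_t	cdw0;	/* command-specific result */
	uint32_t	rsvd1;	/* reserved */
	uint16_t	sqhd;	/* submission queue head pointer */
	uint16_t	sqid;	/* submission queue identifier */
	uint16_t	cid;	/* command identifier */
	uint16_t	status;	/* phase tag and status field */
};

/* Fails the build, in kernel or userland, if the layout ever changes size. */
_Static_assert(sizeof(struct ex_nvme_completion) == 16,
    "bad size for ex_nvme_completion");
```

_Static_assert is part of C11, so it is available wherever the header is compiled, which is the reason given for preferring it over CTASSERT.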
imp
977d8e54a1 Enable bus mastering on the device before resetting the device. The
card has to do PCIe transactions to complete the reset process, but
can't do them, per the PCIe spec, unless bus mastering is enabled.

Submitted by: Kinjal Patel
PR: 22166
2017-08-25 03:15:18 +00:00