Commit Graph

270 Commits

Author SHA1 Message Date
Chuck Tuffli
b1f1471064 Fix nda(4) PCIe link status output
Differentiate between PCI Express Endpoint devices and Root Complex
Integrated Endpoints in the nda driver. The Link Status and Capability
registers are not valid for Integrated Endpoints and should not be
displayed. The bhyve emulated NVMe device will advertise as being an
Integrated Endpoint.

Reviewed by:	imp
Approved byL	imp (mentor)
Differential Revision: https://reviews.freebsd.org/D20282
2019-06-07 18:34:48 +00:00
Warner Losh
d0aaeffdb4 Since a fatal trap can happen at aribtrary times, don't panic when the
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.

Reviewed by: jhb and markj (both prior versions)
Differential Revision:  https://reviews.freebsd.org/D20478
2019-06-01 15:37:44 +00:00
Warner Losh
9835d216d8 rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs
Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to
match init/uninit scenarios.

Signed-off-by: John Meneghini <johnm@netapp.com>
Submitted by: Michael Hordijk <hordijk@netapp.com>
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D19781
2019-05-08 20:18:11 +00:00
Alexander Motin
1aed499575 Decode Deallocate Logical Block Features.
MFC after:	1 week
2019-05-05 15:47:21 +00:00
Warner Losh
2ffd6fce5b Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
2019-03-09 01:18:16 +00:00
Warner Losh
95108cadbc Add ABORTED_BY_REQUEST to the list of things we look at DNR bit and tell why to comment (code already does this) 2019-03-03 03:36:33 +00:00
Warner Losh
45d7e233a5 Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
2019-02-27 22:16:59 +00:00
Warner Losh
d706306d49 Remove #ifdef code to support FreeBSD versions that haven't been
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.
2019-02-27 22:05:01 +00:00
Warner Losh
52467047aa Regularize the Netflix copyright
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
   piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.

Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
2019-02-04 21:28:25 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Chuck Tuffli
5312bd8eb0 Add NVMe drive to NOIOB quirk list
Dell-branded Intel P4600 NVMe drives benefit from NVMe 1.3's NOIOB
feature. Unfortunately just like Intel DC P4500s, they don't advertise
themselves as benefiting from this...

This changes adds P4600s to the existing list of old drives which
benefit from striping.

PR:		233969
Submitted by:	David Fugate <dave.fugate@gmail.com>
Reviewed by:	imp, mav
Approved by:	imp (mentor)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18772
2019-01-08 15:30:56 +00:00
Alexander Motin
a646135771 Add descriptions to NVMe interrupts.
MFC after:	1 month
2018-12-26 23:41:52 +00:00
Alexander Motin
511662d083 Remove CAM SIM lock from NVMe SIM.
CAM does not require SIM lock since FreeBSD 10.4, and NVMe code never
required it at all, using per-queue locks instead.  This formally allows
parallel request submission in CAM mode as much as single per-device and
per-queue locks of CAM allow.

MFC after:	1 month
2018-12-24 23:28:11 +00:00
Chuck Tuffli
87b3975e36 nda(4) fix check for Dataset Management support
In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support
BIO_DELETE) if the drive supports Dataset Management. There are reports
that without this check, VMWare Workstation does not work reliably.

Fix is to check the ONCS field in the NVMe Controller Data structure for
support. This check previously existed but did not survive the
big-endian changes.

Reported by: yuripv@yuripv.net
Reviewed by: imp, mav, jimharris
Approved by: imp (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18493
2018-12-13 13:25:37 +00:00
Warner Losh
91182bcfb6 Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli
(and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3.

Sponsored by: Netflix
2018-12-07 21:58:08 +00:00
Warner Losh
14343799dc Remove do-nothing nvme_modevent.
nvme_modevent no longer does anything interesting, remove it.

Sponsored by: Netflix
2018-11-16 16:51:44 +00:00
Warner Losh
85939654f3 Use atomic_load_acq_int() here too to poll done, ala r328521 2018-11-13 22:41:20 +00:00
Warner Losh
09efa3dfb2 Put a workaround in for command timeout malfunctioning
At least one NVMe drive has a bug that makeing the Command Time Out
PCIe feature unreliable. The workaround is to disable this
feature. The driver wouldn't deal correctly with a timeout anyway.
Only do this for drives that are known bad.

Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D17708
2018-10-26 14:27:37 +00:00
Chuck Tuffli
9544e6dcf1 Make NVMe compatible with the original API
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).

This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.

This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.

Bump __FreeBSD_version to 1200081 due to API change

Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
2018-08-22 04:29:24 +00:00
Justin Hibbits
2e0090af65 nvme(4): Add bus_dmamap_sync() at the end of the request path
Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.

Prior to this change, when running 'make buildworld -j72' on a 18-core
(72-thread) POWER9, I would see controller resets often.  With this change, I
don't see these resets messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix.  Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.

Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570
2018-08-03 20:04:06 +00:00
Alexander Motin
f439e3a4ff Refactor NVMe CAM integration.
- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
 - Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
 - Add initial support for namespace change async events.  So far only
in CAM mode, but it allows run-time namespace arrival and departure.
 - Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.

Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.

Reviewed by:	imp
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2018-05-25 03:34:33 +00:00
Warner Losh
041f49aece Remove the 'All Rights Reserved' clause from some of the stuff I've
done for Netflix, since I'm in the neighborhood.
2018-05-09 20:32:23 +00:00
Alexander Motin
c252f63740 Fix LOR between controller and queue locks.
Admin pass-through requests took controller lock before the queue lock,
but in case of request submission to a failed controller controller lock
was taken after the queue lock.  Fix that by reducing the lock scopes and
switching to mtx_pool locks to track pass-through request completion.

Sponsored by:	iXsystems, Inc.
2018-05-02 20:13:03 +00:00
Alexander Motin
e134ecdcfc Improve nvme(4) attach/detach sequences.
This change allows clean device detach on attach failures and driver unload,
while previous code tried to talk to already shut down controller, or even
accessed resources failed to allocate.

Sponsored by:	iXsystems, Inc.
2018-04-30 23:05:57 +00:00
Alexander Motin
c6c70c0746 Fix use-after-free in nvme_qpair_destroy().
dma_tag_payload should not be destroyed before payload_dma_map, and seems
it should be used there instead of dma_tag to match creation.

Sponsored by:	iXsystems, Inc.
2018-04-30 21:28:10 +00:00
Alexander Motin
e4c7e3a1b9 Set si_drv1 for nvmeXnsY in a new race-free way.
r332897 switched to new KPI, but havent used its main benefit.

Sponsored by:	iXsystems, Inc.
2018-04-30 19:21:20 +00:00
Warner Losh
76583d573d Migrate to make_dev_s interface to populate /dev/nvmeX entries
Submitted by: Michael Hordijk
Differential Revision: https://reviews.freebsd.org/D15162
2018-04-23 22:30:17 +00:00
Warner Losh
e8bef32ce2 Reword comment to remove awkward constructs, including an "it's" that
shouldn't have been there at all (it wasn't a typo for its, rather a
left-over from an older revision of the comment).

Noticed by: many
2018-04-19 16:05:48 +00:00
Warner Losh
b3e85e7a79 Intel drives have an optimal alignment for I/O. While they honor I/Os
that cross this boundary, they perform better when this isn't the
case. Intel uses the 3rd byte in the vendor specific area for
this. The DC P3500 was previously listed without any explanation. Add
the DC P3520 and DC P4500 to the list.

There won't be any others drives needing this quirk. Intel has
standardized a field in the namespace data in 1.3 (noiob).  A future
patch will use that if it exists, with fallback to this method.

Submitted by: Keith Busch
Reviewed by: jimharris@
2018-04-19 15:39:20 +00:00
Warner Losh
afdbfe1e1b Starting LBA is a 64bit number, so use htole64 instead of htole32. The
latter casts the LBA to a 32-bit number before assigning it to the 64
bit structure entity. This works fine on the first 2TB of TRIMs, but
terrible beyond that due to trucation.

Also, add an assert to make sure we don't end too many DSM TRIM
entries in one request.

Sponsored by: Netflix
2018-03-20 03:37:14 +00:00
Warner Losh
d85d964829 Try polling the qpairs on timeout.
On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.

This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.

PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
2018-03-16 05:23:48 +00:00
Warner Losh
5d7fd8f726 Fix error messages in cut and pasted code.
Also, fix an unnecessary deref to get ctrlr.

Noticed by: rpokala@
Sponsored by: Netflix
2018-03-14 23:28:28 +00:00
Warner Losh
8b1e6ebe0e When tearing down a queue pair, also delete the queue entries.
The NVME standard has required in section 7.2.6, since at least 1.1,
that a clean shutdown is signalled by deleting the subission and the
completion queues before setting the shutdown bit in CC. The 1.0
standard, apparently, did not and many of the early Intel cards didn't
care. Some newer cards care, at least one whose beta firmware can
scramble the card on an unclean shutdown. Linux has done this for some
time. To make it possible to move forward with an evaluation of this
pre-release card with wonky firmware, delete the queues on the card
when we delete the qpair structures.

Sponsored by: Netflix
2018-03-14 23:01:18 +00:00
Warner Losh
d61cf64d0e Don't make the namespace devices eternal.
We'll need to delete namespaces soon, so go ahead and stop making
these devices eternal. It doesn't help much, and will be getting in
the way soon.

Sponsored by: Netflix
2018-03-14 23:01:04 +00:00
Warner Losh
807e94b2c3 Implement trim collapsing in nda
When multiple trims are in the queue, collapse them as much as
possible. At present, this usually results in only a few trims being
collapsed together, but more work on that will make it possible to do
hundreds (up to some configurable max).

Sponsored by: Netflix
2018-03-14 16:44:50 +00:00
Alexander Motin
01c1be35e0 Print fuses and fna fields in identify data.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-03-12 16:31:25 +00:00
Alexander Motin
6b1a96b16b Add new opcodes and statuses from NVMe 1.3a.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-03-11 06:30:09 +00:00
Alexander Motin
3fa5467a06 Add new identify data structures fields from NVMe 1.3a.
Some of them are already supported by existing hardware, so reporting
them `nvmecontrol identify` can be useful.
2018-03-11 05:09:02 +00:00
Kyle Evans
afdc2600c2 nvme: Unbreak LE builds after r329824
The parameter 'p' is unused if _BYTE_ORDER == _LITTLE_ENDIAN. Add in a
(void)p to fix the build.
2018-02-22 16:16:49 +00:00
Wojciech Macek
0d787e9b35 NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
Warner Losh
0028abe633 Backout r329818, r329816 and r329815.
These aren't the commits I thought I was testing prior to
commit. Revert until I can sort out what happened and fix it.
2018-02-22 11:18:33 +00:00
Warner Losh
4d87e27125 Combine BIO_DELETE requests for nda devices
Now that we're queueing BIO_DELETE requests in the CAM I/O scheduler,
it make sense to try to combine as many as possible into a single
request to send down to hardware. Hopefully, lots of larger requests
like this are better than lots of individual transactions.

Note for future: need to limit based on total size of the trim
request. Should also collapse adjacent ranges where possible to
increase the size of the max payload.

Sponsored by: Netflix
2018-02-22 05:44:00 +00:00
Warner Losh
29077eb456 Use atomic load and stores to ensure that the compiler doesn't
optimize away these loops. Change boolean to int to match what atomic
API supplies. Remove wmb() since the atomic_store_rel() on status.done
ensure the prior writes to status. It also fixes the fact that there
wasn't a rmb() before reading done. This should also be more efficient
since wmb() is fairly heavy weight.

Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053
2018-01-29 00:00:52 +00:00
Pedro F. Giffuni
ac2fffa4b7 Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
Warner Losh
7e5f6f2588 Move setting of CAM_SIM_QUEUED to before we actually submit it to the
hardware. Setting it after is racy, and we can lose the race on a
heavily loaded system.

Reviewed by: scottl@, gallatin@
Sponsored by: Netflix
2018-01-17 17:08:26 +00:00
Pedro F. Giffuni
26c1d774b5 dev: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these is likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
2018-01-13 22:30:30 +00:00
Warner Losh
4484c8f5d2 Return domain, bus, slot, and function for the transport settings in
PATH_INQ requests for nvme.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13546
2017-12-20 19:13:55 +00:00
Warner Losh
989c7f0b7c Although we only have one quirk at the moment, guard against the day
we have more than one by checking the actual quirk bit before delaying
the reset.

Noticed by: rpokala@
2017-12-18 20:11:21 +00:00
Warner Losh
ce1ec9c178 When we're disabling the nvme device, some drives have a controller
bug that requires 'hands off' for a period of time (2.3s) before we
check the RDY bit. Sicne this is a very odd quirk for a very limited
selection of drives, do this as a quirk. This prevented a successful
reset of the card when the card wedged.

Also, make sure that we comply with the advice from section 3.1.5 of
the 1.3 spec says that transitioning CC.EN from 0 to 1 when CSTS.RDY
is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has
undefined results". Short circuit when EN == RDY == desired state.

Finally, fail the reset if the disable fails. This will lead to a
failed device, which is what we want. (note: nda device needs
work for coping with a failed device).

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13389
2017-12-18 18:38:00 +00:00
Pedro F. Giffuni
718cf2ccb9 sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 14:52:40 +00:00
Warner Losh
eab9d0a85b Inline pcie_link_{status,caps} where needed. Remove them as they
aren't really needed and I don't want to document them.

Suggested by: jhb@
Sponsored by: Netflix
2017-11-15 02:24:47 +00:00
Warner Losh
4e3b274457 Provide link speed data in XPT_GET_TRAN_SETTINGS. Provide full version
information for that and XPT_PATH_INQ. Provide macros to encode/decode
major/minor versions.  Read the link speed and lane count to compute
the base_transfer_speed for XPT_PATH_INQ.

Sponsored by: Netflix
2017-11-14 05:05:16 +00:00
Warner Losh
fa271a5d09 Closer examination shows that nvme and CAM both normally zero-fill
allocations (for req and ccb, which ultimately contain the
nvme_cmd). As such, we can micro-optimize these routines. Add a
comment to this effect, and bzero the ccb used to make the requests
for the nda dump rotuine so it more closely matches a ccb allocated
with xpt_get_ccb().

Sponsored by: Netflix
2017-10-15 23:53:55 +00:00
Warner Losh
29431e54b9 Use nvme_ctrlr_poll instead of nvme_ctrlr_intx_handler since it is
more general and doesn't try to access registers that may be undefined
when the card is in MSIX mode.

This change, along with r324630, r324631, r324632, makes nda crash
dumps work again. Previously, they only worked on CPU 0 when the stack
garbage was just so.

Sponsored by: Netflix
Suggested by: scottl@ (who provided earlier version of the patch)
2017-10-15 16:19:09 +00:00
Warner Losh
bb1c7be429 Create general polling function for the nvme controller. Use it when
we're doing the various pin-based interrupt modes. Adjust
nvme_ctrlr_intx_handler to use nvme_ctrlr_poll.

Sponsored by: Netflix
Suggested by: scottl@
2017-10-15 16:18:08 +00:00
Warner Losh
fbed8df259 Explicitly set reserved fields and 'fuse' to 0. This prevents us from
acidentally sending bogus values in these fields, which some drives
may reject with an error or worse (undefined behavior).

This is especially needed for the ndadump routine which allocates the
cmd from stack garbage....

Sponsored by: Netflix
2017-10-15 16:17:59 +00:00
Warner Losh
cfb43eb12e Tweak performance of nda completions
Use xpt_done_direct in preference to xpt_done when completing a
successful I/O. Continue to use xpt_done when there's an error, or for
completion of the submission of a CCB. This eliminates a context
switch to the cam_doneq thread.

Sponsored by: Netflix
Suggested by: scottl@
2017-09-28 01:27:00 +00:00
Warner Losh
5fff95cc1d Fix queue depth for nda.
1/4 of the number of queues times queue entries is too limiting. It
works up to about 4k IOPS / 3.0GB/s for hardware that can do
4.4k/3.2GB/s with nvd. 3/4 works better, though it highlights issues
in the fairness of nda's choice of TRIM vs READ. That will be fixed
separately.
2017-09-20 21:42:25 +00:00
Konstantin Belousov
5a21cd1941 The nvme module should explicitly declare dependency on the cam.
If both nvme and cam are compiled as modules, nvme cannot be kldloaded
otherwise.

Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-08-31 14:21:32 +00:00
Warner Losh
c2005bba77 Fix a few overlooked spots where the coded uses 16-bit NSIDs. Chuck
Tuffli had submitted a more thorough patch that I was unaware of when
I did my work and this brings in the bits I missed from that patch.

PR: 220267
Submitted by: Chuck Tuffli
2017-08-29 15:46:34 +00:00
Warner Losh
519772814d Add CAM/NVMe support for CAM_DATA_SG
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
2017-08-29 15:29:57 +00:00
Warner Losh
850564b948 Add new compile-time option NVME_USE_NVD that sets the default value
of the runtime hw.nvme.use_vnd tunable. We still default to nvd unless
otherwise requested.

Sponsored by: Netflix
2017-08-28 23:54:25 +00:00
Warner Losh
c02565f9fa Set the max transactions for NVMe drives better.
Provided a better estimate for the number of transactions that can be
pending at one time. This will be number of queues * number of
trackers / 4, as suggested by Jim Harris. This gives a better estimate
of the number of transactions that CAM should queue before applying
back pressure. This should be revisted when we have real multi-queue
support in CAM and the upper layers of the I/O stack.

Sponsored by: Netflix
2017-08-28 23:54:20 +00:00
Warner Losh
030edcce02 Fill in reserved areas from NVMe spec in the IDENTIFY structure
(struct nvme_controller_data) as defined in the NVM Express
specification, revsion 1.3.

Sponsored by: Netflix
2017-08-25 21:38:43 +00:00
Warner Losh
696c950297 NVME Namespace ID is 32-bits, so widen interface to reflect that.
Sponsored by: Netflix
2017-08-25 21:38:38 +00:00
Warner Losh
223a9b93ac Add feature codes from NVMe 1.3 specification:
o Automomous Power State Transition
o Host Memory Buffer
o Timestamp
o Keep Alive Timer
o Host Controlled Thermal Management
o Non-Operational Power State Config

Also note that feature codes 0x78-0x7f are reserved for the NVMe
Management Interface.

Sponsored by: Netflix
2017-08-25 21:38:29 +00:00
Warner Losh
0012e436e3 Use _Static_assert
These files are compiled in userland too, so we can't use sys/systm.h
and rely on CTASSERT. Switch to using _Static_assert instead.

MFC After: 3 days
Sponsored by: Netflix
2017-08-25 04:33:06 +00:00
Warner Losh
0c26c1992f Sanity check sizes
Add compile time sanity checks to make sure that packed structures are
the proper size, typically as defined in the NVMe standard.
2017-08-25 04:05:53 +00:00
Warner Losh
abb61405a6 Enable bus mastering on the device before resetting the device. The
card has to do PCIe transactions to complete the reset process, but
can't do them, per the PCIe spec, unless bus mastering is enabled.

Submitted by: Kinjal Patel
PR: 22166
2017-08-25 03:15:18 +00:00
Nathan Whitehorn
c670f31f19 Move NVME controller shutdown from being called as part of module unloading
to being called through the newbus DEVICE_SHUTDOWN() path. This ensures that
the NVME controller gets shut down before the device and bus disappear
and prevents data corruption on shutdown on at least Samsung EVO 960 SSDs.

PR:		kern/211852
Reviewed by:	imp
MFC after:	2 weeks
2017-08-12 22:13:06 +00:00
Warner Losh
d0e75394cf Use the correct queue depth for nda devices.
Submitted by: Matt Williams
2017-08-08 16:06:16 +00:00
Warner Losh
8a5d94f94d Make nvd vs nda choice boot-time rather than build-time
Introduce hw.nvme.use_nvd tunable. This tunable allows both nvd and
nda to be installed in the kernel, while allowing only one of them to
create devices. This is an all-or-nothing setting, and you can't
change it after boot-time. However, it will allow easier A/B testing.

Differential Revision: https://reviews.freebsd.org/D11825
2017-08-04 03:40:01 +00:00
Warner Losh
df4245150a This adds CAM pass(4) support for NVMe IO's. Applications indicate
the IO type (Admin or NVM) using XPT op-codes XPT_NVME_ADMIN or
XPT_NVME_IO.

Submitted by:   Chuck Tuffli <chuck@tuffli.net>
Differential Revision:  https://reviews.freebsd.org/D10247
2017-07-14 14:52:20 +00:00
Warner Losh
594ffc03cd Add new definitions for namespaces.
Sponsored by: Netflix
Submitted by: Matt Williams (via D11330)
2017-06-27 20:24:39 +00:00
Warner Losh
824073fbd6 Avoid dereferencing unintialized elements in the error path.
Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninialized data.

Sponsored by: Netflix
2017-03-07 23:06:41 +00:00
Warner Losh
05ee702af6 cwd10 takes the low 32-bits and cwd11 takes the upper 32-bits of the
lba. Rather than do a cast to uint64_t, which clang warns might be
unaligned, do the stores 32-bits at a time.

Sponsored by: Netflix
2017-03-07 23:02:59 +00:00
Warner Losh
a8a18dd590 Make multi-namespace nvme drives more robust.
Fix assumptions about name spaces in NVME driver. First, it assumes
cdata.nn is the number of configured devices. However, it is the
number of supported name spaces. Second, it assumes that there will
never be more than 16 name spaces supported, but a certain drive I'm
testing reports 1024. It assumes that name spaces are a tightly packed
namespace, but the standard seems to indicate otherwise. Finally, it
assumes that an error would be generated when quearying an
unconfigured namespace. Instead, it succeeds but the identify data is
all zeros.

Fix these by limiting the number of name spaces we probe to 16. Remove
aborting when we find one in error. When the size of the name space is
zero, ignore it.

This is admittedly a bandaide. The long term fix will be to
participate in the enumeration and name space change protocols
definfed in the NVNe standard.

Sponsored by: Netflix
2017-03-07 21:47:54 +00:00
Warner Losh
adc8145e6f Remove obsolete comment after prior rev. 2017-02-19 17:38:17 +00:00
Alexander Motin
950c5aca4a Remove dead mentions of CAM target mode APIs from drivers.
This makes grepping kernel for target mode implementation much easier.
2017-02-19 17:27:58 +00:00
Warner Losh
a3a6c48d66 Ensure that the passthrough request will fit in MAXPHYS bytes after it
has been rounded to full pages. This avoids a panic in
vm_fault_quick_hold_pages due to this off-by-one error passing one
page too many into vmapbuf.
2017-02-02 23:04:06 +00:00
Ravi Pokala
d3c06026c2 In the same vein as r311350, fix whitespace in handling of XPT_PATH_INQ in
several more drivers.

Sponsored by:	Panasas
2017-01-05 03:08:57 +00:00
Alan Somers
4195c7de24 Always null-terminate ccb_pathinq.(sim_vid|hba_vid|dev_name)
The sim_vid, hba_vid, and dev_name fields of struct ccb_pathinq are
fixed-length strings. AFAICT the only place they're read is in
sbin/camcontrol/camcontrol.c, which assumes they'll be null-terminated.
However, the kernel doesn't null-terminate them. A bunch of copy-pasted code
uses strncpy to write them, and doesn't guarantee null-termination. For at
least 4 drivers (mpr, mps, ciss, and hyperv), the hba_vid field actually
overflows. You can see the result by doing "camcontrol negotiate da0 -v".

This change null-terminates those fields everywhere they're set in the
kernel. It also shortens a few strings to ensure they'll fit within the
16-character field.

PR:		215474
Reported by:	Coverity
CID:		1009997 1010000 1010001 1010002 1010003 1010004 1010005
CID:		1331519 1010006 1215097 1010007 1288967 1010008 1306000
CID:		1211924 1010009 1010010 1010011 1010012 1010013 1010014
CID:		1147190 1010017 1010016 1010018 1216435 1010020 1010021
CID:		1010022 1009666 1018185 1010023 1010025 1010026 1010027
CID:		1010028 1010029 1010030 1010031 1010033 1018186 1018187
CID:		1010035 1010036 1010042 1010041 1010040 1010039
Reviewed by:	imp, sephe, slm
MFC after:	4 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D9037
Differential Revision:	https://reviews.freebsd.org/D9038
2017-01-04 20:26:42 +00:00
Warner Losh
0cf14228c4 Implement HGST Log page 0xc1, as documented in the HGST SN100 and
SN150 product manuals. Subpage 0x32 is documented, but not implemented.

Sponsored by: Netflix, Inc
2016-11-19 17:13:08 +00:00
Warner Losh
ab1dd0917b Print Intel's expanded Temperature log page.
Sponsored by: Netflix, Inc
2016-11-19 17:13:03 +00:00
Warner Losh
d01f26f590 Add log pages that Intel SSDs provide. It turns out that many of these
are widely implemented beyond just Intel drives.

Sponsored by: Netflix, Inc
2016-11-19 17:12:58 +00:00
Warner Losh
aea528795a Add log pages defined through NVM Express 1.2.1.
Sponsored by: Netflix, Inc
2016-11-19 17:12:53 +00:00
Warner Losh
dc58cdf95e Expand the SMART / Health Information Log Page (Page 02) printout
based on NVM Express 1.2.1 Standard.

Sponsored by: Netflix, Inc
2016-11-19 17:12:49 +00:00
Scott Long
a965389b5a Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safery belts and error handling in related codepaths.

Reviewed by:	jimharris
Obtained from:	Netflix
Differential Revision:	D8453
2016-11-08 00:24:49 +00:00
Warner Losh
34dc8f1bb4 Kill a few stray debug printfs. 2016-07-28 22:40:31 +00:00
Warner Losh
3a31c31c22 Actually import nvme_sim so the CAM attachment for NVME (nda) actually
works.

MFC after: 1 week
2016-07-21 03:11:39 +00:00
Scott Long
49e20d2420 Supporting flushing the dump before returning, and simplify/combine the
logic.  Switch to a 5us delay since most NVME devices can easily do 200,000
iops.

Submitted by:	imp
MFC after:	3 days
Sponsored by:	Netflix, Inc.
2016-07-19 19:09:23 +00:00
Scott Long
a498975ef7 Implement crashdump support on NVME
MFC after:	3 days
Sponsored by:	Netflix, Inc.
2016-07-19 03:13:51 +00:00
Warner Losh
f24c011beb Commit the bits of nda that were missed. This should fix the build.
Approved by: re@
2016-06-10 06:04:53 +00:00
Alexander Motin
ee7f4d8187 Revert r292074 (by smh): Limit stripesize reported from nvd(4) to 4K
I believe that this patch handled the problem from the wrong side.
Instead of making ZFS properly handle large stripe sizes, it made
unrelated driver to lie in reported parameters to workaround that.

Alternative solution for this problem from ZFS side was committed at
r296615.

Discussed with:	smh
2016-03-10 17:13:10 +00:00
Jim Harris
361e1fb408 nvme: fix intx handler to not dereference ioq during initialization
This was a regression from r293328, which deferred allocation
of the controller's ioq array until after interrupts are enabled
during boot.

PR:		207432
Reported and tested by: Andy Carrel <wac@google.com>
MFC after:	3 days
Sponsored by:	Intel
2016-02-24 00:01:10 +00:00
Justin Hibbits
43cd61606b Replace several bus_alloc_resource() calls using default arguments with bus_alloc_resource_any()
Since these calls only use default arguments, bus_alloc_resource_any() is the
right call.

Differential Revision: https://reviews.freebsd.org/D5306
2016-02-19 03:37:56 +00:00
Jim Harris
7b036d7790 nvme: avoid duplicate SET_NUM_QUEUES commands
nvme(4) issues a SET_NUM_QUEUES command during device
initialization to ensure enough I/O queues exists for each
of the MSI-X vectors we have allocated.  The SET_NUM_QUEUES
command is then issued again during nvme_ctrlr_start(), to
ensure that is properly set after any controller reset.

At least one NVMe drive exists which fails this second
SET_NUM_QUEUES command during device initialization.  So
change nvme_ctrlr_start() to only issue its SET_NUM_QUEUES
command when it is coming out of a reset - avoiding the
duplicate SET_NUM_QUEUES during device initialization.

Reported by:	gallatin
MFC after:	3 days
Sponsored by:	Intel
2016-02-11 17:32:41 +00:00
Warner Losh
038659e7dd Implement power command to list all power modes, find out the power
mode we're in and to set the power mode.
2016-01-30 22:48:06 +00:00
Jim Harris
9c6b5d40eb nvme: replace NVME_CEILING macro with howmany()
Suggested by:	rpokala
MFC after:	3 days
2016-01-07 20:35:26 +00:00
Jim Harris
50dea2da12 nvme: add hw.nvme.min_cpus_per_ioq tunable
Due to FreeBSD system-wide limits on number of MSI-X vectors
(https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321),
it may be desirable to allocate fewer than the maximum number
of vectors for an NVMe device, in order to save vectors for
other devices (usually Ethernet) that can take better
advantage of them and may be probed after NVMe.

This tunable is expressed in terms of minimum number of CPUs
per I/O queue instead of max number of queues per controller,
to allow for a more even distribution of CPUs per queue.  This
avoids cases where some number of CPUs have a dedicated queue,
but other CPUs need to share queues.  Ideally the PR referenced
above will eventually be fixed and the mechanism implemented
here becomes obsolete anyways.

While here, fix a bug in the CPUs per I/O queue calculation to
properly account for the admin queue's MSI-X vector.

Reviewed by:	gallatin
MFC after:	3 days
Sponsored by:	Intel
2016-01-07 20:32:04 +00:00