Commit Graph

127 Commits

Author SHA1 Message Date
mav
5c07241f1a Revert r292074 (by smh): Limit stripesize reported from nvd(4) to 4K
I believe that this patch handled the problem from the wrong side.
Instead of making ZFS properly handle large stripe sizes, it made an
unrelated driver lie about its reported parameters to work around that.

Alternative solution for this problem from ZFS side was committed at
r296615.

Discussed with:	smh
2016-03-10 17:13:10 +00:00
jimharris
419105461b nvme: fix intx handler to not dereference ioq during initialization
This was a regression from r293328, which deferred allocation
of the controller's ioq array until after interrupts are enabled
during boot.

PR:		207432
Reported and tested by: Andy Carrel <wac@google.com>
MFC after:	3 days
Sponsored by:	Intel
2016-02-24 00:01:10 +00:00
jhibbits
fbc9874dd0 Replace several bus_alloc_resource() calls using default arguments with bus_alloc_resource_any()
Since these calls only use default arguments, bus_alloc_resource_any() is the
right call.

Differential Revision: https://reviews.freebsd.org/D5306
2016-02-19 03:37:56 +00:00
jimharris
4deaf20e8b nvme: avoid duplicate SET_NUM_QUEUES commands
nvme(4) issues a SET_NUM_QUEUES command during device
initialization to ensure enough I/O queues exist for each
of the MSI-X vectors we have allocated.  The SET_NUM_QUEUES
command is then issued again during nvme_ctrlr_start(), to
ensure that it is properly set after any controller reset.

At least one NVMe drive exists which fails this second
SET_NUM_QUEUES command during device initialization.  So
change nvme_ctrlr_start() to only issue its SET_NUM_QUEUES
command when it is coming out of a reset - avoiding the
duplicate SET_NUM_QUEUES during device initialization.

Reported by:	gallatin
MFC after:	3 days
Sponsored by:	Intel
2016-02-11 17:32:41 +00:00
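The start-path change can be sketched in userspace C as below. This is a simplified illustration, not the driver's code: struct ctrlr_sketch, issue_set_num_queues(), and the resetting flag are hypothetical names. The point is only that SET_NUM_QUEUES is sent once during initialization and again only after a reset.

```c
#include <stdbool.h>

/* Hypothetical, simplified controller state for illustration only. */
struct ctrlr_sketch {
	int num_io_queues;
	int set_num_queues_sent;	/* counts SET_NUM_QUEUES commands issued */
};

/* Stand-in for submitting SET_FEATURES/NUMBER_OF_QUEUES to the device. */
static void
issue_set_num_queues(struct ctrlr_sketch *c)
{
	c->set_num_queues_sent++;
}

/* Device initialization: the queue count is negotiated once here. */
static void
ctrlr_init(struct ctrlr_sketch *c, int nqueues)
{
	c->num_io_queues = nqueues;
	issue_set_num_queues(c);
}

/*
 * Start path: re-issue SET_NUM_QUEUES only when recovering from a
 * controller reset, which discards the previously negotiated value.
 */
static void
ctrlr_start(struct ctrlr_sketch *c, bool resetting)
{
	if (resetting)
		issue_set_num_queues(c);
	/* ... create I/O queues, enable interrupts, etc. ... */
}
```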
imp
b58cf84475 Implement a power command to list all power modes, report the
current power mode, and set the power mode.
2016-01-30 22:48:06 +00:00
jimharris
d613b647e0 nvme: replace NVME_CEILING macro with howmany()
Suggested by:	rpokala
MFC after:	3 days
2016-01-07 20:35:26 +00:00
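For reference, howmany() is the round-up division macro from FreeBSD's <sys/param.h>. The sketch below reproduces that definition (guarded in case the header is already included) and adds a hypothetical helper, pages_for_transfer(), that counts the 4 KiB pages touched by a page-aligned transfer.

```c
/*
 * Round-up integer division, matching the definition in FreeBSD's
 * <sys/param.h>.  Guarded so it does not clash if that header is included.
 */
#ifndef howmany
#define	howmany(x, y)	(((x) + ((y) - 1)) / (y))
#endif

/* Hypothetical example: 4 KiB pages touched by a page-aligned transfer. */
static unsigned
pages_for_transfer(unsigned bytes)
{
	return (howmany(bytes, 4096u));
}
```

Like any function-like macro, howmany() evaluates y twice, so arguments with side effects should be avoided.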
jimharris
94f3dfd067 nvme: add hw.nvme.min_cpus_per_ioq tunable
Due to FreeBSD system-wide limits on number of MSI-X vectors
(https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321),
it may be desirable to allocate fewer than the maximum number
of vectors for an NVMe device, in order to save vectors for
other devices (usually Ethernet) that can take better
advantage of them and may be probed after NVMe.

This tunable is expressed in terms of the minimum number of CPUs
per I/O queue instead of the maximum number of queues per controller,
to allow for a more even distribution of CPUs per queue.  This
avoids cases where some CPUs have a dedicated queue while
other CPUs need to share queues.  Ideally the PR referenced
above will eventually be fixed, and the mechanism implemented
here will become obsolete anyway.

While here, fix a bug in the CPUs per I/O queue calculation to
properly account for the admin queue's MSI-X vector.

Reviewed by:	gallatin
MFC after:	3 days
Sponsored by:	Intel
2016-01-07 20:32:04 +00:00
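A rough sketch of the queue-count calculation described above, assuming the tunable is clamped to [1, ncpus], that the CPU count is divided down (rounding toward fewer queues), and that one MSI-X vector is reserved for the admin queue. choose_num_io_queues() is a hypothetical name, and the exact arithmetic in nvme(4) may differ.

```c
static int
min_int(int a, int b)
{
	return (a < b ? a : b);
}

/*
 * Hypothetical sketch: pick an I/O queue count from the CPU count, a
 * minimum-CPUs-per-queue tunable, and the number of MSI-X vectors granted.
 */
static int
choose_num_io_queues(int ncpus, int min_cpus_per_ioq, int msix_vectors)
{
	int nqueues;

	/* Clamp the tunable to a sane range. */
	if (min_cpus_per_ioq < 1)
		min_cpus_per_ioq = 1;
	if (min_cpus_per_ioq > ncpus)
		min_cpus_per_ioq = ncpus;

	/* Never give a queue fewer than min_cpus_per_ioq CPUs. */
	nqueues = ncpus / min_cpus_per_ioq;

	/* The admin queue consumes one vector; I/O queues get the rest. */
	nqueues = min_int(nqueues, msix_vectors - 1);

	return (nqueues < 1 ? 1 : nqueues);
}
```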
jimharris
fbca74bfe1 nvme: do not revert to a single I/O queue when per-CPU queues are not possible
Previously nvme(4) would revert to a single I/O queue if it could not
allocate enough interrupt vectors or NVMe submission/completion queues
to have one I/O queue per core.  This patch determines how to utilize a
smaller number of available interrupt vectors, and assigns (as closely
as possible) an equal number of cores to each associated I/O queue.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:18:32 +00:00
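One way to assign cores to a reduced number of I/O queues as evenly as possible is to scale the CPU index onto the queue range, as sketched below. This is an illustrative technique chosen for this example, not necessarily the formula nvme(4) uses; cpu_to_ioq() is a hypothetical name, and it assumes nqueues <= ncpus so that every queue gets at least one CPU.

```c
/*
 * Map a CPU to an I/O queue so that each queue serves either
 * floor(ncpus / nqueues) or ceil(ncpus / nqueues) CPUs.
 * Sketch only; assumes 0 <= cpu < ncpus and 1 <= nqueues <= ncpus.
 */
static int
cpu_to_ioq(int cpu, int ncpus, int nqueues)
{
	/* Scale cpu in [0, ncpus) onto [0, nqueues); integer math floors. */
	return (cpu * nqueues / ncpus);
}
```

With 10 CPUs and 4 queues this yields 3, 2, 3, and 2 CPUs per queue, whereas a plain cpu / howmany(ncpus, nqueues) mapping would yield 3, 3, 3, and 1.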
jimharris
0c3ac2f3b1 nvme: break out interrupt setup code into a separate function
MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:12:42 +00:00
jimharris
35417c6f67 nvme: do not pre-allocate MSI-X IRQ resources
The issue referenced here was resolved by other changes
in recent commits, so this code is no longer needed.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:11:31 +00:00
jimharris
6f7e8e5393 nvme: remove per_cpu_io_queues from struct nvme_controller
Instead just use num_io_queues to make this determination.

This prepares for some future changes enabling use of multiple
queues when we do not have enough queues or MSI-X vectors
for one queue per CPU.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:09:56 +00:00
jimharris
5fd9219620 nvme: simplify some of the nested ifs in interrupt setup code
This prepares for some follow-up commits which do more work in
this area.

MFC after:	3 days
Sponsored by:	Intel
2016-01-07 16:08:04 +00:00
smh
0026debd97 Limit stripesize reported from nvd(4) to 4K
Intel NVMe controllers have a slow path for I/Os that span a 128KB stripe boundary, but ZFS limits ashift, which is derived from d_stripesize, to 13 (8KB), so we limit the stripesize reported to geom(8) to 4KB.

This may cause a small number of additional I/Os to require splitting in nvme(4); however, the NVMe I/O path is very efficient, so these additional I/Os will cause very little (if any) difference in performance or CPU utilisation.

This can be controlled by the new sysctl kern.nvme.max_optimal_sectorsize.

MFC after:	1 week
Sponsored by:	Multiplay
Differential Revision:	https://reviews.freebsd.org/D4446
2015-12-11 02:06:03 +00:00
jimharris
b93e62f3e6 nvd, nvme: report stripesize through GEOM disk layer
MFC after:	3 days
Sponsored by:	Intel
2015-10-30 16:35:18 +00:00
jimharris
f472a0a088 nvme: fix race condition in split bio completion path
Fixes a race condition observed under the following circumstances:

1) I/O split on 128KB boundary with Intel NVMe controller.
   Current Intel controllers produce better latency when
   I/Os do not span a 128KB boundary - even if the I/O size
   itself is less than 128KB.
2) Per-CPU I/O queues are enabled.
3) Child I/Os are submitted on different submission queues.
4) Interrupts for child I/O completions occur almost
   simultaneously.
5) ithread for child I/O A increments bio_inbed, then is
   immediately preempted (rendezvous IPI, higher priority
   interrupt).
6) ithread for child I/O B increments bio_inbed, then completes
   parent bio since all children are now completed.
7) parent bio is freed, and immediately reallocated for a VFS
   or gpart bio (including setting bio_children to 1 and
   clearing bio_driver1).
8) ithread for child I/O A resumes processing.  bio_children
   for what it thinks is the parent bio is set to 1, so it
   thinks it needs to complete the parent bio.

The result is either a call through a NULL callback function or a
double free of the bio to its uma zone.

PR:		203746
Reported by:	Drew Gallatin <gallatin@netflix.com>,
		Marc Goroff <mgoroff@quorum.net>
Tested by:	Drew Gallatin <gallatin@netflix.com>
MFC after:	3 days
Sponsored by:	Intel
2015-10-30 16:06:34 +00:00
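The race above comes from reading shared parent state after the completion counter has been incremented. A common remedy is to make the decision from the return value of an atomic fetch-and-add and to read everything needed beforehand, as in the C11 sketch below. struct parent_req and child_done() are hypothetical stand-ins for struct bio handling; this illustrates the general technique and is not claimed to be the committed fix.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical parent request; not struct bio. */
struct parent_req {
	int		children;	/* set before any child is submitted */
	atomic_int	inbed;		/* number of children completed */
	int		completed;	/* times the parent was completed */
};

/*
 * Called once per child completion, possibly concurrently.  Each caller
 * gets a distinct value back from atomic_fetch_add(), so exactly one
 * caller (the one that observes children - 1) completes the parent.
 * The children count is read before the increment, because once the
 * increment is visible another thread may complete and free the parent.
 */
static bool
child_done(struct parent_req *p)
{
	int children = p->children;

	if (atomic_fetch_add(&p->inbed, 1) + 1 == children) {
		p->completed++;		/* stand-in for completing the parent */
		return (true);
	}
	return (false);
}
```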
jimharris
aebe5a7c16 nvme: do not notify a consumer about failures that occur during initialization
MFC after:	3 days
Sponsored by:	Intel
2015-07-29 21:29:50 +00:00
jeff
3fb666cfae Refactor unmapped buffer address handling.
 - Use pointer assignment rather than a combination of pointers and
   flags to switch buffers between unmapped and mapped.  This eliminates
   multiple flags and generally simplifies the logic.
 - Eliminate b_saveaddr since it is only used with pager bufs which have
   their b_data re-initialized on each allocation.
 - Gather up some convenience routines in the buffer cache for
   manipulating buf space and buf malloc space.
 - Add an inline, buf_mapped(), to standardize checks around unmapped
   buffers.

In collaboration with: mlaier
Reviewed by:	kib
Tested by:	pho (many small revisions ago)
Sponsored by:	EMC / Isilon Storage Division
2015-07-23 19:13:41 +00:00
jimharris
05e7b607e8 nvme: ensure csts.rdy bit is cleared before returning from nvme_ctrlr_disable
PR:		200458
MFC after:	3 days
Sponsored by:	Intel
2015-07-23 15:50:39 +00:00
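The disable handshake can be sketched as a bounded poll of CSTS.RDY (bit 0 of the CSTS register in the NVMe specification) after CC.EN is cleared. The callbacks, the 1 ms poll interval, and the toy register model used to exercise it are assumptions for illustration; in the kernel the wait would be bounded by the controller's advertised CAP.TO timeout.

```c
#include <stdbool.h>
#include <stdint.h>

#define	CSTS_RDY	(1u << 0)	/* CSTS.RDY, bit 0 of the CSTS register */

/*
 * Hypothetical sketch: after clearing CC.EN, poll CSTS until RDY reads 0
 * or the timeout expires.  read_csts() and sleep_ms() are caller-supplied
 * stand-ins for a register read and a sleeping delay.
 */
static bool
wait_for_not_ready(uint32_t (*read_csts)(void *), void (*sleep_ms)(int),
    void *arg, int timeout_ms)
{
	int waited;

	for (waited = 0; waited <= timeout_ms; waited++) {
		if ((read_csts(arg) & CSTS_RDY) == 0)
			return (true);	/* controller has finished disabling */
		sleep_ms(1);
	}
	return (false);			/* RDY never cleared; report failure */
}

/* Toy register model: RDY stays set for the first busy_reads reads. */
struct fake_ctrlr {
	int busy_reads;
};

static uint32_t
fake_read_csts(void *arg)
{
	struct fake_ctrlr *fc = arg;

	if (fc->busy_reads > 0) {
		fc->busy_reads--;
		return (CSTS_RDY);
	}
	return (0);
}

static void
no_sleep(int ms)
{
	(void)ms;	/* a real driver would sleep here instead of spinning */
}
```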
jimharris
1521348f88 nvme: properly handle case where pci_alloc_msix does not alloc all vectors
Reported by: Sean Kelly <smkelly@smkelly.org>
MFC after:	3 days
Sponsored by:	Intel
2015-07-23 15:35:08 +00:00
jimharris
8ae9141498 nvme: use BUS_SPACE_MAXSIZE for bus_dma_tag_create maxsize parameter
This fixes i386 PAE build fallout from r281281.

Reported by:	bz
MFC after:	1 week
2015-04-09 00:37:55 +00:00
jimharris
e60358657f nvme: remove CHATHAM related code
Chatham was an internal NVMe prototype board used for
early driver development.

MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:52:06 +00:00
jimharris
eef38a1316 nvme: add device strings for Intel DC series NVMe SSDs
MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:50:45 +00:00
jimharris
f54beeaff1 nvme: create separate DMA tag for non-payload DMA buffers
Submission and completion queue memory need to use a
separate DMA tag from payload buffers for their mappings,
to ensure mappings remain contiguous even with DMAR
enabled.

Submitted by:	kib
MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:49:45 +00:00
jimharris
d87d9c87c8 nvme: fall back to a smaller MSI-X vector allocation if necessary
Previously, if per-CPU MSI-X vectors could not be allocated,
nvme(4) would fall back to INTx with a single I/O queue pair.
This change will still fall back to a single I/O queue pair, but
allocate MSI-X vectors instead of reverting to INTx.

MFC after:	1 week
Sponsored by:	Intel
2015-04-08 21:46:18 +00:00
jimharris
ba5ebd0f6e Use bitwise OR instead of logical OR when constructing value for
SET_FEATURES/NUMBER_OF_QUEUES command.

Sponsored by:	Intel
MFC after:	3 days
2014-06-10 21:40:43 +00:00
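In the NVMe specification, CDW11 of SET_FEATURES/NUMBER_OF_QUEUES packs the number of completion queues requested into bits 31:16 and the number of submission queues requested into bits 15:0, both 0's based. The sketch below contrasts the correct bitwise construction with the logical-OR bug; the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Build CDW11 for SET_FEATURES/NUMBER_OF_QUEUES: bits 31:16 hold the
 * number of completion queues requested and bits 15:0 the number of
 * submission queues requested, both as 0's based values (count - 1).
 * Assumes num_cq and num_sq are at least 1.
 */
static uint32_t
num_queues_cdw11(uint16_t num_cq, uint16_t num_sq)
{
	return (((uint32_t)(num_cq - 1) << 16) | (uint32_t)(num_sq - 1));
}

/* The buggy form: || yields 0 or 1, not a combined bit pattern. */
static uint32_t
num_queues_cdw11_buggy(uint16_t num_cq, uint16_t num_sq)
{
	return (((uint32_t)(num_cq - 1) << 16) || (uint32_t)(num_sq - 1));
}
```

For 8 queues of each type the correct value is 0x00070007, while the logical-OR version collapses to 1, which asks for two submission queues and one completion queue.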
jimharris
02a90ad562 nvme: Allocate all MSI resources up front so that we can fall back to
INTx if necessary.

Sponsored by:	Intel
MFC after:	3 days
2014-03-18 18:10:35 +00:00
jimharris
d8e987e9b2 nvme: Close hole where nvd(4) would not be notified of all nvme(4)
instances if modules loaded during boot.

Sponsored by:	Intel
MFC after:	3 days
2014-03-18 18:09:08 +00:00
jimharris
51675bd43f nvme: NVMe specification dictates 4-byte alignment for PRPs (not 8).
Sponsored by:	Intel
MFC after:	3 days
2014-03-17 22:37:17 +00:00
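The alignment rule amounts to requiring that the low two bits of each PRP entry be zero. A minimal check, with a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

/* PRP entries must be DWORD (4-byte) aligned: bits 1:0 must be zero. */
static bool
prp_addr_aligned(uint64_t addr)
{
	return ((addr & 0x3) == 0);
}
```

An 8-byte requirement would wrongly reject an address such as 0x1004, which is valid under the 4-byte rule.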
jimharris
797abe9803 nvme: Remove the software progress marker SET_FEATURE command during
controller initialization.

The spec says OS drivers should send this command after controller
initialization completes successfully, but other NVMe OS drivers are
not sending this command.  This change will therefore reduce differences
between the FreeBSD and other OS drivers.

Sponsored by:	Intel
MFC after:	3 days
2014-03-17 22:36:04 +00:00
jimharris
e31eb3d992 For IDENTIFY passthrough commands to Chatham prototype controllers, copy
the spoofed identify data into the user buffer rather than issuing the
command to the controller, since Chatham IDENTIFY data is always spoofed.

While here, fix a bug in the spoofed data for Chatham submission and
completion queue entry sizes.

Sponsored by:	Intel
MFC after:	3 days
2014-01-06 23:51:26 +00:00
jimharris
85f1ea0fa6 Create a unique unit number for each controller and namespace cdev.
Sponsored by:	Intel
MFC after:	3 days
2013-11-01 23:30:54 +00:00
jimharris
2b26115030 Fix the LINT build.
Approved by:	re (implicit)
MFC after:	1 week
2013-10-08 23:23:04 +00:00
jimharris
bb769cc348 Do not leak resources during attach if nvme_ctrlr_construct() or the initial
controller resets fail.

Sponsored by:	Intel
Reviewed by:	carl
Approved by:	re (hrs)
MFC after:	1 week
2013-10-08 16:01:43 +00:00
jimharris
64e2a5a8e6 Log and then disable asynchronous notification of persistent events after
they occur.

This prevents repeated notifications of the same event.

Status of these events may be viewed at any time by viewing the
SMART/Health Info Page using nvmecontrol, whether or not asynchronous
event notifications for those events are enabled.  This log page can
be viewed using:

    nvmecontrol logpage -p 2 <ctrlr id>

Future enhancements may re-enable these notifications on a periodic basis
so that if the notified condition persists, it will continue to be logged.

Sponsored by:	Intel
Reviewed by:	carl
Approved by:	re (hrs)
MFC after:	1 week
2013-10-08 16:00:12 +00:00
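The log-then-disable behavior can be sketched as clearing the fired critical-warning bits from the set of enabled events. The bit values follow the SMART/Health critical warning byte in the NVMe specification; handle_critical_warning() is hypothetical, and writing the new mask back with SET_FEATURES (Asynchronous Event Configuration) is left to the caller.

```c
#include <stdint.h>
#include <stdio.h>

/* SMART/Health critical warning bits (byte 0 of log page 2). */
#define	CW_AVAILABLE_SPARE	(1u << 0)
#define	CW_TEMPERATURE		(1u << 1)
#define	CW_RELIABILITY		(1u << 2)
#define	CW_READ_ONLY		(1u << 3)
#define	CW_VOLATILE_BACKUP	(1u << 4)

/*
 * Hypothetical sketch: log the warnings that fired, then drop them from
 * the enabled mask so the same persistent condition is not reported
 * again.  Returns the new mask for the caller to program.
 */
static uint8_t
handle_critical_warning(uint8_t enabled_mask, uint8_t warnings)
{
	uint8_t fired = warnings & enabled_mask;

	if (fired != 0)
		fprintf(stderr, "critical warning(s) 0x%02x; disabling\n", fired);
	return ((uint8_t)(enabled_mask & ~fired));
}
```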
jimharris
9cdb85e5c1 Do not enable temperature threshold as an asynchronous event notification
on NVMe controllers that do not support it.

Sponsored by:	Intel
Reviewed by:	carl
Approved by:	re (hrs)
MFC after:	1 week
2013-10-08 15:49:14 +00:00
jimharris
bb66cfd2ae Extend some 32-bit fields and variables to 64-bit to prevent overflow
when calculating stats in nvmecontrol perftest.

Sponsored by:	Intel
Reported by:	Joe Golio <joseph.golio@emc.com>
Reviewed by:	carl
Approved by:	re (hrs)
MFC after:	1 week
2013-10-08 15:47:22 +00:00
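The overflow is easy to reproduce with byte counts. In the hypothetical sketch below, 2,000,000 I/Os of 4096 bytes (8,192,000,000 bytes) wrap when the product is computed in 32 bits but not when it is widened to 64 bits first; the function names are illustrative, not nvmecontrol's.

```c
#include <stdint.h>

/* Total bytes computed in 32 bits: wraps modulo 2^32 on overflow. */
static uint32_t
total_bytes_32(uint32_t ios_completed, uint32_t io_size)
{
	return (ios_completed * io_size);
}

/* Total bytes computed in 64 bits: io_size is promoted before multiplying. */
static uint64_t
total_bytes_64(uint64_t ios_completed, uint32_t io_size)
{
	return (ios_completed * io_size);
}
```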
jimharris
509a795193 Add driver-assisted striping for upcoming Intel NVMe controllers that can
benefit from it.

Sponsored by:	Intel
Reviewed by:	kib (earlier version), carl
Approved by:	re (hrs)
MFC after:	1 week
2013-10-08 15:44:04 +00:00
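Driver-assisted striping boils down to splitting an I/O wherever it crosses a stripe boundary. The sketch below assumes the 128KB boundary mentioned in later commits and uses hypothetical helper names; the real split path in nvme(4) also has to allocate and track the child bios.

```c
#include <stdint.h>

#define	STRIPE_SIZE	(128u * 1024u)	/* assumed stripe boundary */

/*
 * Hypothetical sketch: length of the first piece of an I/O at byte
 * offset off such that the piece does not cross a STRIPE_SIZE boundary.
 */
static uint32_t
first_chunk_len(uint64_t off, uint32_t len)
{
	uint32_t to_boundary = STRIPE_SIZE - (uint32_t)(off % STRIPE_SIZE);

	return (len < to_boundary ? len : to_boundary);
}

/* Count the child I/Os needed to cover [off, off + len). */
static unsigned
count_children(uint64_t off, uint32_t len)
{
	unsigned n = 0;

	while (len > 0) {
		uint32_t chunk = first_chunk_len(off, len);

		off += chunk;
		len -= chunk;
		n++;
	}
	return (n);
}
```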
ken
5591de079d Change the way that unmapped I/O capability is advertised.
The previous method was to set the D_UNMAPPED_IO flag in the cdevsw
for the driver.  The problem with this is that in many cases (e.g.
sa(4)) there may be some instances of the driver that can handle
unmapped I/O and some that can't.  The isp(4) driver can handle
unmapped I/O, but the esp(4) driver currently cannot.  The cdevsw
is shared among all driver instances.

So instead of setting a flag on the cdevsw, set a flag on the cdev.
This allows drivers to indicate support for unmapped I/O on a
per-instance basis.

sys/conf.h:	Remove the D_UNMAPPED_IO cdevsw flag and replace it
		with an SI_UNMAPPED cdev flag.

kern_physio.c:	Look at the cdev SI_UNMAPPED flag to determine
		whether or not a particular driver can handle
		unmapped I/O.

geom_dev.c:	Set the SI_UNMAPPED flag for all GEOM cdevs.
		Since GEOM will create a temporary mapping when
		needed, setting SI_UNMAPPED unconditionally will
		work.

		Remove the D_UNMAPPED_IO flag.

nvme_ns.c:	Set the SI_UNMAPPED flag on cdevs created here
		if NVME_UNMAPPED_BIO_SUPPORT is enabled.

vfs_aio.c:	In aio_qphysio(), check the SI_UNMAPPED flag on a
		cdev instead of the D_UNMAPPED_IO flag on the cdevsw.

sys/param.h:	Bump __FreeBSD_version to 1000045 for the switch from
		setting the D_UNMAPPED_IO flag in the cdevsw to setting
		SI_UNMAPPED in the cdev.

Reviewed by:	kib, jimharris
MFC after:	1 week
Sponsored by:	Spectra Logic
2013-08-15 22:52:39 +00:00
jimharris
53b17a5f06 If a controller fails to initialize, do not notify consumers (nvd) of its
namespaces.

Sponsored by:	Intel
Reviewed by:	carl
MFC after:	3 days
2013-08-13 21:49:32 +00:00
jimharris
3f846da35a Send a shutdown notification in the driver unload path, to ensure
the notification is sent when the system shuts down with the driver
unloaded.

Sponsored by:	Intel
Reviewed by:	carl
MFC after:	3 days
2013-08-13 21:47:08 +00:00
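For context, a normal shutdown notification in the NVMe specification means writing 01b to CC.SHN (bits 15:14) and then waiting until CSTS.SHST (bits 3:2) reads 10b (shutdown processing complete). The helpers below only show that bit manipulation; the names are hypothetical and register access is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define	CC_SHN_SHIFT		14
#define	CC_SHN_MASK		(0x3u << CC_SHN_SHIFT)
#define	CC_SHN_NORMAL		(0x1u << CC_SHN_SHIFT)
#define	CSTS_SHST_SHIFT		2
#define	CSTS_SHST_MASK		(0x3u << CSTS_SHST_SHIFT)
#define	CSTS_SHST_COMPLETE	(0x2u << CSTS_SHST_SHIFT)

/* New CC value requesting a normal shutdown, preserving other fields. */
static uint32_t
cc_request_normal_shutdown(uint32_t cc)
{
	return ((cc & ~CC_SHN_MASK) | CC_SHN_NORMAL);
}

/* True once CSTS reports that shutdown processing is complete. */
static bool
csts_shutdown_complete(uint32_t csts)
{
	return ((csts & CSTS_SHST_MASK) == CSTS_SHST_COMPLETE);
}
```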
jimharris
52bfa150c7 Add message when nvd disks are attached and detached.
As part of this commit, add an nvme_strvis() function which borrows
heavily from cam_strvis().  This will allow stripping of
leading/trailing whitespace and also handle unprintable characters
in model/serial numbers.  This function goes into a new nvme_util.c
file which is used by both the driver and nvmecontrol.

Sponsored by:	Intel
Reviewed by:	carl
MFC after:	3 days
2013-07-19 21:40:57 +00:00
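A hypothetical sketch of what such a helper does with a fixed-width, space-padded IDENTIFY field, such as the 20-byte serial number: trim the padding and render unprintable bytes visibly (here as \ooo octal escapes). The name and signature are illustrative and are not those of nvme_strvis().

```c
#include <stddef.h>

/*
 * Copy a fixed-width field (not NUL-terminated) into a C string,
 * trimming leading spaces and trailing spaces/NULs, and escaping
 * unprintable bytes as \ooo.  dstlen includes room for the final NUL.
 */
static void
strvis_sketch(char *dst, size_t dstlen, const unsigned char *src, size_t srclen)
{
	size_t start = 0, end = srclen, o = 0;

	if (dstlen == 0)
		return;
	while (start < end && src[start] == ' ')
		start++;
	while (end > start && (src[end - 1] == ' ' || src[end - 1] == '\0'))
		end--;

	for (size_t i = start; i < end; i++) {
		unsigned char c = src[i];

		if (c >= 0x20 && c < 0x7f) {		/* printable ASCII */
			if (o + 1 >= dstlen)
				break;
			dst[o++] = (char)c;
		} else {				/* escape as \ooo */
			if (o + 4 >= dstlen)
				break;
			dst[o++] = '\\';
			dst[o++] = (char)('0' + ((c >> 6) & 7));
			dst[o++] = (char)('0' + ((c >> 3) & 7));
			dst[o++] = (char)('0' + (c & 7));
		}
	}
	dst[o] = '\0';
}
```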
jimharris
c3dfb166ee Fix nvme(4) and nvd(4) to support non-512-byte sector sizes.
Recent testing with QEMU that has variable sector size support for
NVMe uncovered some of these issues.  Chatham prototype boards supported
only 512 byte sectors.

Sponsored by:	Intel
Reviewed by:	carl
MFC after:	3 days
2013-07-19 21:33:24 +00:00
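Supporting other sector sizes mostly means deriving LBAs from the namespace's LBA data size instead of a hard-coded 512. A minimal conversion sketch, with a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a byte range to a starting LBA and block count for a namespace
 * whose LBA data size is sector_size bytes (512, 4096, ...).  nlb is a
 * plain count; the Read/Write command's NLB field would be nlb - 1,
 * since it is 0's based.  Returns false for unaligned or empty ranges.
 */
static bool
bytes_to_lba(uint64_t offset, uint64_t length, uint32_t sector_size,
    uint64_t *slba, uint32_t *nlb)
{
	if (length == 0 || offset % sector_size != 0 ||
	    length % sector_size != 0)
		return (false);
	*slba = offset / sector_size;
	*nlb = (uint32_t)(length / sector_size);
	return (true);
}
```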
jimharris
9f183750a6 Use pause() instead of DELAY() when polling for completion of admin
commands during controller initialization.

DELAY() does not work here during config_intrhook context - we need to
explicitly relinquish the CPU for the admin command completion to
get processed.

Sponsored by:	Intel
Reported by:	Adam Brooks <adam.j.brooks@intel.com>
Reviewed by:	carl
MFC after:	3 days
2013-07-17 23:26:56 +00:00
jimharris
8281445679 Define constants for the lengths of the serial number, model number
and firmware revision in the controller's identify structure.

Also modify consumers of these fields to ensure they only use the
specified number of bytes for their respective fields.

Sponsored by:	Intel
Reviewed by:	carl
MFC after:	3 days
2013-07-17 23:23:38 +00:00
jimharris
ae0660c354 Fix a poorly worded comment in nvme(4).
MFC after:	3 days
2013-07-11 15:02:38 +00:00
jimharris
6a4189c5fd Add comment explaining why CACHE_LINE_SIZE is defined in nvme_private.h
if not already defined elsewhere.

Requested by:	attilio
MFC after:	3 days
2013-07-09 21:24:19 +00:00
jimharris
d7c0528dab Update copyright dates.
MFC after:	3 days
2013-07-09 21:22:17 +00:00
jimharris
1dabbdc24c Do not retry failed async event requests.
Sponsored by:	Intel
MFC after:	3 days
2013-07-09 21:03:39 +00:00
jimharris
44e3ab8eb0 Add pci_enable_busmaster() and pci_disable_busmaster() calls in
nvme_attach() and nvme_detach() respectively.

Sponsored by:	Intel
MFC after:	3 days
2013-07-09 21:02:45 +00:00
jimharris
c15f698fb4 Add firmware replacement and activation support to nvmecontrol(8) through
a new firmware command.

NVMe controllers may support up to 7 firmware slots for storing
different firmware revisions.  This new firmware command supports
firmware replacement (i.e. firmware download) with or without immediate
activation, or activation of a previously stored firmware image.  It
also supports selection of the firmware slot during replacement
operations, using IDENTIFY information from the controller to
check that the specified slot is valid.

Newly activated firmware does not take effect until the next controller
reset, either via a reboot or a separate 'nvmecontrol reset' command to the
same controller.

Submitted by:	Joe Golio <joseph.golio@emc.com>
Obtained from:	EMC / Isilon Storage Division
MFC after:	3 days
2013-06-27 00:08:25 +00:00
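The slot check can be sketched from the Firmware Updates (FRMW) byte of the IDENTIFY controller data, where bit 0 indicates that slot 1 is read-only and bits 3:1 give the number of slots (1 to 7). fw_slot_valid_for_download() is a hypothetical helper, not nvmecontrol's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a firmware image may be downloaded into a slot, based
 * on the FRMW byte: bit 0 = slot 1 read-only, bits 3:1 = number of slots.
 */
static bool
fw_slot_valid_for_download(uint8_t frmw, int slot)
{
	int num_slots = (frmw >> 1) & 0x7;
	bool slot1_ro = (frmw & 0x1) != 0;

	if (slot < 1 || slot > num_slots)
		return (false);		/* slots are numbered 1..num_slots */
	if (slot == 1 && slot1_ro)
		return (false);		/* cannot replace read-only slot 1 */
	return (true);
}
```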