o Automomous Power State Transition
o Host Memory Buffer
o Timestamp
o Keep Alive Timer
o Host Controlled Thermal Management
o Non-Operational Power State Config
Also note that feature codes 0x78-0x7f are reserved for the NVMe
Management Interface.
Sponsored by: Netflix
These files are compiled in userland too, so we can't use sys/systm.h
and rely on CTASSERT. Switch to using _Static_assert instead.
MFC After: 3 days
Sponsored by: Netflix
card has to do PCIe transactions to complete the reset process, but
can't do them, per the PCIe spec, unless bus mastering is enabled.
Submitted by: Kinjal Patel
PR: 22166
to being called through the newbus DEVICE_SHUTDOWN() path. This ensures that
the NVME controller gets shut down before the device and bus disappear
and prevents data corruption on shutdown on at least Samsung EVO 960 SSDs.
PR: kern/211852
Reviewed by: imp
MFC after: 2 weeks
Introduce hw.nvme.use_nvd tunable. This tunable allows both nvd and
nda to be installed in the kernel, while allowing only one of them to
create devices. This is an all-or-nothing setting, and you can't
change it after boot-time. However, it will allow easier A/B testing.
Differential Revision: https://reviews.freebsd.org/D11825
the IO type (Admin or NVM) using XPT op-codes XPT_NVME_ADMIN or
XPT_NVME_IO.
Submitted by: Chuck Tuffli <chuck@tuffli.net>
Differential Revision: https://reviews.freebsd.org/D10247
Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninialized data.
Sponsored by: Netflix
Fix assumptions about name spaces in NVME driver. First, it assumes
cdata.nn is the number of configured devices. However, it is the
number of supported name spaces. Second, it assumes that there will
never be more than 16 name spaces supported, but a certain drive I'm
testing reports 1024. It assumes that name spaces are a tightly packed
namespace, but the standard seems to indicate otherwise. Finally, it
assumes that an error would be generated when quearying an
unconfigured namespace. Instead, it succeeds but the identify data is
all zeros.
Fix these by limiting the number of name spaces we probe to 16. Remove
aborting when we find one in error. When the size of the name space is
zero, ignore it.
This is admittedly a bandaide. The long term fix will be to
participate in the enumeration and name space change protocols
definfed in the NVNe standard.
Sponsored by: Netflix
The sim_vid, hba_vid, and dev_name fields of struct ccb_pathinq are
fixed-length strings. AFAICT the only place they're read is in
sbin/camcontrol/camcontrol.c, which assumes they'll be null-terminated.
However, the kernel doesn't null-terminate them. A bunch of copy-pasted code
uses strncpy to write them, and doesn't guarantee null-termination. For at
least 4 drivers (mpr, mps, ciss, and hyperv), the hba_vid field actually
overflows. You can see the result by doing "camcontrol negotiate da0 -v".
This change null-terminates those fields everywhere they're set in the
kernel. It also shortens a few strings to ensure they'll fit within the
16-character field.
PR: 215474
Reported by: Coverity
CID: 1009997 1010000 1010001 1010002 1010003 1010004 1010005
CID: 1331519 1010006 1215097 1010007 1288967 1010008 1306000
CID: 1211924 1010009 1010010 1010011 1010012 1010013 1010014
CID: 1147190 1010017 1010016 1010018 1216435 1010020 1010021
CID: 1010022 1009666 1018185 1010023 1010025 1010026 1010027
CID: 1010028 1010029 1010030 1010031 1010033 1018186 1018187
CID: 1010035 1010036 1010042 1010041 1010040 1010039
Reviewed by: imp, sephe, slm
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D9037
Differential Revision: https://reviews.freebsd.org/D9038
I believe that this patch handled the problem from the wrong side.
Instead of making ZFS properly handle large stripe sizes, it made
unrelated driver to lie in reported parameters to workaround that.
Alternative solution for this problem from ZFS side was committed at
r296615.
Discussed with: smh
This was a regression from r293328, which deferred allocation
of the controller's ioq array until after interrupts are enabled
during boot.
PR: 207432
Reported and tested by: Andy Carrel <wac@google.com>
MFC after: 3 days
Sponsored by: Intel
nvme(4) issues a SET_NUM_QUEUES command during device
initialization to ensure enough I/O queues exists for each
of the MSI-X vectors we have allocated. The SET_NUM_QUEUES
command is then issued again during nvme_ctrlr_start(), to
ensure that is properly set after any controller reset.
At least one NVMe drive exists which fails this second
SET_NUM_QUEUES command during device initialization. So
change nvme_ctrlr_start() to only issue its SET_NUM_QUEUES
command when it is coming out of a reset - avoiding the
duplicate SET_NUM_QUEUES during device initialization.
Reported by: gallatin
MFC after: 3 days
Sponsored by: Intel
Due to FreeBSD system-wide limits on number of MSI-X vectors
(https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321),
it may be desirable to allocate fewer than the maximum number
of vectors for an NVMe device, in order to save vectors for
other devices (usually Ethernet) that can take better
advantage of them and may be probed after NVMe.
This tunable is expressed in terms of minimum number of CPUs
per I/O queue instead of max number of queues per controller,
to allow for a more even distribution of CPUs per queue. This
avoids cases where some number of CPUs have a dedicated queue,
but other CPUs need to share queues. Ideally the PR referenced
above will eventually be fixed and the mechanism implemented
here becomes obsolete anyways.
While here, fix a bug in the CPUs per I/O queue calculation to
properly account for the admin queue's MSI-X vector.
Reviewed by: gallatin
MFC after: 3 days
Sponsored by: Intel
Previously nvme(4) would revert to a signle I/O queue if it could not
allocate enought interrupt vectors or NVMe submission/completion queues
to have one I/O queue per core. This patch determines how to utilize a
smaller number of available interrupt vectors, and assigns (as closely
as possible) an equal number of cores to each associated I/O queue.
MFC after: 3 days
Sponsored by: Intel
Instead just use num_io_queues to make this determination.
This prepares for some future changes enabling use of multiple
queues when we do not have enough queues or MSI-X vectors
for one queue per CPU.
MFC after: 3 days
Sponsored by: Intel
Intel NVMe controllers have a slow path for I/Os that span a 128KB stripe boundary but ZFS limits ashift, which is derived from d_stripesize, to 13 (8KB) so we limit the stripesize reported to geom(8) to 4KB.
This may result in a small number of additional I/Os to require splitting in nvme(4), however the NVMe I/O path is very efficient so these additional I/Os will cause very minimal (if any) difference in performance or CPU utilisation.
This can be controller by the new sysctl kern.nvme.max_optimal_sectorsize.
MFC after: 1 week
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4446
Fixes race condition observed under following circumstances:
1) I/O split on 128KB boundary with Intel NVMe controller.
Current Intel controllers produce better latency when
I/Os do not span a 128KB boundary - even if the I/O size
itself is less than 128KB.
2) Per-CPU I/O queues are enabled.
3) Child I/Os are submitted on different submission queues.
4) Interrupts for child I/O completions occur almost
simultaneously.
5) ithread for child I/O A increments bio_inbed, then
immediately is preempted (rendezvous IPI, higher priority
interrupt).
6) ithread for child I/O B increments bio_inbed, then completes
parent bio since all children are now completed.
7) parent bio is freed, and immediately reallocated for a VFS
or gpart bio (including setting bio_children to 1 and
clearing bio_driver1).
8) ithread for child I/O A resumes processing. bio_children
for what it thinks is the parent bio is set to 1, so it
thinks it needs to complete the parent bio.
Result is either calling a NULL callback function, or double freeing
the bio to its uma zone.
PR: 203746
Reported by: Drew Gallatin <gallatin@netflix.com>,
Marc Goroff <mgoroff@quorum.net>
Tested by: Drew Gallatin <gallatin@netflix.com>
MFC after: 3 days
Sponsored by: Intel
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.
In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division