Commit Graph

297 Commits

Author SHA1 Message Date
imp
02feab2c54 nvme: Remove a wmb() that's not necessary.
bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can
be DMA'd immediately after it returns. The details differ, but this
mirrors atomic thread release semantics, at least for the buffers
synced.

For non-x86 platforms, bus_dmamap_sync() has the right syncing and
fences. So in the past, wmb() had been omitted for them.

For x86 platforms, the memory ordering is already strong enough to
ensure DMA to the device sees the current contents. As such, we don't
need the wmb() here. It translates to an sfence which is only needed
for writes to regions that have the write combining attribute set or
when some exotic opcodes are used. The nvme driver does neither of
these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we
can be assured any optimizer won't reorder the bus_dmamap_sync and the
bus_space_write operations. The wmb() was a vestiage of the pre-busdma
version initially committed to the tree.

Reviewed by: kib@, gallatin@, chuck@, mav@
Differential Revision: https://reviews.freebsd.org/D27448
2020-12-04 21:34:48 +00:00
mmel
970d081572 NVME: Multiple busdma related fixes.
- in nvme_qpair_process_completions() do dma sync before completion buffer
  is used.
- in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm
  and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure
  that all CPU stores are visible to external (including DMA) observers.
- Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems,
  buffers continuously owned (and accessed) by DMA must be allocated with this
  flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems
  (or coherent buses in mixed systems).

MFC after:	4 weeks
Reviewed by:	mav, imp
Differential Revision: https://reviews.freebsd.org/D27446
2020-12-02 16:54:24 +00:00
chuck
aae2f76c43 nvme: Fix typo in definition
Change occurrences of "selt test" to "self tests in the NVMe header
file.

Reviewed by:	imp, mav
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D27439
2020-12-02 15:59:08 +00:00
mmel
135aeef1ec Always use the __unused attribute even for potentially unused parameters.
Requested by:	ian, imp
MFC with:	r368167
2020-12-01 08:52:13 +00:00
mmel
7ce0a5ef8a Unbreak r368167 in userland. Decorate unused arguments.
Reported by:	kp, tuexen, jenkins, and many others
MFC with:	r368167
2020-11-30 14:51:48 +00:00
mmel
985c152bf8 NVME: Don't try to swap data on little endian machines.
These swapping functions violate BUSDMA contract - we cannot write
to armed (by bus_dmamap_sync(PRE_..)) buffers. Remove them at least
from little endian machines until a better solution will be developed.

Reviewed by:	imp
MFC after:	3 weeks
2020-11-30 07:01:12 +00:00
mav
b1a18877d1 Remove aligment requirements for passthrough buffer.
After r368124 vmapbuf() should happily map misaligned maxphys-sized buffers
thanks to extra page added to pbuf_zone.
2020-11-29 00:57:19 +00:00
mav
cc27bf440d Increase nvme(4) maximum transfer size from 1MB to 2MB.
With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.

On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs.  The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.

Reviewed by:	imp
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-11-29 00:20:31 +00:00
kib
ca7c13e470 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
mmel
f34454d248 Ensure that the buffer is in nvme_single_map() mapped to single segment.
Not a functional change.

MFC after:	1 week
2020-11-23 14:30:22 +00:00
mav
ba953d1e7d Add PMRCAP printing and fix earlier CAP_HI.
MFC after:	3 days
2020-11-14 01:45:34 +00:00
mav
56b2971c7c Fix panic if NVMe is detached before the intrhook call.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-11-12 20:20:43 +00:00
mjg
1280af460f nvme: change namei_request_zone into a malloc type
Both the size (128 bytes) and ephemeral nature of allocations make it a great
fit for malloc.

A dedicated zone unnecessarily avoids sharing buckets with 128-byte objects.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D27103
2020-11-05 21:44:58 +00:00
mav
1cd599537c Fix unintentional constant rename in r367109.
MFC after:	1 week
2020-10-28 18:22:25 +00:00
mav
e157fb4acd Print NVMe controller capabilities in verbose dmesg.
Those values are not reported in controller identification, while sometimes
interesting for development and debugging.

MFC after:	1 week
2020-10-28 15:43:29 +00:00
imp
3f7e39ba85 nvme: Remove compat code for older kernels
Remove code that supported pre-2011 kernels. CTLTYPE_S64 was defined
in rev 217616. All supported branches have it, so remove its compat
definition as OBE.
2020-10-24 01:59:01 +00:00
brooks
7d8530e564 vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf.  Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf.  (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by:	jhb
Reviewed by:	imp, jhb
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26784
2020-10-21 16:00:15 +00:00
mav
56d097d368 Use RTD3 Entry Latency value as shutdown timeout.
This field was not in specs when the driver was written, but now there
are SSDs with the reported latency of 10s, where hardcoded value of 5s
seems to be not enough sometimes, causing shutdown timeout messages.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-10-14 15:50:28 +00:00
dab
b7b729ccb8 Add an ioctl to get an NVMe device's maximum transfer size
Reviewed by:	imp, chuck
Obtained from:	Dell EMC Isilon
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26390
2020-09-21 15:41:47 +00:00
mjg
1748821a48 nvme: clean up empty lines in .c and .h files 2020-09-01 22:03:10 +00:00
imp
f7f8993035 Use symbolic names for asych events
Rather than |= 0x300, define and use asyn event names for the name
space changes and the firmware activations that we're asking for.
2020-08-31 19:38:03 +00:00
mav
0f75f27389 Report cpi->hba_* for nda(4) because why not.
MFC after:	1 week
2020-08-12 20:05:43 +00:00
markj
2215e2cd8f Remove free_domain() and uma_zfree_domain().
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplyify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937
2020-08-04 13:58:36 +00:00
mav
e7f2c302f6 Fix few panics on NVMe's timing out initialization requests.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-25 20:29:29 +00:00
mav
ab4eadd6ac Make polled request timeout less invasive.
Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests.  If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.

In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-18 19:16:03 +00:00
mav
c7e6bf9e22 Fix admin qpair leak if detached during initial reset.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-17 17:51:40 +00:00
mav
3a1e055ba4 Fix config_intrhook leak on initial reset failure.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-06-12 14:14:01 +00:00
dab
149ace228d Fix various Coverity-detected errors in nvme driver
This fixes several Coverity-detected errors in the nvme driver.

CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980

Reviewed by:	imp@, vangyzen@
MFC after:	5 days
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D24532
2020-05-02 20:47:58 +00:00
imp
b150ec9fa9 Add KASSERT to ensure sane nsid.
All callers are currently filtering bad nsid to this function,
however, we'll have undefined behavior if that's not true. Add the
KASSERT to prevent that.
2020-05-01 21:24:19 +00:00
imp
af609559e3 Rename ns notification function...
This function is called whenever the namespace is added, deleted or
changes. Update the name to reflect that. No functional change.
2020-05-01 21:24:15 +00:00
imp
ad29f78218 Style(9) nit: put function name at start of line. 2020-04-30 20:58:38 +00:00
imp
2da1d539e3 Move / reword a comment.
Explain what we're doing with mapping CAM's notion of a LUN to NVMe's
notion of a namespace.
2020-04-30 20:58:33 +00:00
imp
b8ce7e8464 Make sure that we get the sbuf resources we need.
Since we're calling sbuf_new with NOWAIT, make sure it can allocate a
buffer to use. Don't print anything if we can't get it.

Noticed by: rpokala
2020-04-30 00:43:11 +00:00
imp
18a1b03238 Return the nvmeX device associated with the ndaX device.
Add the nvmeX device to the XPT_PATH_INQ nvme specific
information. while one could figure this out by looking up the
domain🚌slot:function, it's a lot easier to have the SIM set it
directly since the sim knows this.
2020-04-30 00:43:02 +00:00
imp
07ca42c0db Generate a devctl event for interesting events
When we reset the controller, and when the controller tells us about a
critical warning, send an event.
2020-04-30 00:27:19 +00:00
emaste
4742a826c3 remove extraneous double ;s in sys/ 2020-03-30 16:04:25 +00:00
kaktus
ad355b0a9d Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
scottl
b2c4b49613 Ever since the block layer expanded its command syntax beyond just
BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in
drivers when the driver doesn't support a particular command.  Do a
sweep and fix that.

Reported by:	imp
2020-02-07 09:22:08 +00:00
mav
ef8d51daa1 Fix copy-paste bug in HMB free code.
MFC after:	2 weeks
X-MFC-with:	r356474
2020-01-08 18:26:23 +00:00
mav
8f7704790f Minor adjustments to r356474 and r356480.
Reported by:	jkim, imp
MFC after:	2 weeks
X-MFC-with:	r356474
2020-01-07 23:29:54 +00:00
mav
0c75e47646 Increate HMB limit from 1% to 5%.
SSD capacity in laptops is growing faster then RAM size, so my original
guess seems too low on second thought.  Hopefully nobody will build large
array of those crappy SSDs.

MFC after:	2 weeks
X-MFC-with:	356474
2020-01-07 23:10:38 +00:00
mav
a91b9f9262 Add Host Memory Buffer support to nvme(4).
This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about
1MB per 1GB on the devices I have) for its metadata cache, significantly
improving random I/O performance.  Device reports minimal and preferable
size of the buffer.  The code limits it to 1% of physical RAM by default.
If the buffer can not be allocated or below minimal size, the device will
just have to work without it.

MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
2020-01-07 21:17:11 +00:00
mmel
61e28a20a8 Properly synchronize completion DMA buffers.
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows to use NVMe disk on non-DMA coherent arm64 system.

MFC after:	3 weeks
2019-12-15 14:28:38 +00:00
imp
a3fcfb05ea Move to using bool instead of boolean_t
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999
2019-12-13 18:35:48 +00:00
imp
00ed045a2c Move reset to the interrutp processing stage
This trims the boot time a bit more for AWS and other platforms that have nvme
drives. There's no reason too do this inline. This has been in my tree a while,
but IIRC I talked to Jim Harris about this at one of our face to face meetings.

MFC After: 2 weeks
2019-12-11 22:51:02 +00:00
imp
ee3e21ead7 trackers always know what qpair they are on
Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on.  This will simplify code and make it easier to split
submission and completion queues in the future.

Signed-off-by: John Meneghini <johnm@netapp.com>
2019-12-06 22:12:39 +00:00
mav
37e8a0e005 Make nvme(4) driver some more NUMA aware.
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
 - Allocate most of queue pair memory from the domain it is bound to.
 - Bind callouts to the same CPUs as queue pair to avoid migrations.
 - Do not assign queue pairs to each SMT thread.  It just wasted
resources and increased lock congestions.
 - Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
 - If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-23 17:53:47 +00:00
imp
362659d41a Support doorbell strides != 0.
The NVMe standard (1.4) states

>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ...  For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.

However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.

MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
2019-09-04 20:08:36 +00:00
imp
854a74e65c Implement nvme suspend / resume for pci attachment
When we suspend, we need to properly shutdown the NVME controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutodown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.

On resume, we have to reset the card twice, for reasons described in
the attach funcion. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.

Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.

Rename nvme_ctrlr_destory_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions.  It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.

Split parts of the hardware reset function up so that I can
do part of the reset in suspsend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.

Finally, fix a couple of spelling errors in comments related to
this.

Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493
2019-09-03 15:26:11 +00:00
imp
2d33613528 In nvme_completion_poll, add a sanity check to make sure that we complete the
polling within a second. Panic if we don't. All the commands that use this
interface should typically complete within a few tens to hundreds of
microseconds. Panic rather than return ETIMEDOUT because if the command somehow
does later complete, it will randomly corrupt memory. Also, it helps to get a
traceback from where the unexpected failure happens, rather than an infinite
loop.
2019-09-02 17:11:32 +00:00