If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared. If still no luck, fall back to shared INTx.
This provides maximal flexibility in some limited scenarios. For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.
MFC after: 1 week
To guard against the ill effects of a spurious interrupt during
construction (or one that was bogusly pending), enable interrupts after
the qpair is completely constructed. Otherwise, we can die with null
pointer dereferences in nvme_qpair_process_completions. This has been
observed in at least one pre-release NVMe drive where the MSIX interrupt
fired while the queue was being created, before we'd started the NVMe
controller card.
The alternative of only turning on the interrupts after the rest was
tried, but was insufficient to work around this bug and made the code
more complicated w/o benefit.
Reviewed by: mav, chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31182
New Samsung 980 SSDs report Namespace Preferred Write Alignment of
8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB).
My quick tests show that 16KB is a minimal sequential write size
when the SSD reaches peak IOPS, so writing much less is very slow.
But writing slightly less or slightly more does not change much,
so it seems not so much a size granularity as minimum I/O size.
Thinking about different stripesize consumers:
- Partition alignment should be based on NPWA by definition.
- ZFS ashift in part of forcing alignment of all I/Os should also
be based on NPWA. In part of forcing size granularity, if really
needed, it may be set to NPWG, but too big value can make ZFS too
space-inefficient, and the 16KB is actually the biggest supported
value there now.
- ZFS recordsize/volblocksize could potentially be tuned up toward
NPWG to work as I/O size granularity, but enabled compression makes
it too fuzzy. And those are normally user-configurable things.
- ZFS I/O aggregation code could definitely use Optimal Write Size
value and may be NPWG, but we don't have fields in GEOM now to report
the minimal and optimal I/O sizes, and even maximal is not reported
outside GEOM DISK to be used by ZFS.
MFC after: 1 week
Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.
To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent . In addition we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.
Reviewed by: jrtc27@, chuck@
Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31002
Remove __packed from nvme_command, nvme_completion and
nvme_dsm_trim. Add super-alignment to nvme_completion since it's always
at least that aligned in hardware (and in our existing uses of it
embedded in structures). It generates better code in
nvme_qpair_process_completions on riscv64 because otherwise the ABI
assumes a 4-byte alignment, and the same on all other platforms.
Reviewed by: jrtc27@, mav@, chuck@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31001
Put the { on the same line as the struct nvme_foo when we define these
structures. It's FreeBSD standard and these were inconsistent.
Sponsored by: Netflix
Part of the nvme recovery process for errors is to reset the
card. Sometimes, this results in failing the entire controller. When nda
is in use, we free the sim, which will sleep until all the I/O has
completed. However, with only one thread, the request fail task never
runs once the reset thread sleeps here. Create two threads to allow I/O
to fail until it's all processed and the reset task can proceed.
This is a temporary kludge until I can work out questions that arose
during the review, not least is what was the race that queueing to a
failure task solved. The original commit is vague and other error paths
in the same context do a direct failure. I'll investigate that more
completely before committing changing that to a direct failure. mav@
raised this issue during the review, but didn't otherwise object.
Multiple threads, though, solve the problem in the mean time until other
such means can be perfected.
Reviewed by: jhb@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30366
Under geom(4) nvme_ns_bio_process() is on the path where sleep
is prohibited as g_io_shedule_down() calls THREAD_NO_SLEEPNG()
before geom->start().
Reviewed By: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29539
In some cases like broken hardware nvme(4) may wait minutes for
controller response before timeout. Doing so in a tight spin loop
made whole system unresponsive.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29309
Sponsored by: iXsystems, Inc.
nvme drives are configured early in boot. However, a number of the configuration
steps takes which take a while, so we defer those to a config intrhook that runs
before the root filesystem is mounted. At the same time, the PCI hot plug wakes
up and tests the status of the card. It may decide that the card has gone away
and deletes the child. As part of that process nvme_detach is called. If this
call happens after the config_intrhook starts to run, but before it is finished,
there's a race where we can tear down the device's soft state while the
config_intrhook is still using it. Use the new config_intrhook_drain to
disestablish the hook. Either it will be removed w/o running, or the routine
will wait for it to finish. This closes the race and allows safe hotplug at any
time, even very early in boot.
Sponsored by: Netflix, Inc
Reviewed by: jhb, mav
Differential Revision: https://reviews.freebsd.org/D29006
The NVMe byte-swap routines for big-endian platforms used memcpy() to
move the unaligned 64-bit value into a temp register to byte swap it.
Instead of introducing a dependency, manually byte-swap the values in
place.
Point hat: me
Add decoding of the Device Self-test log page and the ability to start
or abort a test.
Reviewed by: imp, mav
Tested by: Muhammad Ahmad <muhammad.ahmad@seagate.com>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D27517
bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can
be DMA'd immediately after it returns. The details differ, but this
mirrors atomic thread release semantics, at least for the buffers
synced.
For non-x86 platforms, bus_dmamap_sync() has the right syncing and
fences. So in the past, wmb() had been omitted for them.
For x86 platforms, the memory ordering is already strong enough to
ensure DMA to the device sees the current contents. As such, we don't
need the wmb() here. It translates to an sfence which is only needed
for writes to regions that have the write combining attribute set or
when some exotic opcodes are used. The nvme driver does neither of
these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we
can be assured any optimizer won't reorder the bus_dmamap_sync and the
bus_space_write operations. The wmb() was a vestiage of the pre-busdma
version initially committed to the tree.
Reviewed by: kib@, gallatin@, chuck@, mav@
Differential Revision: https://reviews.freebsd.org/D27448
- in nvme_qpair_process_completions() do dma sync before completion buffer
is used.
- in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm
and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure
that all CPU stores are visible to external (including DMA) observers.
- Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems,
buffers continuously owned (and accessed) by DMA must be allocated with this
flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems
(or coherent buses in mixed systems).
MFC after: 4 weeks
Reviewed by: mav, imp
Differential Revision: https://reviews.freebsd.org/D27446
These swapping functions violate BUSDMA contract - we cannot write
to armed (by bus_dmamap_sync(PRE_..)) buffers. Remove them at least
from little endian machines until a better solution will be developed.
Reviewed by: imp
MFC after: 3 weeks
With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.
On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs. The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
Both the size (128 bytes) and ephemeral nature of allocations make it a great
fit for malloc.
A dedicated zone unnecessarily avoids sharing buckets with 128-byte objects.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D27103
Remove code that supported pre-2011 kernels. CTLTYPE_S64 was defined
in rev 217616. All supported branches have it, so remove its compat
definition as OBE.
Instead, add arguments to vmapbuf. Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)
In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.
Suggested by: jhb
Reviewed by: imp, jhb
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26784
This field was not in specs when the driver was written, but now there
are SSDs with the reported latency of 10s, where hardcoded value of 5s
seems to be not enough sometimes, causing shutdown timeout messages.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplyify the kernel memory allocator interfaces a bit.
Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937
Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests. If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.
In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
All callers are currently filtering bad nsid to this function,
however, we'll have undefined behavior if that's not true. Add the
KASSERT to prevent that.