BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in
drivers when the driver doesn't support a particular command. Do a
sweep and fix that.
Reported by: imp
SSD capacity in laptops is growing faster then RAM size, so my original
guess seems too low on second thought. Hopefully nobody will build large
array of those crappy SSDs.
MFC after: 2 weeks
X-MFC-with: 356474
This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about
1MB per 1GB on the devices I have) for its metadata cache, significantly
improving random I/O performance. Device reports minimal and preferable
size of the buffer. The code limits it to 1% of physical RAM by default.
If the buffer can not be allocated or below minimal size, the device will
just have to work without it.
MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows to use NVMe disk on non-DMA coherent arm64 system.
MFC after: 3 weeks
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.
Differential Revision: https://reviews.freebsd.org/D20999
This trims the boot time a bit more for AWS and other platforms that have nvme
drives. There's no reason too do this inline. This has been in my tree a while,
but IIRC I talked to Jim Harris about this at one of our face to face meetings.
MFC After: 2 weeks
Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on. This will simplify code and make it easier to split
submission and completion queues in the future.
Signed-off-by: John Meneghini <johnm@netapp.com>
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestions.
- Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
The NVMe standard (1.4) states
>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.
However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.
MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
When we suspend, we need to properly shutdown the NVME controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutodown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.
On resume, we have to reset the card twice, for reasons described in
the attach funcion. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.
Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.
Rename nvme_ctrlr_destory_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions. It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.
Split parts of the hardware reset function up so that I can
do part of the reset in suspsend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.
Finally, fix a couple of spelling errors in comments related to
this.
Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493
polling within a second. Panic if we don't. All the commands that use this
interface should typically complete within a few tens to hundreds of
microseconds. Panic rather than return ETIMEDOUT because if the command somehow
does later complete, it will randomly corrupt memory. Also, it helps to get a
traceback from where the unexpected failure happens, rather than an infinite
loop.
dump support code, move the while loop into an inline function. These aren't
done in the fast path, so if the compiler choses to not inline, any performance
hit is tiny.
polled interface. Normally this would have the potential to corrupt stack memory
because the completion routines would run after we return. In this case,
however, we're doing a dump so it's safe for reasons explained in the comment.
While it worked with the kenrel, it wasn't working with the loader.
It failed to handle dependencies correctly. The reason for that is
that we never created a nvme module with the DRIVER_MODULE, but
instead a nvme_pci and nvme_ahci module. Create a real nvme module
that nvd can be dependent on so it can import the nvme symbols it
needs from there.
Arguably, nvd should just be a simple child of nvme, but transitioning
to that (and winning that argument given why it was done this way) is
beyond the scope of this change.
Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D21382
queues, don't try to tear them down in the ctrlr_destroy
path. Otherwise, we dereference queue structures that are NULL and we
trap.
This fix is incomplete: we leak IRQ and MSI resources when this
happens. That's preferable to a crash but still should be fixed.
load and people who pull in nvme/nvd from modules can't load nvd.ko
since it depends on nvme, not nvme_foo. The duplicate doesn't matter
since kldxref properly handles that case.
Turn off bus master after we detach the device (to match the prior
order). Release MSI after we're done detaching and have turned off
all the interrupts. Otherwise this may cause problems as other threads
race nvme_detach. This more closely matches the old order.
Reviewed by: mav@
Intel has created RST and many laptops from vendors like Lenovo and Asus. It's a
mechanism for creating multiple boot devices under windows. It effectively hides
the nvme drive inside of the ahci controller. The details are supposed to be a
trade secret. However, there's a reverse engineered Linux driver, and this
implements similar operations to allow nvme drives to attach. The ahci driver
attaches nvme children that proxy the remapped resources to the child. nvme_ahci
is just like nvme_pci, except it doesn't do the PCI specific things. That's
moved into ahci where appropriate.
When the nvme drive is remapped, MSI-x interrupts aren't forwarded (the linux
driver doesn't know how to use this either). INTx interrupts are used
instead. This is suboptimal, but usually sufficient for the laptops these parts
are in.
This is based loosely on https://www.spinics.net/lists/linux-ide/msg53364.html
submitted, but not accepted by, Linux. It was written by Dan Williams. These
changes were written from scratch by Olivier Houchard.
Submitted by: cognet@ (Olivier Houchard)
Nvme drives can be attached in a number of different ways. Separate out the PCI
attachment so that we can have other attachment types, like ahci and various
types of NVMeoF.
Submitted by: cognet@
If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This fixes possible double call of fail_fn, for example on hot removal.
It also allows ctrlr_fn to safely return NULL cookie in case of failure
and not get useless ns_fn or fail_fn call with NULL cookie later.
MFC after: 2 weeks
Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace
Preferred Write Granularity added in 1.4 allow upper layers to align
I/Os for improved SSD performance and endurance.
I don't have hardware reportig those yet, but NPWG could probably be
reported by bhyve.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
NVMe reservations are quite alike to SCSI persistent reservations and
can be used in clustered setups with shared multiport storage.
MFC after: 10 days
Relnotes: yes
Sponsored by: iXsystems, Inc.
In particular: Changed Namespace List, Commands Supported and Effects,
Reservation Notification, Sanitize Status.
Add few new arguments to `nvmecontrol log` subcommand.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While very useful by itself, it also makes `nvmecontrol` not depend on
hardcoded device names parsing, that in its turn makes simple to take
nvdX (and potentially any other) device names as arguments.
Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them
interchangeable for management purposes.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
The timeout field in the CAPS register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed 0 timeout time (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1
overflowed. Widen the type to be uint32_t like its source register to avoid the
issue.
Reported by: bapt@
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.
Retried commands can indicate a performance degredation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands an interrupts.
Also convert it to a bool. While the rest of the driver isn't yet bool clean,
this will help.
Reviewed by: cem@
Differential Revision: https://reviews.freebsd.org/D20988
The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.
Reviewed by: rpokala
Sponsored by: Netflix
Differential Version: https://reviews.freebsd.org/D20988
These macros make places where we extract these easier to read. The shift and
mask stuff is also a bit tedious and error prone. Start with the CAP_LO and
CAP_HI registers since their scope is somewhat constrained. This is style
chagne only, no functional changes.
Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20979
Neither the 1.3 or 1.4 standards say this number is 1's based, but adding 1
costs little and copes with those NVMe drives that report '0' in this field
cheaply. This is consistent with what the Linux driver does as well.
Differentiate between PCI Express Endpoint devices and Root Complex
Integrated Endpoints in the nda driver. The Link Status and Capability
registers are not valid for Integrated Endpoints and should not be
displayed. The bhyve emulated NVMe device will advertise as being an
Integrated Endpoint.
Reviewed by: imp
Approved byL imp (mentor)
Differential Revision: https://reviews.freebsd.org/D20282
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.
Reviewed by: jhb and markj (both prior versions)
Differential Revision: https://reviews.freebsd.org/D20478
Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to
match init/uninit scenarios.
Signed-off-by: John Meneghini <johnm@netapp.com>
Submitted by: Michael Hordijk <hordijk@netapp.com>
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D19781
retries.
When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.
Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.
Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho