freebsd-nq

Author	SHA1	Message	Date
Warner Losh	f93b7f954e	Support doorbell strides != 0. The NVMe standard (1.4) states >>> 8.6 Doorbell Stride for Software Emulation >>> The doorbell stride,...is useful in software emulation of an NVM >>> Express controller. ... For hardware implementations of the NVM >>> Express interface, the expected doorbell stride value is 0h. However, hardware in the wild exists with a doorbell stride of 1 (meaning 8 byte separation). This change supports that hardware, as well as software emulators as envisioned in Section 8.6. Since this is the fast path, care has been taken to make this computation efficient. The bit of math to compute an offset for each is replaced by a memory load from cache of a pre-computed value. MFC After: 3 days Reviewed by: scottl@ Differential Revision: https://reviews.freebsd.org/D21514	2019-09-04 20:08:36 +00:00
Warner Losh	4d5475613e	Implement nvme suspend / resume for pci attachment When we suspend, we need to properly shutdown the NVME controller. The controller may go into D3 state (or may have the power removed), and to properly flush the metadata to non-volatile RAM, we must complete a normal shutdown. This consists of deleting the I/O queues and setting the shutodown bit. We have to do some extra stuff to make sure we reset the software state of the queues as well. On resume, we have to reset the card twice, for reasons described in the attach funcion. Once we've done that, we can restart the card. If any of this fails, we'll fail the NVMe card, just like we do when a reset fails. Set is_resetting for the duration of the suspend / resume. This keeps the reset taskqueue from running a concurrent reset, and also is needed to prevent any hw completions from queueing more I/O to the card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get that from the global state of the ctrlr. Wait for any pending reset to finish. All queued I/O will get sent to the hardware as part of nvme_ctrlr_start(), though the upper layers shouldn't send any down. Disabling the qpairs is the other failsafe to ensure all I/O is queued. Rename nvme_ctrlr_destory_qpairs to nvme_ctrlr_delete_qpairs to avoid confusion with all the other destroy functions. It just removes the queues in hardware, while the other _destroy_ functions tear down driver data structures. Split parts of the hardware reset function up so that I can do part of the reset in suspsend. Split out the software disabling of the qpairs into nvme_ctrlr_disable_qpairs. Finally, fix a couple of spelling errors in comments related to this. Relnotes: Yes MFC After: 1 week Reviewed by: scottl@ (prior version) Differential Revision: https://reviews.freebsd.org/D21493	2019-09-03 15:26:11 +00:00
Warner Losh	31b11bb3f2	In nvme_completion_poll, add a sanity check to make sure that we complete the polling within a second. Panic if we don't. All the commands that use this interface should typically complete within a few tens to hundreds of microseconds. Panic rather than return ETIMEDOUT because if the command somehow does later complete, it will randomly corrupt memory. Also, it helps to get a traceback from where the unexpected failure happens, rather than an infinite loop.	2019-09-02 17:11:32 +00:00
Warner Losh	ab0681aac9	In all the places that we use the polled for completion interface, except crash dump support code, move the while loop into an inline function. These aren't done in the fast path, so if the compiler choses to not inline, any performance hit is tiny.	2019-09-02 17:11:27 +00:00
Warner Losh	fc68da4b4d	Add a brief comment explaining why we can return ETIMEDOUT from the call to the polled interface. Normally this would have the potential to corrupt stack memory because the completion routines would run after we return. In this case, however, we're doing a dump so it's safe for reasons explained in the comment.	2019-09-02 17:10:46 +00:00
Warner Losh	5f9e856e3a	It turns out the duplication is only mostly harmless. While it worked with the kenrel, it wasn't working with the loader. It failed to handle dependencies correctly. The reason for that is that we never created a nvme module with the DRIVER_MODULE, but instead a nvme_pci and nvme_ahci module. Create a real nvme module that nvd can be dependent on so it can import the nvme symbols it needs from there. Arguably, nvd should just be a simple child of nvme, but transitioning to that (and winning that argument given why it was done this way) is beyond the scope of this change. Reviewed by: jhb@ Differential Revision: https://reviews.freebsd.org/D21382	2019-08-23 22:52:58 +00:00
Warner Losh	8e61280bd9	When we have errors resetting the device before we allocate the queues, don't try to tear them down in the ctrlr_destroy path. Otherwise, we dereference queue structures that are NULL and we trap. This fix is incomplete: we leak IRQ and MSI resources when this happens. That's preferable to a crash but still should be fixed.	2019-08-22 21:56:11 +00:00
Warner Losh	2d43fab9c2	We need to define version 1 of nvme, not nvme_foo. Otherwise nvd won't load and people who pull in nvme/nvd from modules can't load nvd.ko since it depends on nvme, not nvme_foo. The duplicate doesn't matter since kldxref properly handles that case.	2019-08-22 21:12:51 +00:00
Warner Losh	ec743e0c33	Move releasing of resources to later Turn off bus master after we detach the device (to match the prior order). Release MSI after we're done detaching and have turned off all the interrupts. Otherwise this may cause problems as other threads race nvme_detach. This more closely matches the old order. Reviewed by: mav@	2019-08-22 20:09:32 +00:00
Warner Losh	acc48026b3	Remove stray line that was duplicated. Noticed by: rpokala@	2019-08-22 02:53:51 +00:00
Warner Losh	93289cfcd2	Create a AHCI attachment for nvme. Intel has created RST and many laptops from vendors like Lenovo and Asus. It's a mechanism for creating multiple boot devices under windows. It effectively hides the nvme drive inside of the ahci controller. The details are supposed to be a trade secret. However, there's a reverse engineered Linux driver, and this implements similar operations to allow nvme drives to attach. The ahci driver attaches nvme children that proxy the remapped resources to the child. nvme_ahci is just like nvme_pci, except it doesn't do the PCI specific things. That's moved into ahci where appropriate. When the nvme drive is remapped, MSI-x interrupts aren't forwarded (the linux driver doesn't know how to use this either). INTx interrupts are used instead. This is suboptimal, but usually sufficient for the laptops these parts are in. This is based loosely on https://www.spinics.net/lists/linux-ide/msg53364.html submitted, but not accepted by, Linux. It was written by Dan Williams. These changes were written from scratch by Olivier Houchard. Submitted by: cognet@ (Olivier Houchard)	2019-08-21 22:18:01 +00:00
Warner Losh	f182f928db	Separate the pci attachment from the rest of nvme Nvme drives can be attached in a number of different ways. Separate out the PCI attachment so that we can have other attachment types, like ahci and various types of NVMeoF. Submitted by: cognet@	2019-08-21 22:17:55 +00:00
Alexander Motin	71a2818142	Improve NVMe hot unplug handling. If device is unplugged from the system (CSTS register reads return 0xffffffff), it makes no sense to send any more recovery requests or expect any responses back. If there is a detach call in such state, just stop all activity and free resources. If there is no detach call (hot-plug is not supported), rely on normal timeout handling, but when it trigger controller reset, do not wait for impossible and quickly report failure. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-21 20:17:30 +00:00
Alexander Motin	51b92c1af6	Formalize NVMe controller consumer life cycle. This fixes possible double call of fail_fn, for example on hot removal. It also allows ctrlr_fn to safely return NULL cookie in case of failure and not get useless ns_fn or fail_fn call with NULL cookie later. MFC after: 2 weeks	2019-08-21 02:17:39 +00:00
Alexander Motin	97be8b969d	Report NOIOB and NPWG fields as stripe size. Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace Preferred Write Granularity added in 1.4 allow upper layers to align I/Os for improved SSD performance and endurance. I don't have hardware reportig those yet, but NPWG could probably be reported by bhyve. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-14 16:12:03 +00:00
Alexander Motin	70d20ed34f	Add `nvmecontrol resv` to handle NVMe reservations. NVMe reservations are quite alike to SCSI persistent reservations and can be used in clustered setups with shared multiport storage. MFC after: 10 days Relnotes: yes Sponsored by: iXsystems, Inc.	2019-08-05 17:36:00 +00:00
Alexander Motin	a6d222eb68	Add more random bits from NVMe 1.4. MFC after: 2 weeks	2019-08-03 02:36:35 +00:00
Alexander Motin	6c99d1325e	Decode few more NVMe log pages. In particular: Changed Namespace List, Commands Supported and Effects, Reservation Notification, Sanitize Status. Add few new arguments to `nvmecontrol log` subcommand. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-02 20:16:21 +00:00
Alexander Motin	8dafbebdd7	Fix typo in r350529. MFC after: 2 weeks	2019-08-02 04:04:18 +00:00
Alexander Motin	90dfa8f0ac	Add more new fields and values from NVMe 1.4. MFC after: 2 weeks	2019-08-02 03:43:24 +00:00
Alexander Motin	a7bf63be69	Add IOCTL to translate nvdX into nvmeY and NSID. While very useful by itself, it also makes `nvmecontrol` not depend on hardcoded device names parsing, that in its turn makes simple to take nvdX (and potentially any other) device names as arguments. Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them interchangeable for management purposes. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-01 21:44:07 +00:00
Alexander Motin	8de2d8c009	Add some new fields and bits from NVMe 1.4. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-07-29 03:28:46 +00:00
Warner Losh	08a607e0f3	Widen the type for to. The timeout field in the CAPS register is defined to be 8 bits, so its type was uint8_t. We recently started adding 1 to it to cope with rogue devices that listed 0 timeout time (which is impossible). However, in so doing, other devices that list 0xff (for a 2 minute timeout) were broken when adding 1 overflowed. Widen the type to be uint32_t like its source register to avoid the issue. Reported by: bapt@	2019-07-25 20:26:21 +00:00
Warner Losh	5e83c2ffaa	Keep track of the number of commands that exhaust their retry limit. While we print failure messages on the console, sometimes logs are lost or overwhelmed. Keeping a count of how many times we've failed retriable commands helps get a magnitude of the problem.	2019-07-19 18:39:24 +00:00
Warner Losh	c37fc318c4	Keep track of the number of retried commands. Retried commands can indicate a performance degredation of an nvme drive. Keep track of the number of retries and report it out via sysctl, just like number of commands an interrupts.	2019-07-19 18:39:18 +00:00
Warner Losh	1071b50a65	Use sysctl + CTLRWTUN for hw.nvme.verbose_cmd_dump. Also convert it to a bool. While the rest of the driver isn't yet bool clean, this will help. Reviewed by: cem@ Differential Revision: https://reviews.freebsd.org/D20988	2019-07-19 00:32:56 +00:00
Warner Losh	c75bdc044d	Provide new tunable hw.nvme.verbose_cmd_dump The nvme drive dumps only the most relevant details about a command when it fails. However, there are times this is not sufficient (such as debugging weird issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1 in loader.conf will enable more complete debugging information about each command that fails. Reviewed by: rpokala Sponsored by: Netflix Differential Version: https://reviews.freebsd.org/D20988	2019-07-18 21:58:51 +00:00
Warner Losh	62d2cf1847	Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers. These macros make places where we extract these easier to read. The shift and mask stuff is also a bit tedious and error prone. Start with the CAP_LO and CAP_HI registers since their scope is somewhat constrained. This is style chagne only, no functional changes. Reviewed by: chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20979	2019-07-18 15:41:10 +00:00
Warner Losh	204498d7c2	Remove now-obsolete comment.	2019-07-17 20:43:14 +00:00
Warner Losh	dc9df3a59d	Assume that the timeout value from the capacity is 1-based Neither the 1.3 or 1.4 standards say this number is 1's based, but adding 1 costs little and copes with those NVMe drives that report '0' in this field cheaply. This is consistent with what the Linux driver does as well.	2019-07-16 22:55:30 +00:00
Chuck Tuffli	b1f1471064	Fix nda(4) PCIe link status output Differentiate between PCI Express Endpoint devices and Root Complex Integrated Endpoints in the nda driver. The Link Status and Capability registers are not valid for Integrated Endpoints and should not be displayed. The bhyve emulated NVMe device will advertise as being an Integrated Endpoint. Reviewed by: imp Approved byL imp (mentor) Differential Revision: https://reviews.freebsd.org/D20282	2019-06-07 18:34:48 +00:00
Warner Losh	d0aaeffdb4	Since a fatal trap can happen at aribtrary times, don't panic when the completions are not in a consistent state. Cope with the different places the normal I/O completion polling thread can be interrupted and then re-entered during a kernel panic + dump. Reviewed by: jhb and markj (both prior versions) Differential Revision: https://reviews.freebsd.org/D20478	2019-06-01 15:37:44 +00:00
Warner Losh	9835d216d8	rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to match init/uninit scenarios. Signed-off-by: John Meneghini <johnm@netapp.com> Submitted by: Michael Hordijk <hordijk@netapp.com> Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D19781	2019-05-08 20:18:11 +00:00
Alexander Motin	1aed499575	Decode Deallocate Logical Block Features. MFC after: 1 week	2019-05-05 15:47:21 +00:00
Warner Losh	2ffd6fce5b	Don't print all the I/O we abort on a reset, unless we're out of retries. When resetting the controller, we abort I/O. Prior to this fix, we printed a ton of abort messages for I/O that we're going to retry. This imparts no useful information. Stop printing them unless our retry count is exhausted. Clarify code for when we don't retry, and remove useless arg to a routine that's always called with it as 'true'. All the other debug is still printed (including multiple reset messages if we have multiple timeouts before the taskqueue runs the actual reset) so that we know when we reset. Reviewed by: jimharris@, chuck@ Differential Revision: https://reviews.freebsd.org/D19431	2019-03-09 01:18:16 +00:00
Warner Losh	95108cadbc	Add ABORTED_BY_REQUEST to the list of things we look at DNR bit and tell why to comment (code already does this)	2019-03-03 03:36:33 +00:00
Warner Losh	45d7e233a5	Unconditionally support unmapped BIOs. This was another shim for supporting older kernels. However, all supported versions of FreeBSD have unmapped I/Os (as do several that have gone EOL), remove it. It's unlikely the driver would work on the older kernels anyway at this point.	2019-02-27 22:16:59 +00:00
Warner Losh	d706306d49	Remove #ifdef code to support FreeBSD versions that haven't been supported in years. A number of changes have been made to the driver that likely wouldn't work on those older versions that aren't properly ifdef'd and it's project policy to GC such code once it is stale.	2019-02-27 22:05:01 +00:00
Warner Losh	52467047aa	Regularize the Netflix copyright Use recent best practices for Copyright form at the top of the license: 1. Remove all the All Rights Reserved clauses on our stuff. Where we piggybacked others, use a separate line to make things clear. 2. Use "Netflix, Inc." everywhere. 3. Use a single line for the copyright for grep friendliness. 4. Use date ranges in all places for our stuff. Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)	2019-02-04 21:28:25 +00:00
Gleb Smirnoff	756a541279	Allocate pager bufs from UMA instead of 80-ish mutex protected linked list. o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many pbufs are we going to have set. In various subsystems that are going to utilize pbufs create private zones via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(), and sets a limit on created zone. After startup preallocate pbufs according to requirements of all pbuf zones. Subsystems that used to have a private limit with old allocator now have private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS, swap, vnode pager. The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9), aio(4). They should have their private limits, but changing that is out of scope of this commit. o Fetch tunable value of kern.nswbuf from init_param2() and while here move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only this option. Default values aren't touched by this commit, but they probably should be reviewed wrt to modern hardware. This change removes a tight bottleneck from sendfile(2) operation, that uses pbufs in vnode pager. Other pagers also would benefit from faster allocation. Together with: gallatin Tested by: pho	2019-01-15 01:02:16 +00:00
Chuck Tuffli	5312bd8eb0	Add NVMe drive to NOIOB quirk list Dell-branded Intel P4600 NVMe drives benefit from NVMe 1.3's NOIOB feature. Unfortunately just like Intel DC P4500s, they don't advertise themselves as benefiting from this... This changes adds P4600s to the existing list of old drives which benefit from striping. PR: 233969 Submitted by: David Fugate <dave.fugate@gmail.com> Reviewed by: imp, mav Approved by: imp (mentor) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18772	2019-01-08 15:30:56 +00:00
Alexander Motin	a646135771	Add descriptions to NVMe interrupts. MFC after: 1 month	2018-12-26 23:41:52 +00:00
Alexander Motin	511662d083	Remove CAM SIM lock from NVMe SIM. CAM does not require SIM lock since FreeBSD 10.4, and NVMe code never required it at all, using per-queue locks instead. This formally allows parallel request submission in CAM mode as much as single per-device and per-queue locks of CAM allow. MFC after: 1 month	2018-12-24 23:28:11 +00:00
Chuck Tuffli	87b3975e36	nda(4) fix check for Dataset Management support In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support BIO_DELETE) if the drive supports Dataset Management. There are reports that without this check, VMWare Workstation does not work reliably. Fix is to check the ONCS field in the NVMe Controller Data structure for support. This check previously existed but did not survive the big-endian changes. Reported by: yuripv@yuripv.net Reviewed by: imp, mav, jimharris Approved by: imp (mentor) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D18493	2018-12-13 13:25:37 +00:00
Warner Losh	91182bcfb6	Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli (and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3. Sponsored by: Netflix	2018-12-07 21:58:08 +00:00
Warner Losh	14343799dc	Remove do-nothing nvme_modevent. nvme_modevent no longer does anything interesting, remove it. Sponsored by: Netflix	2018-11-16 16:51:44 +00:00
Warner Losh	85939654f3	Use atomic_load_acq_int() here too to poll done, ala r328521	2018-11-13 22:41:20 +00:00
Warner Losh	09efa3dfb2	Put a workaround in for command timeout malfunctioning At least one NVMe drive has a bug that makeing the Command Time Out PCIe feature unreliable. The workaround is to disable this feature. The driver wouldn't deal correctly with a timeout anyway. Only do this for drives that are known bad. Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D17708	2018-10-26 14:27:37 +00:00
Chuck Tuffli	9544e6dcf1	Make NVMe compatible with the original API The original NVMe API used bit-fields to represent fields in data structures defined by the specification (e.g. the op-code in the command data structure). The implementation targeted x86_64 processors and defined the bit fields for little endian dwords (i.e. 32 bits). This approach does not work as-is for big endian architectures and was changed to use a combination of bit shifts and masks to support PowerPC. Unfortunately, this changed the NVMe API and forces #ifdef's based on the OS revision level in user space code. This change reverts to something that looks like the original API, but it uses bytes instead of bit-fields inside the packed command structure. As a bonus, this works as-is for both big and little endian CPU architectures. Bump __FreeBSD_version to 1200081 due to API change Reviewed by: imp, kbowling, smh, mav Approved by: imp (mentor) Differential Revision: https://reviews.freebsd.org/D16404	2018-08-22 04:29:24 +00:00
Justin Hibbits	2e0090af65	nvme(4): Add bus_dmamap_sync() at the end of the request path Summary: Some architectures, in this case powerpc64, need explicit synchronization barriers vs device accesses. Prior to this change, when running 'make buildworld -j72' on a 18-core (72-thread) POWER9, I would see controller resets often. With this change, I don't see these resets messages, though another tester still does, for yet to be determined reasons, so this may not be a complete fix. Additionally, I see a ~5-10% speed up in buildworld times, likely due to not needing to reset the controller. Reviewed By: jimharris Differential Revision: https://reviews.freebsd.org/D16570	2018-08-03 20:04:06 +00:00

1 2 3 4 5

250 Commits