freebsd-dev

Author	SHA1	Message	Date
Scott Long	d176b8039e	Ever since the block layer expanded its command syntax beyond just BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in drivers when the driver doesn't support a particular command. Do a sweep and fix that. Reported by: imp	2020-02-07 09:22:08 +00:00
Alexander Motin	b2cdfb72f4	Fix copy-paste bug in HMB free code. MFC after: 2 weeks X-MFC-with: r356474	2020-01-08 18:26:23 +00:00
Alexander Motin	6de4e458fa	Minor adjustments to r356474 and r356480. Reported by: jkim, imp MFC after: 2 weeks X-MFC-with: r356474	2020-01-07 23:29:54 +00:00
Alexander Motin	1c7dd40e58	Increase HMB limit from 1% to 5%. SSD capacity in laptops is growing faster than RAM size, so my original guess seems too low on second thought. Hopefully nobody will build large array of those crappy SSDs. MFC after: 2 weeks X-MFC-with: 356474	2020-01-07 23:10:38 +00:00
Alexander Motin	67abaee9fc	Add Host Memory Buffer support to nvme(4). This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about 1MB per 1GB on the devices I have) for its metadata cache, significantly improving random I/O performance. Device reports minimal and preferable size of the buffer. The code limits it to 1% of physical RAM by default. If the buffer can not be allocated or below minimal size, the device will just have to work without it. MFC after: 2 weeks Relnotes: yes Sponsored by: iXsystems, Inc.	2020-01-07 21:17:11 +00:00
Michal Meloun	0a4b14e8cc	Properly synchronize completion DMA buffers. Within command completion processing the callback function may access DMAed data buffer. Synchronize it before use, not after. This allows to use NVMe disk on non-DMA coherent arm64 system. MFC after: 3 weeks	2019-12-15 14:28:38 +00:00
Warner Losh	7588c6cc36	Move to using bool instead of boolean_t While there are subtle semantic differences between bool and boolean_t, none of them matter in these cases. Prefer true/false when dealing with bool type. Preserve a couple of TRUEs since they are passed into int args into CAM. Preserve a couple of FALSEs when used for status.done, an int. Differential Revision: https://reviews.freebsd.org/D20999	2019-12-13 18:35:48 +00:00
Warner Losh	66e5985084	Move reset to the interrupt processing stage This trims the boot time a bit more for AWS and other platforms that have nvme drives. There's no reason to do this inline. This has been in my tree a while, but IIRC I talked to Jim Harris about this at one of our face to face meetings. MFC After: 2 weeks	2019-12-11 22:51:02 +00:00
Warner Losh	43393e8b2c	trackers always know what qpair they are on Don't needlessly pass around qpair pointers when the tracker knows what qpair it's on. This will simplify code and make it easier to split submission and completion queues in the future. Signed-off-by: John Meneghini <johnm@netapp.com>	2019-12-06 22:12:39 +00:00
Alexander Motin	1eab19cbec	Make nvme(4) driver some more NUMA aware. - For each queue pair precalculate CPU and domain it is bound to. If queue pairs are not per-CPU, then use the domain of the device. - Allocate most of queue pair memory from the domain it is bound to. - Bind callouts to the same CPUs as queue pair to avoid migrations. - Do not assign queue pairs to each SMT thread. It just wasted resources and increased lock congestions. - Remove fixed multiplier of CPUs per queue pair, spread them even. This allows to use more queue pairs in some hardware configurations. - If queue pair serves multiple CPUs, bind different NVMe devices to different CPUs. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-23 17:53:47 +00:00
Warner Losh	f93b7f954e	Support doorbell strides != 0. The NVMe standard (1.4) states >>> 8.6 Doorbell Stride for Software Emulation >>> The doorbell stride,...is useful in software emulation of an NVM >>> Express controller. ... For hardware implementations of the NVM >>> Express interface, the expected doorbell stride value is 0h. However, hardware in the wild exists with a doorbell stride of 1 (meaning 8 byte separation). This change supports that hardware, as well as software emulators as envisioned in Section 8.6. Since this is the fast path, care has been taken to make this computation efficient. The bit of math to compute an offset for each is replaced by a memory load from cache of a pre-computed value. MFC After: 3 days Reviewed by: scottl@ Differential Revision: https://reviews.freebsd.org/D21514	2019-09-04 20:08:36 +00:00
Warner Losh	4d5475613e	Implement nvme suspend / resume for pci attachment When we suspend, we need to properly shutdown the NVME controller. The controller may go into D3 state (or may have the power removed), and to properly flush the metadata to non-volatile RAM, we must complete a normal shutdown. This consists of deleting the I/O queues and setting the shutdown bit. We have to do some extra stuff to make sure we reset the software state of the queues as well. On resume, we have to reset the card twice, for reasons described in the attach function. Once we've done that, we can restart the card. If any of this fails, we'll fail the NVMe card, just like we do when a reset fails. Set is_resetting for the duration of the suspend / resume. This keeps the reset taskqueue from running a concurrent reset, and also is needed to prevent any hw completions from queueing more I/O to the card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get that from the global state of the ctrlr. Wait for any pending reset to finish. All queued I/O will get sent to the hardware as part of nvme_ctrlr_start(), though the upper layers shouldn't send any down. Disabling the qpairs is the other failsafe to ensure all I/O is queued. Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid confusion with all the other destroy functions. It just removes the queues in hardware, while the other _destroy_ functions tear down driver data structures. Split parts of the hardware reset function up so that I can do part of the reset in suspend. Split out the software disabling of the qpairs into nvme_ctrlr_disable_qpairs. Finally, fix a couple of spelling errors in comments related to this. Relnotes: Yes MFC After: 1 week Reviewed by: scottl@ (prior version) Differential Revision: https://reviews.freebsd.org/D21493	2019-09-03 15:26:11 +00:00
Warner Losh	31b11bb3f2	In nvme_completion_poll, add a sanity check to make sure that we complete the polling within a second. Panic if we don't. All the commands that use this interface should typically complete within a few tens to hundreds of microseconds. Panic rather than return ETIMEDOUT because if the command somehow does later complete, it will randomly corrupt memory. Also, it helps to get a traceback from where the unexpected failure happens, rather than an infinite loop.	2019-09-02 17:11:32 +00:00
Warner Losh	ab0681aac9	In all the places that we use the polled for completion interface, except crash dump support code, move the while loop into an inline function. These aren't done in the fast path, so if the compiler chooses to not inline, any performance hit is tiny.	2019-09-02 17:11:27 +00:00
Warner Losh	fc68da4b4d	Add a brief comment explaining why we can return ETIMEDOUT from the call to the polled interface. Normally this would have the potential to corrupt stack memory because the completion routines would run after we return. In this case, however, we're doing a dump so it's safe for reasons explained in the comment.	2019-09-02 17:10:46 +00:00
Warner Losh	5f9e856e3a	It turns out the duplication is only mostly harmless. While it worked with the kernel, it wasn't working with the loader. It failed to handle dependencies correctly. The reason for that is that we never created a nvme module with the DRIVER_MODULE, but instead a nvme_pci and nvme_ahci module. Create a real nvme module that nvd can be dependent on so it can import the nvme symbols it needs from there. Arguably, nvd should just be a simple child of nvme, but transitioning to that (and winning that argument given why it was done this way) is beyond the scope of this change. Reviewed by: jhb@ Differential Revision: https://reviews.freebsd.org/D21382	2019-08-23 22:52:58 +00:00
Warner Losh	8e61280bd9	When we have errors resetting the device before we allocate the queues, don't try to tear them down in the ctrlr_destroy path. Otherwise, we dereference queue structures that are NULL and we trap. This fix is incomplete: we leak IRQ and MSI resources when this happens. That's preferable to a crash but still should be fixed.	2019-08-22 21:56:11 +00:00
Warner Losh	2d43fab9c2	We need to define version 1 of nvme, not nvme_foo. Otherwise nvd won't load and people who pull in nvme/nvd from modules can't load nvd.ko since it depends on nvme, not nvme_foo. The duplicate doesn't matter since kldxref properly handles that case.	2019-08-22 21:12:51 +00:00
Warner Losh	ec743e0c33	Move releasing of resources to later Turn off bus master after we detach the device (to match the prior order). Release MSI after we're done detaching and have turned off all the interrupts. Otherwise this may cause problems as other threads race nvme_detach. This more closely matches the old order. Reviewed by: mav@	2019-08-22 20:09:32 +00:00
Warner Losh	acc48026b3	Remove stray line that was duplicated. Noticed by: rpokala@	2019-08-22 02:53:51 +00:00
Warner Losh	93289cfcd2	Create an AHCI attachment for nvme. Intel has created RST and many laptops from vendors like Lenovo and Asus. It's a mechanism for creating multiple boot devices under Windows. It effectively hides the nvme drive inside of the ahci controller. The details are supposed to be a trade secret. However, there's a reverse engineered Linux driver, and this implements similar operations to allow nvme drives to attach. The ahci driver attaches nvme children that proxy the remapped resources to the child. nvme_ahci is just like nvme_pci, except it doesn't do the PCI specific things. That's moved into ahci where appropriate. When the nvme drive is remapped, MSI-x interrupts aren't forwarded (the Linux driver doesn't know how to use this either). INTx interrupts are used instead. This is suboptimal, but usually sufficient for the laptops these parts are in. This is based loosely on https://www.spinics.net/lists/linux-ide/msg53364.html submitted, but not accepted by, Linux. It was written by Dan Williams. These changes were written from scratch by Olivier Houchard. Submitted by: cognet@ (Olivier Houchard)	2019-08-21 22:18:01 +00:00
Warner Losh	f182f928db	Separate the pci attachment from the rest of nvme Nvme drives can be attached in a number of different ways. Separate out the PCI attachment so that we can have other attachment types, like ahci and various types of NVMeoF. Submitted by: cognet@	2019-08-21 22:17:55 +00:00
Alexander Motin	71a2818142	Improve NVMe hot unplug handling. If device is unplugged from the system (CSTS register reads return 0xffffffff), it makes no sense to send any more recovery requests or expect any responses back. If there is a detach call in such state, just stop all activity and free resources. If there is no detach call (hot-plug is not supported), rely on normal timeout handling, but when it trigger controller reset, do not wait for impossible and quickly report failure. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-21 20:17:30 +00:00
Alexander Motin	51b92c1af6	Formalize NVMe controller consumer life cycle. This fixes possible double call of fail_fn, for example on hot removal. It also allows ctrlr_fn to safely return NULL cookie in case of failure and not get useless ns_fn or fail_fn call with NULL cookie later. MFC after: 2 weeks	2019-08-21 02:17:39 +00:00
Alexander Motin	97be8b969d	Report NOIOB and NPWG fields as stripe size. Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace Preferred Write Granularity added in 1.4 allow upper layers to align I/Os for improved SSD performance and endurance. I don't have hardware reporting those yet, but NPWG could probably be reported by bhyve. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-14 16:12:03 +00:00
Alexander Motin	70d20ed34f	Add `nvmecontrol resv` to handle NVMe reservations. NVMe reservations are quite alike to SCSI persistent reservations and can be used in clustered setups with shared multiport storage. MFC after: 10 days Relnotes: yes Sponsored by: iXsystems, Inc.	2019-08-05 17:36:00 +00:00
Alexander Motin	a6d222eb68	Add more random bits from NVMe 1.4. MFC after: 2 weeks	2019-08-03 02:36:35 +00:00
Alexander Motin	6c99d1325e	Decode few more NVMe log pages. In particular: Changed Namespace List, Commands Supported and Effects, Reservation Notification, Sanitize Status. Add few new arguments to `nvmecontrol log` subcommand. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-02 20:16:21 +00:00
Alexander Motin	8dafbebdd7	Fix typo in r350529. MFC after: 2 weeks	2019-08-02 04:04:18 +00:00
Alexander Motin	90dfa8f0ac	Add more new fields and values from NVMe 1.4. MFC after: 2 weeks	2019-08-02 03:43:24 +00:00
Alexander Motin	a7bf63be69	Add IOCTL to translate nvdX into nvmeY and NSID. While very useful by itself, it also makes `nvmecontrol` not depend on hardcoded device names parsing, that in its turn makes simple to take nvdX (and potentially any other) device names as arguments. Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them interchangeable for management purposes. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-01 21:44:07 +00:00
Alexander Motin	8de2d8c009	Add some new fields and bits from NVMe 1.4. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-07-29 03:28:46 +00:00
Warner Losh	08a607e0f3	Widen the type for to. The timeout field in the CAPS register is defined to be 8 bits, so its type was uint8_t. We recently started adding 1 to it to cope with rogue devices that listed 0 timeout time (which is impossible). However, in so doing, other devices that list 0xff (for a 2 minute timeout) were broken when adding 1 overflowed. Widen the type to be uint32_t like its source register to avoid the issue. Reported by: bapt@	2019-07-25 20:26:21 +00:00
Warner Losh	5e83c2ffaa	Keep track of the number of commands that exhaust their retry limit. While we print failure messages on the console, sometimes logs are lost or overwhelmed. Keeping a count of how many times we've failed retriable commands helps get a magnitude of the problem.	2019-07-19 18:39:24 +00:00
Warner Losh	c37fc318c4	Keep track of the number of retried commands. Retried commands can indicate a performance degradation of an nvme drive. Keep track of the number of retries and report it out via sysctl, just like number of commands and interrupts.	2019-07-19 18:39:18 +00:00
Warner Losh	1071b50a65	Use sysctl + CTLRWTUN for hw.nvme.verbose_cmd_dump. Also convert it to a bool. While the rest of the driver isn't yet bool clean, this will help. Reviewed by: cem@ Differential Revision: https://reviews.freebsd.org/D20988	2019-07-19 00:32:56 +00:00
Warner Losh	c75bdc044d	Provide new tunable hw.nvme.verbose_cmd_dump The nvme drive dumps only the most relevant details about a command when it fails. However, there are times this is not sufficient (such as debugging weird issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1 in loader.conf will enable more complete debugging information about each command that fails. Reviewed by: rpokala Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20988	2019-07-18 21:58:51 +00:00
Warner Losh	62d2cf1847	Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers. These macros make places where we extract these easier to read. The shift and mask stuff is also a bit tedious and error prone. Start with the CAP_LO and CAP_HI registers since their scope is somewhat constrained. This is style change only, no functional changes. Reviewed by: chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20979	2019-07-18 15:41:10 +00:00
Warner Losh	204498d7c2	Remove now-obsolete comment.	2019-07-17 20:43:14 +00:00
Warner Losh	dc9df3a59d	Assume that the timeout value from the capacity is 1-based Neither the 1.3 nor 1.4 standards say this number is 1's based, but adding 1 costs little and copes with those NVMe drives that report '0' in this field cheaply. This is consistent with what the Linux driver does as well.	2019-07-16 22:55:30 +00:00
Chuck Tuffli	b1f1471064	Fix nda(4) PCIe link status output Differentiate between PCI Express Endpoint devices and Root Complex Integrated Endpoints in the nda driver. The Link Status and Capability registers are not valid for Integrated Endpoints and should not be displayed. The bhyve emulated NVMe device will advertise as being an Integrated Endpoint. Reviewed by: imp Approved by: imp (mentor) Differential Revision: https://reviews.freebsd.org/D20282	2019-06-07 18:34:48 +00:00
Warner Losh	d0aaeffdb4	Since a fatal trap can happen at arbitrary times, don't panic when the completions are not in a consistent state. Cope with the different places the normal I/O completion polling thread can be interrupted and then re-entered during a kernel panic + dump. Reviewed by: jhb and markj (both prior versions) Differential Revision: https://reviews.freebsd.org/D20478	2019-06-01 15:37:44 +00:00
Warner Losh	9835d216d8	rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to match init/uninit scenarios. Signed-off-by: John Meneghini <johnm@netapp.com> Submitted by: Michael Hordijk <hordijk@netapp.com> Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D19781	2019-05-08 20:18:11 +00:00
Alexander Motin	1aed499575	Decode Deallocate Logical Block Features. MFC after: 1 week	2019-05-05 15:47:21 +00:00
Warner Losh	2ffd6fce5b	Don't print all the I/O we abort on a reset, unless we're out of retries. When resetting the controller, we abort I/O. Prior to this fix, we printed a ton of abort messages for I/O that we're going to retry. This imparts no useful information. Stop printing them unless our retry count is exhausted. Clarify code for when we don't retry, and remove useless arg to a routine that's always called with it as 'true'. All the other debug is still printed (including multiple reset messages if we have multiple timeouts before the taskqueue runs the actual reset) so that we know when we reset. Reviewed by: jimharris@, chuck@ Differential Revision: https://reviews.freebsd.org/D19431	2019-03-09 01:18:16 +00:00
Warner Losh	95108cadbc	Add ABORTED_BY_REQUEST to the list of things we look at DNR bit and tell why to comment (code already does this)	2019-03-03 03:36:33 +00:00
Warner Losh	45d7e233a5	Unconditionally support unmapped BIOs. This was another shim for supporting older kernels. However, all supported versions of FreeBSD have unmapped I/Os (as do several that have gone EOL), remove it. It's unlikely the driver would work on the older kernels anyway at this point.	2019-02-27 22:16:59 +00:00
Warner Losh	d706306d49	Remove #ifdef code to support FreeBSD versions that haven't been supported in years. A number of changes have been made to the driver that likely wouldn't work on those older versions that aren't properly ifdef'd and it's project policy to GC such code once it is stale.	2019-02-27 22:05:01 +00:00
Warner Losh	52467047aa	Regularize the Netflix copyright Use recent best practices for Copyright form at the top of the license: 1. Remove all the All Rights Reserved clauses on our stuff. Where we piggybacked others, use a separate line to make things clear. 2. Use "Netflix, Inc." everywhere. 3. Use a single line for the copyright for grep friendliness. 4. Use date ranges in all places for our stuff. Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)	2019-02-04 21:28:25 +00:00
Gleb Smirnoff	756a541279	Allocate pager bufs from UMA instead of 80-ish mutex protected linked list. o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many pbufs are we going to have set. In various subsystems that are going to utilize pbufs create private zones via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(), and sets a limit on created zone. After startup preallocate pbufs according to requirements of all pbuf zones. Subsystems that used to have a private limit with old allocator now have private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS, swap, vnode pager. The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9), aio(4). They should have their private limits, but changing that is out of scope of this commit. o Fetch tunable value of kern.nswbuf from init_param2() and while here move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only this option. Default values aren't touched by this commit, but they probably should be reviewed wrt to modern hardware. This change removes a tight bottleneck from sendfile(2) operation, that uses pbufs in vnode pager. Other pagers also would benefit from faster allocation. Together with: gallatin Tested by: pho	2019-01-15 01:02:16 +00:00
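
The doorbell stride commit (f93b7f954e) in the log above says the per-access offset math was replaced by a load of a pre-computed value. A minimal sketch of that idea follows, using the doorbell offset formula from the NVMe specification (doorbells start at 0x1000 and are spaced `4 << CAP.DSTRD` bytes apart). The struct and function names are hypothetical and are not taken from the driver source.

```c
#include <assert.h>
#include <stdint.h>

#define DOORBELL_BASE 0x1000u	/* offset of the first doorbell register */

/* Hypothetical per-queue-pair state; not the driver's real structure. */
struct qpair_doorbells {
	uint32_t sq_tdbl_off;	/* submission queue tail doorbell offset */
	uint32_t cq_hdbl_off;	/* completion queue head doorbell offset */
};

/*
 * Compute both offsets once at queue setup, so the I/O path only has to
 * load a stored value instead of redoing the shift and multiply.
 */
static void
qpair_doorbells_init(struct qpair_doorbells *db, uint32_t qid, uint32_t dstrd)
{
	uint32_t stride;

	assert(dstrd <= 0xf);	/* CAP.DSTRD is a 4-bit field */
	stride = 4u << dstrd;	/* DSTRD 0 -> 4 bytes, DSTRD 1 -> 8 bytes */
	db->sq_tdbl_off = DOORBELL_BASE + (2 * qid) * stride;
	db->cq_hdbl_off = DOORBELL_BASE + (2 * qid + 1) * stride;
}
```

With a stride value of 1, consecutive doorbells land 8 bytes apart, which matches the "8 byte separation" the commit message describes.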


260 Commits
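
Commits dc9df3a59d and 08a607e0f3 in the log describe adding 1 to the 8-bit timeout field and then widening its type after a device reporting 0xff overflowed. The sketch below illustrates that wraparound in isolation. The function names are hypothetical, and the 500 ms unit is taken from the NVMe specification's definition of CAP.TO rather than from the commit text.

```c
#include <stdint.h>

/* Narrow variant: the sum is truncated back to 8 bits, so 0xff + 1 wraps to 0. */
static uint32_t
timeout_ms_narrow(uint8_t cap_to)
{
	uint8_t to = (uint8_t)(cap_to + 1);

	return (to * 500u);
}

/* Widened variant: the sum is kept in 32 bits, so 0xff + 1 stays 256. */
static uint32_t
timeout_ms_wide(uint8_t cap_to)
{
	uint32_t to = (uint32_t)cap_to + 1;

	return (to * 500u);
}
```

For a device reporting 0xff, the narrow variant yields a zero timeout, while the widened one yields 128000 ms, roughly the two minutes mentioned in the commit message.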