freebsd-dev

Author	SHA1	Message	Date
Alexander Motin	4fbbe52365	nvme: Replace potentially long DELAY() with pause(). In some cases like broken hardware nvme(4) may wait minutes for controller response before timeout. Doing so in a tight spin loop made whole system unresponsive. Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29309 Sponsored by: iXsystems, Inc.	2021-03-17 10:35:49 -04:00
Warner Losh	8423f5d4c1	nvme: use config_intrhook_drain to avoid removable card races nvme drives are configured early in boot. However, a number of the configuration steps takes which take a while, so we defer those to a config intrhook that runs before the root filesystem is mounted. At the same time, the PCI hot plug wakes up and tests the status of the card. It may decide that the card has gone away and deletes the child. As part of that process nvme_detach is called. If this call happens after the config_intrhook starts to run, but before it is finished, there's a race where we can tear down the device's soft state while the config_intrhook is still using it. Use the new config_intrhook_drain to disestablish the hook. Either it will be removed w/o running, or the routine will wait for it to finish. This closes the race and allows safe hotplug at any time, even very early in boot. Sponsored by: Netflix, Inc Reviewed by: jhb, mav Differential Revision: https://reviews.freebsd.org/D29006	2021-03-11 09:45:10 -07:00
Warner Losh	dd2516fc07	nvme: Make nvme_ctrlr_hw_reset static nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so make it static. If we need to change this in the future we can.	2021-02-08 13:29:24 -07:00
Warner Losh	9600aa31aa	nvme: use NVME_GONE rather than hard-coded 0xffffffff Make it clearer that the value 0xfffffff is being used to detect the device is gone. We use it other places in the driver for other meanings.	2021-02-08 13:08:48 -07:00
Alexander Motin	1770bae5f8	Remove aligment requirements for passthrough buffer. After r368124 vmapbuf() should happily map misaligned maxphys-sized buffers thanks to extra page added to pbuf_zone.	2020-11-29 00:57:19 +00:00
Alexander Motin	ac90f70d1e	Increase nvme(4) maximum transfer size from 1MB to 2MB. With 4KB page size the 2MB is the maximum we can address with one page PRP. Going further would require chaining, that would add some more complexity. On the other side, to reduce memory consumption, allocate the PRP memory respecting maximum transfer size reported in the controller identify data. Many of NVMe devices support much smaller values, starting from 128KB. To do that we have to change the initialization sequence to pull the data earlier, before setting up the I/O queue pairs. The admin queue pair is still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal, since there is only one such queue with only 16 trackers. Reviewed by: imp MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-11-29 00:20:31 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allow several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initialize maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embeded structures, get private MAXPHYS-like constant; their convertion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, where either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Alexander Motin	0bed3eabc5	Add PMRCAP printing and fix earlier CAP_HI. MFC after: 3 days	2020-11-14 01:45:34 +00:00
Alexander Motin	46fbd8004f	Fix panic if NVMe is detached before the intrhook call. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-11-12 20:20:43 +00:00
Alexander Motin	c44441f8fd	Print NVMe controller capabilities in verbose dmesg. Those values are not reported in controller identification, while sometimes interesting for development and debugging. MFC after: 1 week	2020-10-28 15:43:29 +00:00
Brooks Davis	44ca4575ea	vmapbuf: don't smuggle address or length in buf Instead, add arguments to vmapbuf. Since this argument is always a pointer use a type of void * and cast to vm_offset_t in vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to take a pointer and check its bounds.) In no other situtation does b_data contain a user pointer and vmapbuf replaces b_data with the actual mapping. Suggested by: jhb Reviewed by: imp, jhb Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26784	2020-10-21 16:00:15 +00:00
Alexander Motin	915f019715	Use RTD3 Entry Latency value as shutdown timeout. This field was not in specs when the driver was written, but now there are SSDs with the reported latency of 10s, where hardcoded value of 5s seems to be not enough sometimes, causing shutdown timeout messages. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-10-14 15:50:28 +00:00
David Bright	e32d47f32d	Add an ioctl to get an NVMe device's maximum transfer size Reviewed by: imp, chuck Obtained from: Dell EMC Isilon MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26390	2020-09-21 15:41:47 +00:00
Mateusz Guzik	d87b31e159	nvme: clean up empty lines in .c and .h files	2020-09-01 22:03:10 +00:00
Warner Losh	881534f09c	Use symbolic names for asych events Rather than \|= 0x300, define and use asyn event names for the name space changes and the firmware activations that we're asking for.	2020-08-31 19:38:03 +00:00
Alexander Motin	701267ad19	Fix few panics on NVMe's timing out initialization requests. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-25 20:29:29 +00:00
Alexander Motin	ead7e10308	Make polled request timeout less invasive. Instead of panic after one second of polling, make the normal timeout handler to activate, reset the controller and abort the outstanding requests. If all of it won't happen within 10 seconds then something in the driver is likely stuck bad and panic is the only way out. In particular this fixed device hot unplug during execution of those polled commands, allowing clean device detach instead of panic. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-18 19:16:03 +00:00
Alexander Motin	550d5d64fe	Fix admin qpair leak if detached during initial reset. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-17 17:51:40 +00:00
Alexander Motin	92390644e3	Fix config_intrhook leak on initial reset failure. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-12 14:14:01 +00:00
David Bright	4053f8ac4d	Fix various Coverity-detected errors in nvme driver This fixes several Coverity-detected errors in the nvme driver. CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975, 1403980 Reviewed by: imp@, vangyzen@ MFC after: 5 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D24532	2020-05-02 20:47:58 +00:00
Warner Losh	4e6a434b6b	Make sure that we get the sbuf resources we need. Since we're calling sbuf_new with NOWAIT, make sure it can allocate a buffer to use. Don't print anything if we can't get it. Noticed by: rpokala	2020-04-30 00:43:11 +00:00
Warner Losh	244b805397	Generate a devctl event for interesting events When we reset the controller, and when the controller tells us about a critical warning, send an event.	2020-04-30 00:27:19 +00:00
Alexander Motin	b2cdfb72f4	Fix copy-paste bug in HMB free code. MFC after: 2 weeks X-MFC-with: r356474	2020-01-08 18:26:23 +00:00
Alexander Motin	6de4e458fa	Minor adjustments to r356474 and r356480. Reported by: jkim, imp MFC after: 2 weeks X-MFC-with: r356474	2020-01-07 23:29:54 +00:00
Alexander Motin	1c7dd40e58	Increate HMB limit from 1% to 5%. SSD capacity in laptops is growing faster then RAM size, so my original guess seems too low on second thought. Hopefully nobody will build large array of those crappy SSDs. MFC after: 2 weeks X-MFC-with: 356474	2020-01-07 23:10:38 +00:00
Alexander Motin	67abaee9fc	Add Host Memory Buffer support to nvme(4). This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about 1MB per 1GB on the devices I have) for its metadata cache, significantly improving random I/O performance. Device reports minimal and preferable size of the buffer. The code limits it to 1% of physical RAM by default. If the buffer can not be allocated or below minimal size, the device will just have to work without it. MFC after: 2 weeks Relnotes: yes Sponsored by: iXsystems, Inc.	2020-01-07 21:17:11 +00:00
Warner Losh	7588c6cc36	Move to using bool instead of boolean_t While there are subtle semantic differences between bool and boolean_t, none of them matter in these cases. Prefer true/false when dealing with bool type. Preserve a couple of TRUEs since they are passed into int args into CAM. Preserve a couple of FALSEs when used for status.done, an int. Differential Revision: https://reviews.freebsd.org/D20999	2019-12-13 18:35:48 +00:00
Warner Losh	66e5985084	Move reset to the interrutp processing stage This trims the boot time a bit more for AWS and other platforms that have nvme drives. There's no reason too do this inline. This has been in my tree a while, but IIRC I talked to Jim Harris about this at one of our face to face meetings. MFC After: 2 weeks	2019-12-11 22:51:02 +00:00
Alexander Motin	1eab19cbec	Make nvme(4) driver some more NUMA aware. - For each queue pair precalculate CPU and domain it is bound to. If queue pairs are not per-CPU, then use the domain of the device. - Allocate most of queue pair memory from the domain it is bound to. - Bind callouts to the same CPUs as queue pair to avoid migrations. - Do not assign queue pairs to each SMT thread. It just wasted resources and increased lock congestions. - Remove fixed multiplier of CPUs per queue pair, spread them even. This allows to use more queue pairs in some hardware configurations. - If queue pair serves multiple CPUs, bind different NVMe devices to different CPUs. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-09-23 17:53:47 +00:00
Warner Losh	f93b7f954e	Support doorbell strides != 0. The NVMe standard (1.4) states >>> 8.6 Doorbell Stride for Software Emulation >>> The doorbell stride,...is useful in software emulation of an NVM >>> Express controller. ... For hardware implementations of the NVM >>> Express interface, the expected doorbell stride value is 0h. However, hardware in the wild exists with a doorbell stride of 1 (meaning 8 byte separation). This change supports that hardware, as well as software emulators as envisioned in Section 8.6. Since this is the fast path, care has been taken to make this computation efficient. The bit of math to compute an offset for each is replaced by a memory load from cache of a pre-computed value. MFC After: 3 days Reviewed by: scottl@ Differential Revision: https://reviews.freebsd.org/D21514	2019-09-04 20:08:36 +00:00
Warner Losh	4d5475613e	Implement nvme suspend / resume for pci attachment When we suspend, we need to properly shutdown the NVME controller. The controller may go into D3 state (or may have the power removed), and to properly flush the metadata to non-volatile RAM, we must complete a normal shutdown. This consists of deleting the I/O queues and setting the shutodown bit. We have to do some extra stuff to make sure we reset the software state of the queues as well. On resume, we have to reset the card twice, for reasons described in the attach funcion. Once we've done that, we can restart the card. If any of this fails, we'll fail the NVMe card, just like we do when a reset fails. Set is_resetting for the duration of the suspend / resume. This keeps the reset taskqueue from running a concurrent reset, and also is needed to prevent any hw completions from queueing more I/O to the card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get that from the global state of the ctrlr. Wait for any pending reset to finish. All queued I/O will get sent to the hardware as part of nvme_ctrlr_start(), though the upper layers shouldn't send any down. Disabling the qpairs is the other failsafe to ensure all I/O is queued. Rename nvme_ctrlr_destory_qpairs to nvme_ctrlr_delete_qpairs to avoid confusion with all the other destroy functions. It just removes the queues in hardware, while the other _destroy_ functions tear down driver data structures. Split parts of the hardware reset function up so that I can do part of the reset in suspsend. Split out the software disabling of the qpairs into nvme_ctrlr_disable_qpairs. Finally, fix a couple of spelling errors in comments related to this. Relnotes: Yes MFC After: 1 week Reviewed by: scottl@ (prior version) Differential Revision: https://reviews.freebsd.org/D21493	2019-09-03 15:26:11 +00:00
Warner Losh	ab0681aac9	In all the places that we use the polled for completion interface, except crash dump support code, move the while loop into an inline function. These aren't done in the fast path, so if the compiler choses to not inline, any performance hit is tiny.	2019-09-02 17:11:27 +00:00
Warner Losh	8e61280bd9	When we have errors resetting the device before we allocate the queues, don't try to tear them down in the ctrlr_destroy path. Otherwise, we dereference queue structures that are NULL and we trap. This fix is incomplete: we leak IRQ and MSI resources when this happens. That's preferable to a crash but still should be fixed.	2019-08-22 21:56:11 +00:00
Warner Losh	f182f928db	Separate the pci attachment from the rest of nvme Nvme drives can be attached in a number of different ways. Separate out the PCI attachment so that we can have other attachment types, like ahci and various types of NVMeoF. Submitted by: cognet@	2019-08-21 22:17:55 +00:00
Alexander Motin	71a2818142	Improve NVMe hot unplug handling. If device is unplugged from the system (CSTS register reads return 0xffffffff), it makes no sense to send any more recovery requests or expect any responses back. If there is a detach call in such state, just stop all activity and free resources. If there is no detach call (hot-plug is not supported), rely on normal timeout handling, but when it trigger controller reset, do not wait for impossible and quickly report failure. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-21 20:17:30 +00:00
Alexander Motin	a6d222eb68	Add more random bits from NVMe 1.4. MFC after: 2 weeks	2019-08-03 02:36:35 +00:00
Alexander Motin	6c99d1325e	Decode few more NVMe log pages. In particular: Changed Namespace List, Commands Supported and Effects, Reservation Notification, Sanitize Status. Add few new arguments to `nvmecontrol log` subcommand. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-02 20:16:21 +00:00
Alexander Motin	a7bf63be69	Add IOCTL to translate nvdX into nvmeY and NSID. While very useful by itself, it also makes `nvmecontrol` not depend on hardcoded device names parsing, that in its turn makes simple to take nvdX (and potentially any other) device names as arguments. Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them interchangeable for management purposes. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-08-01 21:44:07 +00:00
Warner Losh	08a607e0f3	Widen the type for to. The timeout field in the CAPS register is defined to be 8 bits, so its type was uint8_t. We recently started adding 1 to it to cope with rogue devices that listed 0 timeout time (which is impossible). However, in so doing, other devices that list 0xff (for a 2 minute timeout) were broken when adding 1 overflowed. Widen the type to be uint32_t like its source register to avoid the issue. Reported by: bapt@	2019-07-25 20:26:21 +00:00
Warner Losh	62d2cf1847	Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers. These macros make places where we extract these easier to read. The shift and mask stuff is also a bit tedious and error prone. Start with the CAP_LO and CAP_HI registers since their scope is somewhat constrained. This is style chagne only, no functional changes. Reviewed by: chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20979	2019-07-18 15:41:10 +00:00
Warner Losh	dc9df3a59d	Assume that the timeout value from the capacity is 1-based Neither the 1.3 or 1.4 standards say this number is 1's based, but adding 1 costs little and copes with those NVMe drives that report '0' in this field cheaply. This is consistent with what the Linux driver does as well.	2019-07-16 22:55:30 +00:00
Warner Losh	9835d216d8	rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to match init/uninit scenarios. Signed-off-by: John Meneghini <johnm@netapp.com> Submitted by: Michael Hordijk <hordijk@netapp.com> Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D19781	2019-05-08 20:18:11 +00:00
Warner Losh	2ffd6fce5b	Don't print all the I/O we abort on a reset, unless we're out of retries. When resetting the controller, we abort I/O. Prior to this fix, we printed a ton of abort messages for I/O that we're going to retry. This imparts no useful information. Stop printing them unless our retry count is exhausted. Clarify code for when we don't retry, and remove useless arg to a routine that's always called with it as 'true'. All the other debug is still printed (including multiple reset messages if we have multiple timeouts before the taskqueue runs the actual reset) so that we know when we reset. Reviewed by: jimharris@, chuck@ Differential Revision: https://reviews.freebsd.org/D19431	2019-03-09 01:18:16 +00:00
Warner Losh	45d7e233a5	Unconditionally support unmapped BIOs. This was another shim for supporting older kernels. However, all supported versions of FreeBSD have unmapped I/Os (as do several that have gone EOL), remove it. It's unlikely the driver would work on the older kernels anyway at this point.	2019-02-27 22:16:59 +00:00
Gleb Smirnoff	756a541279	Allocate pager bufs from UMA instead of 80-ish mutex protected linked list. o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many pbufs are we going to have set. In various subsystems that are going to utilize pbufs create private zones via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(), and sets a limit on created zone. After startup preallocate pbufs according to requirements of all pbuf zones. Subsystems that used to have a private limit with old allocator now have private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS, swap, vnode pager. The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9), aio(4). They should have their private limits, but changing that is out of scope of this commit. o Fetch tunable value of kern.nswbuf from init_param2() and while here move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only this option. Default values aren't touched by this commit, but they probably should be reviewed wrt to modern hardware. This change removes a tight bottleneck from sendfile(2) operation, that uses pbufs in vnode pager. Other pagers also would benefit from faster allocation. Together with: gallatin Tested by: pho	2019-01-15 01:02:16 +00:00
Warner Losh	91182bcfb6	Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli (and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3. Sponsored by: Netflix	2018-12-07 21:58:08 +00:00
Chuck Tuffli	9544e6dcf1	Make NVMe compatible with the original API The original NVMe API used bit-fields to represent fields in data structures defined by the specification (e.g. the op-code in the command data structure). The implementation targeted x86_64 processors and defined the bit fields for little endian dwords (i.e. 32 bits). This approach does not work as-is for big endian architectures and was changed to use a combination of bit shifts and masks to support PowerPC. Unfortunately, this changed the NVMe API and forces #ifdef's based on the OS revision level in user space code. This change reverts to something that looks like the original API, but it uses bytes instead of bit-fields inside the packed command structure. As a bonus, this works as-is for both big and little endian CPU architectures. Bump __FreeBSD_version to 1200081 due to API change Reviewed by: imp, kbowling, smh, mav Approved by: imp (mentor) Differential Revision: https://reviews.freebsd.org/D16404	2018-08-22 04:29:24 +00:00
Alexander Motin	f439e3a4ff	Refactor NVMe CAM integration. - Remove layering violation, when NVMe SIM code accessed CAM internal device structures to set pointers on controller and namespace data. Instead make NVMe XPT probe fetch the data directly from hardware. - Cleanup NVMe SIM code, fixing support for multiple namespaces per controller (reporting them as LUNs) and adding controller detach support and run-time namespace change notifications. - Add initial support for namespace change async events. So far only in CAM mode, but it allows run-time namespace arrival and departure. - Add missing nvme_notify_fail_consumers() call on controller detach. Together with previous changes this allows NVMe device detach/unplug. Non-CAM mode still requires a lot of love to stay on par, but at least CAM mode code should not stay in the way so much, becoming much more self-sufficient. Reviewed by: imp MFC after: 1 month Sponsored by: iXsystems, Inc.	2018-05-25 03:34:33 +00:00
Alexander Motin	c252f63740	Fix LOR between controller and queue locks. Admin pass-through requests took controller lock before the queue lock, but in case of request submission to a failed controller controller lock was taken after the queue lock. Fix that by reducing the lock scopes and switching to mtx_pool locks to track pass-through request completion. Sponsored by: iXsystems, Inc.	2018-05-02 20:13:03 +00:00
Alexander Motin	e134ecdcfc	Improve nvme(4) attach/detach sequences. This change allows clean device detach on attach failures and driver unload, while previous code tried to talk to already shut down controller, or even accessed resources failed to allocate. Sponsored by: iXsystems, Inc.	2018-04-30 23:05:57 +00:00

1 2 3

130 Commits