freebsd-nq

Author	SHA1	Message	Date
Colin Percival	bad42df9bf	Add some nvme initialization routines to TSLOG About 335 ms of EC2 instance boot time is being spent here.	2021-09-05 12:48:43 -07:00
Alexander Motin	e3bdf3da76	nvme(4): Add MSI and single MSI-X support. If we can't allocate more MSI-X vectors, accept using single shared. If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but accept single shared. If still no luck, fall back to shared INTx. This provides maximal flexibility in some limited scenarios. For example, vmd(4) does not support INTx and can handle only limited number of MSI/MSI-X vectors without sharing. MFC after: 1 week	2021-08-31 13:45:46 -04:00
Alexander Motin	31111372e6	nvme(4): Do not panic on admin queue construct error. MFC after: 1 week	2021-08-30 20:38:23 -04:00
Alexander Motin	b776de6796	Mark some sysctls as CTLFLAG_MPSAFE. MFC after: 2 weeks	2021-08-10 20:44:27 -04:00
Warner Losh	fc9a084023	nvme: Enable interrupts after qpair fully constructed To guard against the ill effects of a spurious interrupt during construction (or one that was bogusly pending), enable interrupts after the qpair is completely constructed. Otherwise, we can die with null pointer dereferences in nvme_qpair_process_completions. This has been observed in at least one pre-release NVMe drive where the MSIX interrupt fired while the queue was being created, before we'd started the NVMe controller card. The alternative of only turning on the interrupts after the rest was tried, but was insufficient to work around this bug and made the code more complicated w/o benefit. Reviewed by: mav, chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31182	2021-07-15 16:17:23 -06:00
Alexander Motin	e3bcd07d83	nvme(4): Report NPWA before NPWG as stripesize. New Samsung 980 SSDs report Namespace Preferred Write Alignment of 8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB). My quick tests show that 16KB is a minimal sequential write size when the SSD reaches peak IOPS, so writing much less is very slow. But writing slightly less or slightly more does not change much, so it seems not so much a size granularity as minimum I/O size. Thinking about different stripesize consumers: - Partition alignment should be based on NPWA by definition. - ZFS ashift in part of forcing alignment of all I/Os should also be based on NPWA. In part of forcing size granularity, if really needed, it may be set to NPWG, but too big value can make ZFS too space-inefficient, and the 16KB is actually the biggest supported value there now. - ZFS recordsize/volblocksize could potentially be tuned up toward NPWG to work as I/O size granularity, but enabled compression makes it too fuzzy. And those are normally user-configurable things. - ZFS I/O aggregation code could definitely use Optimal Write Size value and may be NPWG, but we don't have fields in GEOM now to report the minimal and optimal I/O sizes, and even maximal is not reported outside GEOM DISK to be used by ZFS. MFC after: 1 week	2021-07-05 23:13:15 -04:00
Warner Losh	aa0ab681ae	nvme: coherently read status of completion records Coherently read the phase bit of the status completion record. We loop over the completion record array, looking for all the transactions in the same phase that have been completed. In doing that, we have to be careful to read the status field first, and if it indicates a complete record, we need to read and process that record. Otherwise, the host might be overtaken by device when reading this completion record, leading to a mistaken belief that the record is in phase. This leads to the code using old values and looking at an already completed entry, which has no current tracker. To work around this problem, we read the status and make sure it is in phase, we then re-read the entire completion record guaranteeing it's complete, valid, and consistent . In addition we resync the dmatag to reflect changes since the prior loop for the bouncing dma case. Reviewed by: jrtc27@, chuck@ Found by: jrtc27 (this fix is based in part on her D30995 fix) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31002	2021-07-02 16:05:19 -06:00
Warner Losh	fea3cf1d6d	nvme: Fix alignment on nvme structures Remove __packed from nvme_command, nvme_completion and nvme_dsm_trim. Add super-alignment to nvme_completion since it's always at least that aligned in hardware (and in our existing uses of it embedded in structures). It generates better code in nvme_qpair_process_completions on riscv64 because otherwise the ABI assumes a 4-byte alignment, and the same on all other platforms. Reviewed by: jrtc27@, mav@, chuck@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31001	2021-07-02 16:05:19 -06:00
Warner Losh	80a75155e1	nvme: style nit Put the { on the same line as the struct nvme_foo when we define these structures. It's FreeBSD standard and these were inconsistent. Sponsored by: Netflix	2021-07-02 16:05:19 -06:00
Warner Losh	f0f4712165	nvme: fix a race between failing the controller and failing requests Part of the nvme recovery process for errors is to reset the card. Sometimes, this results in failing the entire controller. When nda is in use, we free the sim, which will sleep until all the I/O has completed. However, with only one thread, the request fail task never runs once the reset thread sleeps here. Create two threads to allow I/O to fail until it's all processed and the reset task can proceed. This is a temporary kludge until I can work out questions that arose during the review, not least is what was the race that queueing to a failure task solved. The original commit is vague and other error paths in the same context do a direct failure. I'll investigate that more completely before committing changing that to a direct failure. mav@ raised this issue during the review, but didn't otherwise object. Multiple threads, though, solve the problem in the mean time until other such means can be perfected. Reviewed by: jhb@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30366	2021-05-28 23:05:40 -06:00
Dmitry Chagin	a78109d5db	Partially revert r248770. Under geom(4) nvme_ns_bio_process() is on the path where sleep is prohibited as g_io_shedule_down() calls THREAD_NO_SLEEPNG() before geom->start(). Reviewed By: imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29539	2021-04-02 11:43:17 +03:00
Alexander Motin	4fbbe52365	nvme: Replace potentially long DELAY() with pause(). In some cases like broken hardware nvme(4) may wait minutes for controller response before timeout. Doing so in a tight spin loop made whole system unresponsive. Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29309 Sponsored by: iXsystems, Inc.	2021-03-17 10:35:49 -04:00
Warner Losh	8423f5d4c1	nvme: use config_intrhook_drain to avoid removable card races nvme drives are configured early in boot. However, a number of the configuration steps takes which take a while, so we defer those to a config intrhook that runs before the root filesystem is mounted. At the same time, the PCI hot plug wakes up and tests the status of the card. It may decide that the card has gone away and deletes the child. As part of that process nvme_detach is called. If this call happens after the config_intrhook starts to run, but before it is finished, there's a race where we can tear down the device's soft state while the config_intrhook is still using it. Use the new config_intrhook_drain to disestablish the hook. Either it will be removed w/o running, or the routine will wait for it to finish. This closes the race and allows safe hotplug at any time, even very early in boot. Sponsored by: Netflix, Inc Reviewed by: jhb, mav Differential Revision: https://reviews.freebsd.org/D29006	2021-03-11 09:45:10 -07:00
Warner Losh	dd2516fc07	nvme: Make nvme_ctrlr_hw_reset static nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so make it static. If we need to change this in the future we can.	2021-02-08 13:29:24 -07:00
Warner Losh	9600aa31aa	nvme: use NVME_GONE rather than hard-coded 0xffffffff Make it clearer that the value 0xfffffff is being used to detect the device is gone. We use it other places in the driver for other meanings.	2021-02-08 13:08:48 -07:00
Chuck Tuffli	e83fdf8bb3	fix big-endian platforms after 6733401935f8 The NVMe byte-swap routines for big-endian platforms used memcpy() to move the unaligned 64-bit value into a temp register to byte swap it. Instead of introducing a dependency, manually byte-swap the values in place. Point hat: me	2021-01-08 14:41:45 -08:00
Chuck Tuffli	6733401935	nvmecontrol: add device self-test op and log page Add decoding of the Device Self-test log page and the ability to start or abort a test. Reviewed by: imp, mav Tested by: Muhammad Ahmad <muhammad.ahmad@seagate.com> MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D27517	2021-01-08 09:27:56 -08:00
Warner Losh	082905cad1	nvme: Remove a wmb() that's not necessary. bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can be DMA'd immediately after it returns. The details differ, but this mirrors atomic thread release semantics, at least for the buffers synced. For non-x86 platforms, bus_dmamap_sync() has the right syncing and fences. So in the past, wmb() had been omitted for them. For x86 platforms, the memory ordering is already strong enough to ensure DMA to the device sees the current contents. As such, we don't need the wmb() here. It translates to an sfence which is only needed for writes to regions that have the write combining attribute set or when some exotic opcodes are used. The nvme driver does neither of these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we can be assured any optimizer won't reorder the bus_dmamap_sync and the bus_space_write operations. The wmb() was a vestiage of the pre-busdma version initially committed to the tree. Reviewed by: kib@, gallatin@, chuck@, mav@ Differential Revision: https://reviews.freebsd.org/D27448	2020-12-04 21:34:48 +00:00
Michal Meloun	8f9d5a8dbf	NVME: Multiple busdma related fixes. - in nvme_qpair_process_completions() do dma sync before completion buffer is used. - in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure that all CPU stores are visible to external (including DMA) observers. - Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems, buffers continuously owned (and accessed) by DMA must be allocated with this flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems (or coherent buses in mixed systems). MFC after: 4 weeks Reviewed by: mav, imp Differential Revision: https://reviews.freebsd.org/D27446	2020-12-02 16:54:24 +00:00
Chuck Tuffli	8d08cdc721	nvme: Fix typo in definition Change occurrences of "selt test" to "self tests in the NVMe header file. Reviewed by: imp, mav MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27439	2020-12-02 15:59:08 +00:00
Michal Meloun	cf7c062932	Always use the __unused attribute even for potentially unused parameters. Requested by: ian, imp MFC with: r368167	2020-12-01 08:52:13 +00:00
Michal Meloun	b2e9e573a3	Unbreak r368167 in userland. Decorate unused arguments. Reported by: kp, tuexen, jenkins, and many others MFC with: r368167	2020-11-30 14:51:48 +00:00
Michal Meloun	52a832072d	NVME: Don't try to swap data on little endian machines. These swapping functions violate BUSDMA contract - we cannot write to armed (by bus_dmamap_sync(PRE_..)) buffers. Remove them at least from little endian machines until a better solution will be developed. Reviewed by: imp MFC after: 3 weeks	2020-11-30 07:01:12 +00:00
Alexander Motin	1770bae5f8	Remove aligment requirements for passthrough buffer. After r368124 vmapbuf() should happily map misaligned maxphys-sized buffers thanks to extra page added to pbuf_zone.	2020-11-29 00:57:19 +00:00
Alexander Motin	ac90f70d1e	Increase nvme(4) maximum transfer size from 1MB to 2MB. With 4KB page size the 2MB is the maximum we can address with one page PRP. Going further would require chaining, that would add some more complexity. On the other side, to reduce memory consumption, allocate the PRP memory respecting maximum transfer size reported in the controller identify data. Many of NVMe devices support much smaller values, starting from 128KB. To do that we have to change the initialization sequence to pull the data earlier, before setting up the I/O queue pairs. The admin queue pair is still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal, since there is only one such queue with only 16 trackers. Reviewed by: imp MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-11-29 00:20:31 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allow several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initialize maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embeded structures, get private MAXPHYS-like constant; their convertion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, where either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Michal Meloun	9138770728	Ensure that the buffer is in nvme_single_map() mapped to single segment. Not a functional change. MFC after: 1 week	2020-11-23 14:30:22 +00:00
Alexander Motin	0bed3eabc5	Add PMRCAP printing and fix earlier CAP_HI. MFC after: 3 days	2020-11-14 01:45:34 +00:00
Alexander Motin	46fbd8004f	Fix panic if NVMe is detached before the intrhook call. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-11-12 20:20:43 +00:00
Mateusz Guzik	71460dfcb2	nvme: change namei_request_zone into a malloc type Both the size (128 bytes) and ephemeral nature of allocations make it a great fit for malloc. A dedicated zone unnecessarily avoids sharing buckets with 128-byte objects. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D27103	2020-11-05 21:44:58 +00:00
Alexander Motin	6dd1985bad	Fix unintentional constant rename in r367109. MFC after: 1 week	2020-10-28 18:22:25 +00:00
Alexander Motin	c44441f8fd	Print NVMe controller capabilities in verbose dmesg. Those values are not reported in controller identification, while sometimes interesting for development and debugging. MFC after: 1 week	2020-10-28 15:43:29 +00:00
Warner Losh	0fc1d20881	nvme: Remove compat code for older kernels Remove code that supported pre-2011 kernels. CTLTYPE_S64 was defined in rev 217616. All supported branches have it, so remove its compat definition as OBE.	2020-10-24 01:59:01 +00:00
Brooks Davis	44ca4575ea	vmapbuf: don't smuggle address or length in buf Instead, add arguments to vmapbuf. Since this argument is always a pointer use a type of void * and cast to vm_offset_t in vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to take a pointer and check its bounds.) In no other situtation does b_data contain a user pointer and vmapbuf replaces b_data with the actual mapping. Suggested by: jhb Reviewed by: imp, jhb Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26784	2020-10-21 16:00:15 +00:00
Alexander Motin	915f019715	Use RTD3 Entry Latency value as shutdown timeout. This field was not in specs when the driver was written, but now there are SSDs with the reported latency of 10s, where hardcoded value of 5s seems to be not enough sometimes, causing shutdown timeout messages. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-10-14 15:50:28 +00:00
David Bright	e32d47f32d	Add an ioctl to get an NVMe device's maximum transfer size Reviewed by: imp, chuck Obtained from: Dell EMC Isilon MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26390	2020-09-21 15:41:47 +00:00
Mateusz Guzik	d87b31e159	nvme: clean up empty lines in .c and .h files	2020-09-01 22:03:10 +00:00
Warner Losh	881534f09c	Use symbolic names for asych events Rather than \|= 0x300, define and use asyn event names for the name space changes and the firmware activations that we're asking for.	2020-08-31 19:38:03 +00:00
Alexander Motin	97dc595da2	Report cpi->hba_* for nda(4) because why not. MFC after: 1 week	2020-08-12 20:05:43 +00:00
Mark Johnston	96ad26eefb	Remove free_domain() and uma_zfree_domain(). These functions were introduced before UMA started ensuring that freed memory gets placed in domain-local caches. They no longer serve any purpose since UMA now provides their functionality by default. Remove them to simplyify the kernel memory allocator interfaces a bit. Reviewed by: cem, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25937	2020-08-04 13:58:36 +00:00
Alexander Motin	701267ad19	Fix few panics on NVMe's timing out initialization requests. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-25 20:29:29 +00:00
Alexander Motin	ead7e10308	Make polled request timeout less invasive. Instead of panic after one second of polling, make the normal timeout handler to activate, reset the controller and abort the outstanding requests. If all of it won't happen within 10 seconds then something in the driver is likely stuck bad and panic is the only way out. In particular this fixed device hot unplug during execution of those polled commands, allowing clean device detach instead of panic. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-18 19:16:03 +00:00
Alexander Motin	550d5d64fe	Fix admin qpair leak if detached during initial reset. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-17 17:51:40 +00:00
Alexander Motin	92390644e3	Fix config_intrhook leak on initial reset failure. MFC after: 1 week Sponsored by: iXsystems, Inc.	2020-06-12 14:14:01 +00:00
David Bright	4053f8ac4d	Fix various Coverity-detected errors in nvme driver This fixes several Coverity-detected errors in the nvme driver. CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975, 1403980 Reviewed by: imp@, vangyzen@ MFC after: 5 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D24532	2020-05-02 20:47:58 +00:00
Warner Losh	d5cc572ce6	Add KASSERT to ensure sane nsid. All callers are currently filtering bad nsid to this function, however, we'll have undefined behavior if that's not true. Add the KASSERT to prevent that.	2020-05-01 21:24:19 +00:00
Warner Losh	950475ca20	Rename ns notification function... This function is called whenever the namespace is added, deleted or changes. Update the name to reflect that. No functional change.	2020-05-01 21:24:15 +00:00
Warner Losh	7bc979480e	Style(9) nit: put function name at start of line.	2020-04-30 20:58:38 +00:00
Warner Losh	9cde78942f	Move / reword a comment. Explain what we're doing with mapping CAM's notion of a LUN to NVMe's notion of a namespace.	2020-04-30 20:58:33 +00:00
Warner Losh	4e6a434b6b	Make sure that we get the sbuf resources we need. Since we're calling sbuf_new with NOWAIT, make sure it can allocate a buffer to use. Don't print anything if we can't get it. Noticed by: rpokala	2020-04-30 00:43:11 +00:00

1 2 3 4 5 ...

314 Commits