Reduce traffic to the doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).
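As a rough illustration of the batching (a self-contained toy, not the
driver's actual structures or doorbell I/O):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_ENTRIES 4

    struct toy_qpair {
        bool     valid[NUM_ENTRIES]; /* stands in for the phase check */
        uint32_t cq_head;
        uint32_t doorbell;           /* stands in for the MMIO register */
        uint32_t doorbell_writes;    /* counts the MMIO writes we did */
    };

    static void
    toy_process_completions(struct toy_qpair *q)
    {
        bool found = false;

        while (q->valid[q->cq_head]) {
            found = true;
            q->valid[q->cq_head] = false;   /* "process" the entry */
            if (++q->cq_head == NUM_ENTRIES)
                q->cq_head = 0;
            /* No doorbell write here: that would be one MMIO per entry. */
        }
        if (found) {
            /* One write covers everything consumed in this pass. */
            q->doorbell = q->cq_head;
            q->doorbell_writes++;
        }
    }

    int
    main(void)
    {
        struct toy_qpair q = { .valid = { true, true, true, false } };

        toy_process_completions(&q);
        printf("head %u, doorbell writes %u\n", (unsigned)q.cq_head,
            (unsigned)q.doorbell_writes);
        return (0);
    }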
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32470
We only use nvme_completion_poll in the initialization path. The
commands it queues and waits for finish quickly, as they involve no I/O
to the drive's media. These commands take about 20-200 microseconds
each. Set the wait time to 1us and then increase it by 1.5x each
successive iteration (max 1ms). This reduces initialization time by
80ms in cperciva's tests.
Use this same technique waiting for RDY state transitions. This saves
another 20ms. In total we're down from ~330ms to ~2ms.
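A userspace sketch of the backoff described above (check_done() is a
hypothetical stand-in for polling the completion; the driver itself uses
the kernel's delay primitives):

    #include <stdbool.h>
    #include <unistd.h>

    extern bool check_done(void);   /* hypothetical: has the command completed? */

    void
    poll_with_backoff(void)
    {
        unsigned delay_us = 1;                  /* start at 1us */

        while (!check_done()) {
            usleep(delay_us);
            delay_us = (delay_us * 3 + 1) / 2;  /* grow ~1.5x, rounding up */
            if (delay_us > 1000)
                delay_us = 1000;                /* cap at 1ms */
        }
    }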
Tested by: cperciva
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32259
The FreeBSD nvme driver has reset the nvme controller twice on attach to
address a theoretical concern about ensuring the hardware is in a known
state. However, experience has shown the second reset is unnecessary and
increases the time to boot. Eliminate the second reset. Should there be
a situation where you need a second reset (for buggy, or at least somewhat
out of the mainstream, hardware), the hardware option NVME_2X_RESET will
restore the old behavior. Document this in nvme(4).
If there's any trouble at all with this, I'll add a sysctl tunable to
control it.
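For reference, the old double-reset behavior can be brought back by
building a kernel with the option named above:

    options NVME_2X_RESET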
Sponsored by: Netflix
Reviewed by: cperciva, mav
Differential Revision: https://reviews.freebsd.org/D32241
After some study of the code and the standard, I think we can just drop
the pause(), unconditionally. If we're not initialized, then there's
nothing to wait for from a software perspective. If we are initialized,
then there might be outstanding I/O. If so, then the qpair 'recovery
state' will transition to WAITING in nvme_ctrlr_disable_qpairs, which
will ignore any interrupts for items that complete before we complete
the reset by setting cc.en=0.
If we go on to fail the controller, we'll cancel the outstanding I/O
transactions. If we reset the controller, the hardware throws away
pending transactions and we retry all the pending I/O transactions. Any
transactions that happend to complete before cc.en=0 will have the same
effect in the end (doing the same transaction twice is just inefficient,
it won't affect the state of the device any differently than having done
it once).
The standard imposes no wait times here, so it isn't needed from that
perspective.
Unanswered question: do we need to disable interrupts while we disable
the controller when in legacy interrupt mode, since those interrupts are
level-sensitive?
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32248
The "don't touch the drive's mmio after an EN 1->0 transition" workaround
is only needed for the tiny number of drives that have this unfortunate
issue.
Sponsored by: Netflix
Rewrite the nested ifs using the preferred FreeBSD style for branches
of ifs that return. NFC. Minor tweaks to the comments to better fit the
new code layout.
Sponsored by: Netflix
Reviewed by: mav, chuck (prior rev, but comments rolled in)
Differential Revision: https://reviews.freebsd.org/D32245
Remove the 5ms delays after writing the administrative queue
registers. These delays are from the very earliest days of the driver
(they are in the first commit) and were most likely vestiges of the
Chatham NVMe prototype card that was used to create this driver. Many of
the workarounds necessary for it aren't necessary for standards
compliant cards. The original driver had other areas marked for Chatham,
but these were not. They are unneeded. There are three lines of
supporting evidence.
First, the NVMe standards make no mention of a delay time after these
registers are written. Second, the Linux driver doesn't have them, even
as an option. Third, all my nvme cards work w/o them.
To be safe, add a write barrier between setting up the admin queue and
enabling the controller.
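A sketch of the ordering (reg_write() is a hypothetical MMIO helper
standing in for the driver's register write routine, the offsets follow
the NVMe controller register map, and the C11 fence stands in for the
write barrier):

    #include <stdatomic.h>
    #include <stdint.h>

    extern void reg_write(uint32_t off, uint64_t val);  /* hypothetical MMIO write */

    #define R_CC  0x14  /* controller configuration (EN lives here) */
    #define R_AQA 0x24  /* admin queue attributes */
    #define R_ASQ 0x28  /* admin submission queue base address */
    #define R_ACQ 0x30  /* admin completion queue base address */

    void
    setup_admin_queue_then_enable(uint32_t aqa, uint64_t asq, uint64_t acq,
        uint64_t cc_with_en)
    {
        reg_write(R_AQA, aqa);
        reg_write(R_ASQ, asq);
        reg_write(R_ACQ, acq);

        /*
         * Make sure the admin queue registers are posted before EN is
         * set; no fixed delay is required by the standard.
         */
        atomic_thread_fence(memory_order_release);

        reg_write(R_CC, cc_with_en);
    }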
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32247
Make sure the completion ID is in the range of [0..num_trackers) since
the values past the end of the act_tr array are never going to be valid
trackers and will lead to pain and suffering if we try to dereference
them to get the tracker or to set the tracker back to NULL as we
complete the I/O.
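A minimal sketch of the check (array and sizes are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_TRACKERS 128

    static void *act_tr[NUM_TRACKERS];  /* outstanding-command lookup table */

    static void *
    lookup_tracker(uint16_t cid)
    {
        /*
         * A cid at or past num_trackers can never name a valid tracker,
         * so refuse to index act_tr[] with it.
         */
        if (cid >= NUM_TRACKERS) {
            fprintf(stderr, "bogus cid %u from device\n", (unsigned)cid);
            return (NULL);
        }
        return (act_tr[cid]);
    }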
Sponsored by: Netflix
Reviewed by: mav, chs, chuck
Differential Revision: https://reviews.freebsd.org/D32088
Count the number of times we're asked to process completions but
ignore the request because the qpair's state isn't RECOVERY_NONE.
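Roughly, the guard and the new counter look like this (names are
illustrative):

    #include <stdint.h>

    enum recovery_state { RECOVERY_NONE, RECOVERY_WAITING };

    struct toy_qpair {
        enum recovery_state recovery_state;
        uint64_t            ignored;    /* new stat: calls we skipped */
    };

    static int
    toy_process_completions(struct toy_qpair *q)
    {
        if (q->recovery_state != RECOVERY_NONE) {
            q->ignored++;   /* count it so it shows up in the stats */
            return (0);     /* and process nothing */
        }
        /* ... normal completion processing would go here ... */
        return (1);
    }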
Sponsored by: Netflix
Reviewed by: mav, chuck
Differential Revision: https://reviews.freebsd.org/D32212
The proper phase for the qpair in the first interrupt right after reset
is 1. For it, make sure that we're not still in phase 0. That is an
illegal state in which to be processing interrupts and indicates that
we've failed to properly protect against a race between initializing our
state and processing interrupts. Modify the stat-resetting code so it
resets the number of interrupts to 1 instead of 0 so we don't trigger a
false positive panic.
Sponsored by: Netflix
Reviewed by: cperciva, mav (prior version)
Differential Revision: https://reviews.freebsd.org/D32211
An interrupt happens on the admin queue right away after the reset, so
as soon as we enable interrupts, we'll get a call to our interrupt
handler. It is safe to ignore this interrupt if we're not yet
initialized, or to process it if we are. If we are initialized, we'll
see there's no completion records and return. If we're not, we'll
process no completion records and return. Either way, nothing is
processed and nothing is lost.
Until we've completely setup the qpair, we need to avoid processing
completion records. Start the qpair in the waiting recovery state so we
return immediately when we try to process completions. The code already
sets it to 'NONE' when our initialization is complete. It's safe to
defer completion processing here because we don't send any commands
before the initialization of the software state of the qpair is
complete. And even if we were to somehow send a command prior to that
completing, the completion record for that command would be processed
when we send commands to the admin qpair after we've setup the software
state. There's no good central point to add an assert for this last
condition.
This fixes a KASSERT "received completion for unknown cmd" panic on
boot.
Fixes: 502dc84a8b6703e7c0626739179a3cdffdd22d81
Sponsored by: Netflix
Reviewed by: mav, cperciva, gallatin
Differential Revision: https://reviews.freebsd.org/D32210
Keep track of the approximate time commands are 'due' and the next
deadline for a command. Twice a second, wake up to see if any commands
have timed out. If so, quiesce and then enter a recovery mode half the
timeout further in the future to allow the ISR to complete. Once we exit
recovery mode, we go back to normal operation.
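A self-contained sketch of the state machine (time is in abstract ticks;
the real code also has to cancel or retry the timed-out commands when
leaving recovery):

    #include <stdint.h>

    enum toy_state { OP_NORMAL, OP_RECOVERY };

    struct toy_qpair {
        enum toy_state state;
        int64_t        deadline;      /* earliest 'due' time of any command */
        int64_t        recovery_end;  /* when to leave recovery mode */
        int64_t        timeout_ticks; /* configured per-command timeout */
    };

    /* Called twice a second (e.g. from a callout) with the current time. */
    void
    toy_timeout_poll(struct toy_qpair *q, int64_t now)
    {
        switch (q->state) {
        case OP_NORMAL:
            if (q->deadline != 0 && now > q->deadline) {
                /* A command is overdue: quiesce and give the ISR time. */
                q->state = OP_RECOVERY;
                q->recovery_end = now + q->timeout_ticks / 2;
            }
            break;
        case OP_RECOVERY:
            if (now > q->recovery_end)
                q->state = OP_NORMAL;   /* back to normal operation */
            break;
        }
    }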
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28583
Once the controller has failed, fail all I/O w/o sending it to the
device. The rest of the nvme driver won't schedule any I/O to the
failed device, and the controller is in an indeterminate state and can't
accept I/O. Fail both at the top end of the sim and the bottom
end. Don't bother queueing up the I/O for failure in a different task.
Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31341
If we can't allocate more MSI-X vectors, accept using a single shared one.
If we can't allocate any MSI-X vectors, try to allocate 2 MSI vectors, but
accept a single shared one. If still no luck, fall back to shared INTx.
This provides maximal flexibility in some limited scenarios. For
example, vmd(4) does not support INTx and can handle only a limited
number of MSI/MSI-X vectors without sharing.
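The fallback order, as a sketch (the allocation helpers are hypothetical
stand-ins that return how many vectors were granted):

    /* Hypothetical: ask for 'want' vectors, return how many were granted. */
    extern int try_alloc_msix(int want);
    extern int try_alloc_msi(int want);

    enum irq_mode { MODE_MSIX, MODE_MSIX_SHARED, MODE_MSI, MODE_MSI_SHARED,
        MODE_INTX };

    enum irq_mode
    pick_interrupt_mode(int want)
    {
        int got;

        if ((got = try_alloc_msix(want)) >= want)
            return (MODE_MSIX);         /* one vector per queue pair */
        if (got >= 1)
            return (MODE_MSIX_SHARED);  /* fewer than wanted: share one */
        if ((got = try_alloc_msi(2)) >= 2)
            return (MODE_MSI);          /* the two MSI vectors we asked for */
        if (got >= 1)
            return (MODE_MSI_SHARED);
        return (MODE_INTX);             /* last resort: shared INTx */
    }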
MFC after: 1 week
To guard against the ill effects of a spurious interrupt during
construction (or one that was bogusly pending), enable interrupts after
the qpair is completely constructed. Otherwise, we can die with null
pointer dereferences in nvme_qpair_process_completions. This has been
observed in at least one pre-release NVMe drive where the MSIX interrupt
fired while the queue was being created, before we'd started the NVMe
controller card.
The alternative of only turning on the interrupts after the rest was
tried, but was insufficient to work around this bug and made the code
more complicated w/o benefit.
Reviewed by: mav, chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31182
New Samsung 980 SSDs report Namespace Preferred Write Alignment of
8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB).
My quick tests show that 16KB is the minimal sequential write size
at which the SSD reaches peak IOPS, so writing much less is very slow.
But writing slightly less or slightly more does not change much,
so it seems not so much a size granularity as minimum I/O size.
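For reference, converting those reported values to bytes (NPWA/NPWG are
in units of the LBA format in use, 512 bytes here):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned lba = 512;             /* LBA format in use */
        unsigned npwa = 8, npwg = 32;   /* both reported in logical blocks */

        printf("NPWA = %u bytes, NPWG = %u bytes\n", npwa * lba, npwg * lba);
        return (0);
    }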
Thinking about different stripesize consumers:
- Partition alignment should be based on NPWA by definition.
- ZFS ashift, insofar as it forces alignment of all I/Os, should also
be based on NPWA. Insofar as it forces size granularity, if really
needed, it may be set to NPWG, but too big a value can make ZFS too
space-inefficient, and 16KB is actually the biggest value supported
there now.
- ZFS recordsize/volblocksize could potentially be tuned up toward
NPWG to work as an I/O size granularity, but enabled compression makes
it too fuzzy. And those are normally user-configurable things.
- ZFS I/O aggregation code could definitely use an Optimal Write Size
value and maybe NPWG, but we don't have fields in GEOM now to report
the minimal and optimal I/O sizes, and even the maximal is not reported
outside GEOM DISK to be used by ZFS.
MFC after: 1 week
Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by the device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.
To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent. In addition, we re-sync the dma tag to
reflect changes since the prior loop for the bouncing dma case.
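A self-contained sketch of the two-step read (the record layout and names
are illustrative; the real code also byte-swaps the record and re-syncs
the busdma map as described above):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative 16-byte completion record; bit 0 of status is the phase. */
    struct toy_cpl {
        uint32_t cdw0;
        uint32_t rsvd;
        uint16_t sqhd;
        uint16_t sqid;
        uint16_t cid;
        uint16_t status;
    };

    static bool
    toy_get_completion(volatile const struct toy_cpl *slot, int phase,
        struct toy_cpl *out)
    {
        /* 1. Read only the status word and check the phase bit. */
        if ((slot->status & 1) != phase)
            return (false);     /* not ours yet; stop the loop */

        /* Keep the full read below from starting before the check. */
        atomic_thread_fence(memory_order_acquire);

        /*
         * 2. Now that we know the record is in phase, read the whole
         * record; everything used afterwards comes from this second read.
         */
        *out = *slot;
        return (true);
    }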
Reviewed by: jrtc27@, chuck@
Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31002
Remove __packed from nvme_command, nvme_completion and
nvme_dsm_trim. Add super-alignment to nvme_completion since it's always
at least that aligned in hardware (and in our existing uses of it
embedded in structures). It generates better code in
nvme_qpair_process_completions on riscv64 because otherwise the ABI
assumes a 4-byte alignment, and the same on all other platforms.
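A standalone illustration of the difference (a toy record, not the real
nvme_completion): __packed forces the ABI to assume byte alignment, while
dropping it and adding an explicit alignment lets the compiler use
full-width accesses.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy 16-byte records, the same size as an NVMe completion. */
    struct rec_packed {
        uint32_t a;
        uint32_t b;
        uint16_t c, d, e, f;
    } __attribute__((__packed__));

    struct rec_aligned {
        uint32_t a;
        uint32_t b;
        uint16_t c, d, e, f;
    } __attribute__((__aligned__(8)));

    int
    main(void)
    {
        printf("packed:  size %zu align %zu\n",
            sizeof(struct rec_packed), _Alignof(struct rec_packed));
        printf("aligned: size %zu align %zu\n",
            sizeof(struct rec_aligned), _Alignof(struct rec_aligned));
        return (0);
    }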
Reviewed by: jrtc27@, mav@, chuck@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31001
Put the { on the same line as the struct nvme_foo when we define these
structures. It's FreeBSD standard and these were inconsistent.
Sponsored by: Netflix
Part of the nvme recovery process for errors is to reset the
card. Sometimes, this results in failing the entire controller. When nda
is in use, we free the sim, which will sleep until all the I/O has
completed. However, with only one thread, the request fail task never
runs once the reset thread sleeps here. Create two threads to allow I/O
to fail until it's all processed and the reset task can proceed.
This is a temporary kludge until I can work out questions that arose
during the review, not least of which is what race queueing to a
failure task solved. The original commit is vague and other error paths
in the same context do a direct failure. I'll investigate that more
completely before committing a change to a direct failure. mav@
raised this issue during the review, but didn't otherwise object.
Multiple threads, though, solve the problem in the meantime until other
such means can be perfected.
Reviewed by: jhb@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30366
Under geom(4), nvme_ns_bio_process() is on a path where sleeping
is prohibited, as g_io_schedule_down() calls THREAD_NO_SLEEPING()
before geom->start().
Reviewed By: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29539
In some cases, such as broken hardware, nvme(4) may wait minutes for a
controller response before timing out. Doing so in a tight spin loop
made the whole system unresponsive.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29309
Sponsored by: iXsystems, Inc.
nvme drives are configured early in boot. However, a number of the
configuration steps take a while, so we defer those to a config intrhook
that runs before the root filesystem is mounted. At the same time, the
PCI hot plug code wakes
up and tests the status of the card. It may decide that the card has gone away
and deletes the child. As part of that process nvme_detach is called. If this
call happens after the config_intrhook starts to run, but before it is finished,
there's a race where we can tear down the device's soft state while the
config_intrhook is still using it. Use the new config_intrhook_drain to
disestablish the hook. Either it will be removed w/o running, or the routine
will wait for it to finish. This closes the race and allows safe hotplug at any
time, even very early in boot.
Sponsored by: Netflix, Inc
Reviewed by: jhb, mav
Differential Revision: https://reviews.freebsd.org/D29006
The NVMe byte-swap routines for big-endian platforms used memcpy() to
move the unaligned 64-bit value into a temp register to byte swap it.
Instead of introducing a dependency, manually byte-swap the values in
place.
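A self-contained sketch of swapping a 64-bit field in place, byte by
byte, so no aligned temporary (and no memcpy) is needed:

    #include <stdint.h>
    #include <stdio.h>

    /* Reverse the 8 bytes of a possibly unaligned 64-bit field in place. */
    static void
    swap64_inplace(void *p)
    {
        uint8_t *b = p;

        for (int i = 0; i < 4; i++) {
            uint8_t t = b[i];

            b[i] = b[7 - i];
            b[7 - i] = t;
        }
    }

    int
    main(void)
    {
        /* A 64-bit field at an odd (unaligned) offset inside a buffer. */
        uint8_t buf[9] = { 0, 1, 2, 3, 4, 5, 6, 7, 8 };

        swap64_inplace(&buf[1]);
        for (int i = 1; i < 9; i++)
            printf("%02x", buf[i]);
        printf("\n");   /* prints 0807060504030201 */
        return (0);
    }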
Pointy hat to: me
Add decoding of the Device Self-test log page and the ability to start
or abort a test.
Reviewed by: imp, mav
Tested by: Muhammad Ahmad <muhammad.ahmad@seagate.com>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D27517
bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can
be DMA'd immediately after it returns. The details differ, but this
mirrors atomic thread release semantics, at least for the buffers
synced.
For non-x86 platforms, bus_dmamap_sync() has the right syncing and
fences. So in the past, wmb() had been omitted for them.
For x86 platforms, the memory ordering is already strong enough to
ensure DMA to the device sees the current contents. As such, we don't
need the wmb() here. It translates to an sfence which is only needed
for writes to regions that have the write combining attribute set or
when some exotic opcodes are used. The nvme driver does neither of
these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we
can be assured any optimizer won't reorder the bus_dmamap_sync and the
bus_space_write operations. The wmb() was a vestige of the pre-busdma
version initially committed to the tree.
Reviewed by: kib@, gallatin@, chuck@, mav@
Differential Revision: https://reviews.freebsd.org/D27448
- in nvme_qpair_process_completions(), do the dma sync before the completion
buffer is used.
- in nvme_qpair_submit_tracker(), don't do an explicit wmb() for arm and
arm64 either. bus_dmamap_sync() on these architectures is sufficient to ensure
that all CPU stores are visible to external (including DMA) observers.
- Allocate the completion buffer as BUS_DMA_COHERENT. On non-DMA-coherent
systems, buffers continuously owned (and accessed) by DMA must be allocated
with this flag. Note that the BUS_DMA_COHERENT flag is a no-op on DMA-coherent
systems (or coherent buses in mixed systems).
MFC after: 4 weeks
Reviewed by: mav, imp
Differential Revision: https://reviews.freebsd.org/D27446
These swapping functions violate the busdma contract: we cannot write
to buffers that have been armed by bus_dmamap_sync(PRE_...). Remove them,
at least from little-endian machines, until a better solution is developed.
Reviewed by: imp
MFC after: 3 weeks
With a 4KB page size, 2MB is the maximum we can address with a one-page
PRP list. Going further would require chaining, which would add some more
complexity. On the other side, to reduce memory consumption, allocate the
PRP memory respecting the maximum transfer size reported in the controller
identify data. Many NVMe devices support much smaller values, starting
from 128KB. To do that we have to change the initialization sequence to
pull the data earlier, before setting up the I/O queue pairs. The admin
queue pair is still allocated for the full MIN(maxphys, 2MB) size, but
that is not a big deal, since there is only one such queue with only 16
trackers.
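A quick check of the arithmetic behind the 2MB limit and the savings
(4KB pages, 8-byte PRP entries):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned page = 4096, prp_entry = 8;
        unsigned per_page = page / prp_entry;   /* 512 PRP entries per page */

        /* One page of PRP entries maps 512 * 4KB = 2MB of data... */
        printf("max transfer with a one-page PRP list: %u bytes\n",
            per_page * page);
        /* ...while a 128KB max transfer size needs only 32 entries. */
        printf("entries needed for 128KB: %u\n", (128u * 1024) / page);
        return (0);
    }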
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in the linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, and
dev/siis were either submitted by, or based on changes by, mav.
Suggested by: mav (*)
Reviewed by: imp, mav, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
Both the size (128 bytes) and ephemeral nature of allocations make it a great
fit for malloc.
A dedicated zone unnecessarily avoids sharing buckets with 128-byte objects.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D27103
Remove code that supported pre-2011 kernels. CTLTYPE_S64 was defined
in rev 217616. All supported branches have it, so remove its compat
definition as OBE.
Instead, add an argument to vmapbuf. Since this argument is
always a pointer, use a type of void * and cast to vm_offset_t in
vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)
In no other situation does b_data contain a user pointer, and vmapbuf
replaces b_data with the actual mapping.
Suggested by: jhb
Reviewed by: imp, jhb
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26784