run it). Make sure that we do. Simplify the flow a bit, and fix a
comment since we do need to do these things.
Noticed by: cperciva (not sure why my invariants kernel didn't trigger)
- maxio should be dp->d_maxsize. This is often MAXPHYS, but not always
(especially if MAXPHYS is > 1MB).
- Unlock the periph before returning. We don't need to relock it to
release the ccb.
- Make sure we release the ccb in error paths.
Reviewed by: cperciva
With these two ioctls implemented in the nda driver, nvmecontrol now
works with nda just like it does with nvd. It eliminates the need to
jump through odd hoops to get this data.
Two arguments were reversed in calls to cam_strvis() in
nvme_da.c. This was found by a Coverity scan of this code within Dell
(Isilon). These are also marked in the FreeBSD Coverity scan as CIDs
1400526 & 1400531.
Submitted by: robert.herndon@dell.com
Reviewed by: vangyzen@, imp@
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24117
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in
drivers when the driver doesn't support a particular command. Do a
sweep and fix that.
Reported by: imp
Add two sysctls to control pacing of nvme
trims. kern.cam.nda.X.goal_trim is the number of upper layer
BIO_DEELETE requests to try to collecet before sending TRIM down too
the nvme drive. trim_ticks is the number of ticks, at mosot, to wait
for at least goal_trim BIOS_DELEETE requests to come in.
Trim pacing is useful when a large number off disjoint trims are
comoing in from the upper layers. Since we have no way to chain
toogether trims from the upper layers that are sent down, this acts as
a hueristic to group trims into reasonable sized chunks. What's
reasonable varies from drive to drive.
Sponsored by: Netflix
via 'diskinfo -v'. This avoids the need to track it down via CAM,
and should also work for disks that don't use CAM. And since it's
inherited thru the GEOM hierarchy, in most cases one doesn't need
to walk the GEOM graph either, eg you can use it on a partition
instead of disk itself.
Reviewed by: allanjude, imp
Sponsored by: Klara Inc
Differential Revision: https://reviews.freebsd.org/D22249
Clang trunk recently gained this new warning, and complains about the
sizeof(trim->data) / sizeof(struct nvme_dsm_range) expression, since
the left hand side's element type (char) does not match the right hand
side's type. The byte buffer is unnecessary so we can remove it to clean
up the code and fix the warning at the same time.
No functional change.
Submitted by: James Clarke <jrtc27@jrtc27.com>
Reviewed by: imp
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D21912
XPT_DEV_ADVINFO call should be protected by the lock of the specific
device it is addressed to, not the lock of SES device. In some weird
case, probably with hardware violating standards, it sometimes caused
NULL dereference due to race.
To protect from it further, add lock assertion to *_dev_advinfo().
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differentiate between PCI Express Endpoint devices and Root Complex
Integrated Endpoints in the nda driver. The Link Status and Capability
registers are not valid for Integrated Endpoints and should not be
displayed. The bhyve emulated NVMe device will advertise as being an
Integrated Endpoint.
Reviewed by: imp
Approved byL imp (mentor)
Differential Revision: https://reviews.freebsd.org/D20282
Before this I suppose it was impossible load CAM-based NVMe as module.
Plus this appeared to be needed to build r345815 without NVMe driver.
MFC after: 2 weeks
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.
Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support
BIO_DELETE) if the drive supports Dataset Management. There are reports
that without this check, VMWare Workstation does not work reliably.
Fix is to check the ONCS field in the NVMe Controller Data structure for
support. This check previously existed but did not survive the
big-endian changes.
Reported by: yuripv@yuripv.net
Reviewed by: imp, mav, jimharris
Approved by: imp (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18493
Add a counter for the LBAs, Ranges and hardware commands so that we
can provide additional color to the statistics we provide to vendors.
Sponsored by: Netflix, Inc
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
Bump __FreeBSD_version to 1200081 due to API change
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently. I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).
Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it. With 12-related activities coming up,
while that is ongoing, just take this back for now.
[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path. Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system. This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.
(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic. But most
NICs, etc. don't try to send spin-down commands at shutdown. ;-))
Discussed with: imp, markj
No functional change.
Note that this change is careful to set the CCB header xflags after
foo_fill_bar() routines, which generally zero existing flags. An earlier
version of this patch mistakenly set the flag before the fill routines.
Submitted by: Scott Ferris <sferris AT isilon.com>, jhibbits@
Reviewed by: bdrewery@, markj@, and non-committer FreeBSD contributor Anton Rang
Sponsored by: Dell EMC Isilon
It's likely that the header was needed in the past for swi(9).
But now that code does not use swi(9) or any other interfaces defined
in sys/interrupt.h.
MFC after: 1 week
- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
- Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
- Add initial support for namespace change async events. So far only
in CAM mode, but it allows run-time namespace arrival and departure.
- Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.
Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.
Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.
We're dropping the periph lock then dropping the refcount. However,
that violates the locking protocol and is racy. This seems to be
the cause of weird occasional panics with a bogus assert.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15517
When a disk disappears and the periph is invalidated, any I/Os that
are pending with the controller can cause a crash when they
complete. Move to holding the softc reference count taken in dastart()
until the I/O is complete rather than only until xpt_action()
returns. (This approach was suggested by Ken Merry.) This extends
the method used in da to ada, nda, and mda.
Sponsored by: Netflix
Submitted by: Chuck Silvers
latter casts the LBA to a 32-bit number before assigning it to the 64
bit structure entity. This works fine on the first 2TB of TRIMs, but
terrible beyond that due to trucation.
Also, add an assert to make sure we don't end too many DSM TRIM
entries in one request.
Sponsored by: Netflix
kern.cam.{,a,n}da.X.invalidate=1 forces *daX to detach by calling
cam_periph_invalidate on the underlying periph. This is for testing
purposes only. Include only with options CAM_TEST_FAILURE and rename
the former [AN]DA_TEST_FAILURE, and fix nda to compile with it set.
We're using it at work to harden geom and the buffer cache to be
resilient in the face of drive failure. Today, it far too often
results in a panic. While much work was done on SIM initiated removal
for the USB thumnb drive removal work, little has been done for periph
initiated removal. This simulates what *daerror() does for some errors
nicely: we get the same panics with it that we do with failing drives.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14581
When multiple trims are in the queue, collapse them as much as
possible. At present, this usually results in only a few trims being
collapsed together, but more work on that will make it possible to do
hundreds (up to some configurable max).
Sponsored by: Netflix
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.
NVMe is now working on powerpc64 host.
Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
Now that we're queueing BIO_DELETE requests in the CAM I/O scheduler,
it make sense to try to combine as many as possible into a single
request to send down to hardware. Hopefully, lots of larger requests
like this are better than lots of individual transactions.
Note for future: need to limit based on total size of the trim
request. Should also collapse adjacent ranges where possible to
increase the size of the max payload.
Sponsored by: Netflix
Introduce flags word to describe the capacities of the peripheral.
First bit will describe if the periph driver allows multiple
outstanding TRIMS to be active in a device.
Modify the I/O scheduler so that the nda driver can queue trims
for a while after the first one arrives. We'll queue until we see
a I/O scheduler tick, then we'll schedule as many TRIMs as allowed
by other factors (currently this is slocts in the NVMe controller).
This mariginally helps the read latency issues we see with reads,
but sets the stage for the nda driver to do TRIM collapsing like the
da and ada drivers do today.
Sponsored by: Netflix
There's no compelling reason to return a cam_status type for this
function and doing so only creates confusion with normal C
coding practices. It's technically an API change, but the periph API
isn't widely used. No efffective change to operation.
Reviewed by: imp, mav, ken
Sponsored by: Netflix
Differential Revision: D14063
here. Return ENOMEM when we can't malloc a buffer for the DSM
TRIM. This should fix the WITNESS warnings similar to the following:
uma_zalloc_arg: zone "16" with the following non-sleepable locks held:
exclusive sleep mutex CAM device lock (CAM device lock) r = 0 (0xfffff800080c34d0) locked @ /usr/src/sys/cam/nvme/nvme_da.c:351
Reviewed by: scottl@
Sponsored by: Netflix