If a disk's SIM doesn't support polling, then it can't be used to
store crashdumps. Leave d_dump NULL in that case so that dumpon(8)
fails gracefully rather than having dumps fail at crash time.
Reviewed by: scottl, mav, imp
MFC after: 2 weeks
Sponsored by: Chelsio
Differential Revision: https://reviews.freebsd.org/D28454
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
It allows to report GEOM::lunid for nda(4) same as for nvd(4). Since
NVMe now allows multiple LUs (namespaces) with multiple paths unique
LU identification is important. The serial_num field is filled same
as before with the controller serial number, while device_id is based
on namespace GUID and/or EUI64 fields as recommended by "NVM Express:
SCSI Translation Reference" and matching nvd(4) at the end.
MFC after: 1 week
Some automation tries to detect if nvd or nda is in used, and the presence of
both confuses it. Provide a knob to turn off nvd alias creation
(kern.cam.nda.nvd_compat=0) for these situations. The default is the same:
create the nvd compat link.
run it). Make sure that we do. Simplify the flow a bit, and fix a
comment since we do need to do these things.
Noticed by: cperciva (not sure why my invariants kernel didn't trigger)
- maxio should be dp->d_maxsize. This is often MAXPHYS, but not always
(especially if MAXPHYS is > 1MB).
- Unlock the periph before returning. We don't need to relock it to
release the ccb.
- Make sure we release the ccb in error paths.
Reviewed by: cperciva
With these two ioctls implemented in the nda driver, nvmecontrol now
works with nda just like it does with nvd. It eliminates the need to
jump through odd hoops to get this data.
Two arguments were reversed in calls to cam_strvis() in
nvme_da.c. This was found by a Coverity scan of this code within Dell
(Isilon). These are also marked in the FreeBSD Coverity scan as CIDs
1400526 & 1400531.
Submitted by: robert.herndon@dell.com
Reviewed by: vangyzen@, imp@
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24117
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in
drivers when the driver doesn't support a particular command. Do a
sweep and fix that.
Reported by: imp
Add two sysctls to control pacing of nvme
trims. kern.cam.nda.X.goal_trim is the number of upper layer
BIO_DEELETE requests to try to collecet before sending TRIM down too
the nvme drive. trim_ticks is the number of ticks, at mosot, to wait
for at least goal_trim BIOS_DELEETE requests to come in.
Trim pacing is useful when a large number off disjoint trims are
comoing in from the upper layers. Since we have no way to chain
toogether trims from the upper layers that are sent down, this acts as
a hueristic to group trims into reasonable sized chunks. What's
reasonable varies from drive to drive.
Sponsored by: Netflix
via 'diskinfo -v'. This avoids the need to track it down via CAM,
and should also work for disks that don't use CAM. And since it's
inherited thru the GEOM hierarchy, in most cases one doesn't need
to walk the GEOM graph either, eg you can use it on a partition
instead of disk itself.
Reviewed by: allanjude, imp
Sponsored by: Klara Inc
Differential Revision: https://reviews.freebsd.org/D22249
Clang trunk recently gained this new warning, and complains about the
sizeof(trim->data) / sizeof(struct nvme_dsm_range) expression, since
the left hand side's element type (char) does not match the right hand
side's type. The byte buffer is unnecessary so we can remove it to clean
up the code and fix the warning at the same time.
No functional change.
Submitted by: James Clarke <jrtc27@jrtc27.com>
Reviewed by: imp
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D21912
XPT_DEV_ADVINFO call should be protected by the lock of the specific
device it is addressed to, not the lock of SES device. In some weird
case, probably with hardware violating standards, it sometimes caused
NULL dereference due to race.
To protect from it further, add lock assertion to *_dev_advinfo().
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differentiate between PCI Express Endpoint devices and Root Complex
Integrated Endpoints in the nda driver. The Link Status and Capability
registers are not valid for Integrated Endpoints and should not be
displayed. The bhyve emulated NVMe device will advertise as being an
Integrated Endpoint.
Reviewed by: imp
Approved byL imp (mentor)
Differential Revision: https://reviews.freebsd.org/D20282
Before this I suppose it was impossible load CAM-based NVMe as module.
Plus this appeared to be needed to build r345815 without NVMe driver.
MFC after: 2 weeks
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.
Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support
BIO_DELETE) if the drive supports Dataset Management. There are reports
that without this check, VMWare Workstation does not work reliably.
Fix is to check the ONCS field in the NVMe Controller Data structure for
support. This check previously existed but did not survive the
big-endian changes.
Reported by: yuripv@yuripv.net
Reviewed by: imp, mav, jimharris
Approved by: imp (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18493
Add a counter for the LBAs, Ranges and hardware commands so that we
can provide additional color to the statistics we provide to vendors.
Sponsored by: Netflix, Inc
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
Bump __FreeBSD_version to 1200081 due to API change
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently. I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).
Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it. With 12-related activities coming up,
while that is ongoing, just take this back for now.
[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path. Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system. This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.
(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic. But most
NICs, etc. don't try to send spin-down commands at shutdown. ;-))
Discussed with: imp, markj
No functional change.
Note that this change is careful to set the CCB header xflags after
foo_fill_bar() routines, which generally zero existing flags. An earlier
version of this patch mistakenly set the flag before the fill routines.
Submitted by: Scott Ferris <sferris AT isilon.com>, jhibbits@
Reviewed by: bdrewery@, markj@, and non-committer FreeBSD contributor Anton Rang
Sponsored by: Dell EMC Isilon
It's likely that the header was needed in the past for swi(9).
But now that code does not use swi(9) or any other interfaces defined
in sys/interrupt.h.
MFC after: 1 week
- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
- Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
- Add initial support for namespace change async events. So far only
in CAM mode, but it allows run-time namespace arrival and departure.
- Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.
Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.
Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.
We're dropping the periph lock then dropping the refcount. However,
that violates the locking protocol and is racy. This seems to be
the cause of weird occasional panics with a bogus assert.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15517
When a disk disappears and the periph is invalidated, any I/Os that
are pending with the controller can cause a crash when they
complete. Move to holding the softc reference count taken in dastart()
until the I/O is complete rather than only until xpt_action()
returns. (This approach was suggested by Ken Merry.) This extends
the method used in da to ada, nda, and mda.
Sponsored by: Netflix
Submitted by: Chuck Silvers
latter casts the LBA to a 32-bit number before assigning it to the 64
bit structure entity. This works fine on the first 2TB of TRIMs, but
terrible beyond that due to trucation.
Also, add an assert to make sure we don't end too many DSM TRIM
entries in one request.
Sponsored by: Netflix
kern.cam.{,a,n}da.X.invalidate=1 forces *daX to detach by calling
cam_periph_invalidate on the underlying periph. This is for testing
purposes only. Include only with options CAM_TEST_FAILURE and rename
the former [AN]DA_TEST_FAILURE, and fix nda to compile with it set.
We're using it at work to harden geom and the buffer cache to be
resilient in the face of drive failure. Today, it far too often
results in a panic. While much work was done on SIM initiated removal
for the USB thumnb drive removal work, little has been done for periph
initiated removal. This simulates what *daerror() does for some errors
nicely: we get the same panics with it that we do with failing drives.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14581
When multiple trims are in the queue, collapse them as much as
possible. At present, this usually results in only a few trims being
collapsed together, but more work on that will make it possible to do
hundreds (up to some configurable max).
Sponsored by: Netflix
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.
NVMe is now working on powerpc64 host.
Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916