Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.
Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
r212160 tightened this from always using MSG_SIMPLE_Q_TAG to always
MSG_ORDERED_Q_TAG. Since it also marked all BIO_FLUSH requests with
BIO_ORDERED, this commit changes nothing immediately, but it returns
BIO_FLUSH callers ability to actually specify ordering they really
need, alike to other request types.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
- Add ADA_Q_NO_TRIM quirk to be used with the device that falsely advertise TRIM support
- Add ADA_Q_NO_TRIM entry for KingDian S200 SSD
PR: 222802
Submitted by: Bertrand Petit <bsdpr@phoe.frmug.org>
MFC after: 1 week
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
It has been reported that on some systems (with real hardware passed
through to a virtual machine) the WP detection causes USB disk probing
failures.
While here, also fix the selection of the next state in the case
of malloc failure in DA_STATE_PROBE_WP. It was DA_STATE_PROBE_RC
unconditionally even when it should have been DA_STATE_PROBE_RC16.
PR: 225794
Reported by: David Boyd <David.Boyd49@twc.com>
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D18496
In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support
BIO_DELETE) if the drive supports Dataset Management. There are reports
that without this check, VMWare Workstation does not work reliably.
Fix is to check the ONCS field in the NVMe Controller Data structure for
support. This check previously existed but did not survive the
big-endian changes.
Reported by: yuripv@yuripv.net
Reviewed by: imp, mav, jimharris
Approved by: imp (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18493
Add the ability to set two goals for trims in the I/O scheduler. The
first goal is the number of BIO_DELETEs to accumulate
(kern.cam.XX.U.trim_goal). When non-zero, this many trims will be
accumulated before we start to transfer them to lower layers. This is
useful for devices that like to get lots of trims all at once in one
transaction (not all devices are like this, and some vary by workload).
The second is a number of ticks to defer trims. If you've set a trim
goal, then kern.cam.XX.U.trim_ticks controls how long the system will
defer those trims before timing out and sending them anyway. It has no
effect when trim_goal is 0.
In any event, a BIO_FLUSH will cause all the TRIMs to be released to
the periph drivers. This may be a minor overloading of what BIO_FLUSH
is supposed to mean, but it's useful to preserve other ordering
semantics that users of BIO_FLUSH reply on.
Sponsored by: Netflix, Inc
It's often useful to have a callback when an I/O takes more than a
threshold amount of time. This adds the infrastructure for periph
devices to register one.
One use-case is as a debugging aide when you need a semi-realtime
indication of an I/O outlier so you can trigger bus capture gear for
vendor analysis.
Sponsored by: Netflix, Inc
Many async events that we see are called for this specific path. When
calling an async callback for a targetted device, XTP will lock that
specific device's path lock (same as what cam_periph_lock does). For
those AC_ events, assert we have the lock rather than trying to
recusrively take it (which causes panics since it's not recursive).
Add annotations about this and about the fact that AC_SCSI_AEN events
are generated now only in the ata stack (which cannot have a scsi_da
attachment). Leave it in place in case I've overlooked something as
the code is harmless.
This is fallout from my attempts to "fix" locking for softc->flags in
r330796 that's not been triggered often enough to get my attention
until now.
Sponsored by: Netflix
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D17837
Add a counter for the LBAs, Ranges and hardware commands so that we
can provide additional color to the statistics we provide to vendors.
Sponsored by: Netflix, Inc
scsi_low was a common set of routines to do the SCSI bus sequencing
for the ncv, nsp and stg drivers. Those have been removed, so it's no
longer needed since nothing else in the tree uses it and nothing
likely ever will (it's for super-low-end 8-bit parallel SCSI cards).
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other members which require no translation.
Reviewed by: kib (prior version), jhb
Approved by: re (rgrimes)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17378
In the probe case for SCSI SMR Host Aware or Most Managed drives, be sure
to free allocated memory.
sys/cam/scsi/scsi_da.c:
In dadone_probezone(), free the data pointer before returning.
MFC after: 3 days
Sponsored by: Spectra Logic
Approved by: re (kib)
of the fix/workaround for the "ctld hanging on reload" problem.
PR: 220175
Reported by: Eugene M. Zheganin <emz at norma.perm.ru>
Tested by: Eugene M. Zheganin <emz at norma.perm.ru>
Approved by: re (kib)
MFC after: 2 weeks
Sponsored by: playkey.net
Somehow this was working even after PTI in, at least on amd64, and got
broken by something only very recently.
Reviewed by: araujo
Approved by: re (gjb)
for the "ctld hanging on reload" problem observed in same cases under
high load. I'm not 100% sure it's _the_ fix, as the issue is rather hard
to reproduce, but it was tested as part of a larger path and the problem
disappeared. It certainly shouldn't break anything.
Now, technically, it shouldn't be needed. Quoting mav@, "After
ct->ct_online == 0 there should be no new sessions attached to the target.
And if you see some problems abbout it, it may either mean that there are
some races where single cfiscsi_session_terminate(cs) call may be lost,
or as a guess while this thread was sleeping target was reenabbled and
redisabled again". Should such race be discovered and properly fixed
in the future, than this and the followup two commits can be backed out.
PR: 220175
Reported by: Eugene M. Zheganin <emz at norma.perm.ru>
Tested by: Eugene M. Zheganin <emz at norma.perm.ru>
Discussed with: mav
Approved by: re (gjb)
MFC after: 2 weeks
Sponsored by: playkey.net
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
Bump __FreeBSD_version to 1200081 due to API change
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
race condition, due to a missing call to cfiscsi_target_release().
Discussed with: mav@
Tested by: Eugene M. Zheganin <emz at norma.perm.ru> (earlier version)
MFC after: 2 weeks
Sponsored by: playkey.net
xpt_sim_poll takes the sim to poll as an argument. It will do the
proper locking protocol, call the SIM polling routine, and then call
camisr_runqueue to process completions on any CCBs the SIM's poll
routine completed. It will be used during late shutdown when a SIM is
waiting for CCBs it sent during shutdown to finish and the scheduler
isn't running because we've panic'd.
This sequence was used twice in cam_xpt, so refactor those to use this
new function.
Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D16663
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently. I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).
Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it. With 12-related activities coming up,
while that is ongoing, just take this back for now.
[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path. Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system. This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.
(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic. But most
NICs, etc. don't try to send spin-down commands at shutdown. ;-))
Discussed with: imp, markj
No functional change.
Note that this change is careful to set the CCB header xflags after
foo_fill_bar() routines, which generally zero existing flags. An earlier
version of this patch mistakenly set the flag before the fill routines.
Submitted by: Scott Ferris <sferris AT isilon.com>, jhibbits@
Reviewed by: bdrewery@, markj@, and non-committer FreeBSD contributor Anton Rang
Sponsored by: Dell EMC Isilon
read bias so we do reads in preference to TRIMs. This helps a lot when
many trims are delivered at once from the upper layers as they tend to
delay READs due to priority inversion in the code today.
The non iosched case will be fixed when the trim comibing changes
needed for nvme come in later this year.
Sponsored by: Netflix
We've got a set of probably damaged hard disks, reporting 0x04,0x02
("Logical unit not ready, initializing command required") in response
to READ CAPACITY(16), where attempts to use START STOP UNIT for recovery
results in 0x44,0x00 ("Internal target failure") after ~1 second delay.
As result of all recovery retries, device open attempt took ~3 seconds
before finally reporting to GEOM that device is opened, but has no media.
If the open was for writing and since it hasn't formally failed, following
close triggered GEOM retaste, opening device few more times with respective
delays.
This change reduces whole time of this cycle from ~12 seconds to ~3 by
giving up on recovery after the first failure.
Reviewed by: ken
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
It's likely that the header was needed in the past for swi(9).
But now that code does not use swi(9) or any other interfaces defined
in sys/interrupt.h.
MFC after: 1 week
Unlike SD cards, that publish RCA in response to CMD3,
MMC cards expect the host to set RCA itself.
Since we don't support multiple MMC cards on the bus,
just assign a static RCA of 2 to the attached MMC card.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D13063
Regulator framework doens't like turning off already turned off
regulators, so we get panic on AllWinner boards.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15890
Lower layers (MMC / SDHCI controller drivers) may make certain decisions
based on the presence of this flag. The fact that sdhci.c doesn't
look at this flag is another problem that should be fixed separately.
Found when adding MMCCAM support to AllWinner MMC controller driver
where the presence of this flag actually matters.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15888
Probing host aware and host managed SMR drives got broken in revision
330796.
The added cam_periph_lock() calls were in areas in dadone() where
the peripheral lock was already held.
Since then, dadone() has been split into separate functions that are
dedicated to each probe state.
The result is that when probing a host aware drive, I ran into a recursive
lock acquisition in dadone_probeatalogdir(). I would have run into the
same problem in dadone_probeataiddir(), and in dadone_probeatasup() and
dadone_probeatazone() in the error paths had the probe continued.
The solution is to take out all of the extra cam_periph_lock() calls. I
also added cam_periph_assert(periph, MA_OWNED) near the top of each of
the dadone_* calls. These make it clear to anyone coming along in the
the future that the lock is held in the probe done functions.
Also add a locking assert in daprobedone(), to make it clear that it must
be called with the periph lock held.
Sponsored by: Spectra Logic
Differential Revision: https://reviews.freebsd.org/D15764
Increasing operating frequency without telling card to switch
to high-speed mode first upsets some cards and generates CRC errors.
While here, deselect / reselect cards after CMD6 and SCR fetch, as in original code.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15568
- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
- Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
- Add initial support for namespace change async events. So far only
in CAM mode, but it allows run-time namespace arrival and departure.
- Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.
Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.
Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.
We're dropping the periph lock then dropping the refcount. However,
that violates the locking protocol and is racy. This seems to be
the cause of weird occasional panics with a bogus assert.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15517
For MMC cards, add partitions found on the card as separate disk(9) devices.
Don't do anything with RPMB partition for now.
Lots of code is copied almost 1:1 from the mmcsd.c in the old stack,
credits Marius Strobl (marius@FreeBSD.org)
Reviewed by: marius
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12762
When a disk disappears and the periph is invalidated, any I/Os that
are pending with the controller can cause a crash when they
complete. Move to holding the softc reference count taken in dastart()
until the I/O is complete rather than only until xpt_action()
returns. (This approach was suggested by Ken Merry.) This extends
the method used in da to ada, nda, and mda.
Sponsored by: Netflix
Submitted by: Chuck Silvers
When a disk disappears and the periph is invalidated, any I/Os that
are pending with the controller can cause a crash when they
complete. Move to holding the softc reference count taken in dastart()
until the I/O is complete rather than only until xpt_action()
returns. (This approach was suggested by Ken Merry.)
Sponsored by: Netflix
Submitted by: Chuck Silvers
Differential Revision: https://reviews.freebsd.org/D15435
- Remove unused or dead store variable
- Remove unused function ctl_copyin_alloc
- Add missing curly brackets, this seems a regression in r287720
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D15383
ioctl frontend ports.
This revision introduces two changes to CTL:
- Changes the way options are passed to CTL_LUN_REQ and CTL_PORT_REQ ioctls.
Removes ctl_be_arg structure and associated logic and replaces it with
nv(3)-based logic for passing in and out arguments.
- Allows creating multiple ioctl frontend ports using either ctladm(8) or
ctld(8).
New frontend ports are represented by /dev/cam/ctl<pp>.<vp> nodes, eg /dev/cam/ctl5.3.
Those device nodes respond only to CTL_IO ioctl.
New command-line options for ctladm:
# creates new ioctl frontend port with using free pp and vp=0
ctladm port -c
# creates new ioctl frontend port with pp=10 and vp=0
ctladm port -c -O pp=10
# creates new ioctl frontend port with pp=11 and vp=12
ctladm port -c -O pp=11 -O vp=12
# removes port with number 4 (it's a "targ_port" number, not pp number)
ctladm port -r -p 4
New syntax for ctl.conf:
target ... {
port ioctl/<pp>
...
}
target ... {
port ioctl/<pp>/<vp>
...
Note: Most of this work was made by jceel@, thank you.
Submitted by: jceel
Reworked by: myself
Reviewed by: mav (earlier versions and recently during the rework)
Obtained from: FreeNAS and TrueOS
Relnotes: Yes
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D9299
a 1500 line switch statement. Callers now specify a discrete completion
handler, though they're still welcome to track state via ccb_state.
Sponsored by: Netflix
pointer. It's now unhelpful and misleading for callers to continue to set
it, so bring all callers into conformance. There's no real functional change,
but it makes reading the code a lot less confusing.
Sponsored by: Netflix
When the tape position is inside the Early Warning area, the tape
drive will return a sense key of NO SENSE, and an ASC/ASCQ of
0x00,0x02, which means: End-of-partition/medium detected". If
this was in response to a control command like WRITE FILEMARKS,
we correctly translate this as informational status and return
0 from saerror().
Programmable Early Warning should be handled the same way, but
we weren't handling it that way. As a result, if a PEW status
(sense key of NO SENSE, ASC/ASCQ of 0x00,0x07, "Programmable early
warning detected") came back in response to a WRITE FILEMARKS,
we returned an error.
The impact of this was that if an application was writing to a
sa(4) device, and a PEW area was set (in the Device Configuration
Extension subpage -- mode page 0x10, subpage 1), and a filemark
needed to be written on close, we could wind up returning an error
to the user on close because of a "failure" to write the filemarks.
It actually isn't a failure, but rather just a status report from
the drive, and shouldn't be treated as a failure.
sys/cam/scsi/scsi_sa.c:
For control commands in saerror(), treat asc/ascq 0x00,0x07
the same as 0x00,{0-5} -- not an error. Return 0, since
the command actually did succeed.
Reported by: Dr. Andreas Haakh <andreas@haakh.de>
Tested by: Dr. Andreas Haakh <andreas@haakh.de>
Sponsored by: Spectra Logic
MFC after: 3 days
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
have one pending. Otherwise, we can race and send two, which is
wasteful in close proximity. It can also cause the acaquire/release
count for TUR to be > 1, which is undexpected.
PR: 226510
Differential Review: https://reviews.freebsd.org/D14792
do this right, except when there's no BP and we do a TUR by request.
In that case, we clear the flag, but don't release the reference,
leaking the reference on rare occasion.
PR: 226510
Sponsored by: Netflix
<machine/stdarg.h> is a kernel-only header. The standard header for
userland is <stdarg.h>. Using the standard header in userland avoids
weird build errors when building with external compilers that include
their own stdarg.h header.
Reviewed by: arichardson, brooks, imp
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D14776
latter casts the LBA to a 32-bit number before assigning it to the 64
bit structure entity. This works fine on the first 2TB of TRIMs, but
terrible beyond that due to trucation.
Also, add an assert to make sure we don't end too many DSM TRIM
entries in one request.
Sponsored by: Netflix
The ch(4) driver was missed in change 328918, which changed
cam_periph_acquire() to return an errno instead of cam_status.
As a result, ch(4) failed to attach.
Sponsored by: Spectra Logic
softc for an async unit attention. CAM locks, sometimes, the periph
lock and other times does not. We were taking the lock always and
running into lock recursion issues on a non-recursive lock. Now we
take it selectively. It's not clear why xpt takes the lock selectively
before calling us, though, and that's still under investigation.
Reported by: avg
PR: 226510 (same panic, differnt circumstances)
Sponsored by: Netflix
The crash scenario goes like this: there's a thread waiting on "reinstate";
because it doesn't update the timeout counter it gets terminated by the
callout; at this point the maintenance thread starts the termination routine.
The first thread finishes waiting, proceeds to icl_conn_handoff(), and drops
the refcount, which allows the maintenance thread to free its resources. At
this point another thread receives a PDU. Boom.
PR: 222898, 219866
Reported by: Eugene M. Zheganin <emz at norma.perm.ru>
Tested by: Eugene M. Zheganin <emz at norma.perm.ru>
Reviewed by: mav@ (earlier version)
MFC after: 2 weeks
Sponsored by: playkey.net
kern.cam.{,a,n}da.X.invalidate=1 forces *daX to detach by calling
cam_periph_invalidate on the underlying periph. This is for testing
purposes only. Include only with options CAM_TEST_FAILURE and rename
the former [AN]DA_TEST_FAILURE, and fix nda to compile with it set.
We're using it at work to harden geom and the buffer cache to be
resilient in the face of drive failure. Today, it far too often
results in a panic. While much work was done on SIM initiated removal
for the USB thumnb drive removal work, little has been done for periph
initiated removal. This simulates what *daerror() does for some errors
nicely: we get the same panics with it that we do with failing drives.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14581
When multiple trims are in the queue, collapse them as much as
possible. At present, this usually results in only a few trims being
collapsed together, but more work on that will make it possible to do
hundreds (up to some configurable max).
Sponsored by: Netflix
When the ccb is NULL to cam_iosched_bio_complete, just update the
other statistics, but not the time. If many operations are collapsed
together, this is needed to keep stats properly for the grouped bp.
This should fix trim accounting.
Sponsored by: Netflix
The ioctl objects contain pointers and require translation and some
refactoring of the infrastructure to work. For now prevent opertion
on garbage values. This is very slightly overbroad in that ENCIOC_INIT
is safe.
Reviewed by: imp, kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14671
These take a union ccb argument which is full of kernel pointers.
Substantial translation efforts would be required to make this work.
By rejecting the request we avoid processing or returning entierly
wrong data.
Reviewed by: imp, ken, markj, cem
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14654
Make sure the periph lock is held around rmw access to softc data,
espeically flags, including work flags in iosched.
Add asserts for the periph lock where it should be held.
PR: 226510
Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D14456
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.
NVMe is now working on powerpc64 host.
Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
Now that we're queueing BIO_DELETE requests in the CAM I/O scheduler,
it make sense to try to combine as many as possible into a single
request to send down to hardware. Hopefully, lots of larger requests
like this are better than lots of individual transactions.
Note for future: need to limit based on total size of the trim
request. Should also collapse adjacent ranges where possible to
increase the size of the max payload.
Sponsored by: Netflix
Introduce flags word to describe the capacities of the peripheral.
First bit will describe if the periph driver allows multiple
outstanding TRIMS to be active in a device.
Modify the I/O scheduler so that the nda driver can queue trims
for a while after the first one arrives. We'll queue until we see
a I/O scheduler tick, then we'll schedule as many TRIMs as allowed
by other factors (currently this is slocts in the NVMe controller).
This mariginally helps the read latency issues we see with reads,
but sets the stage for the nda driver to do TRIM collapsing like the
da and ada drivers do today.
Sponsored by: Netflix
To help implement a policy of 'queue all trims until next I/O sched
tick' policy to help coalesce them, note when we tick so we can do
something special on the first call after the tick to get more work.
Sponsored by: Netflix
While the code for ada and da both assume that the trim list is
ordered when doing the coaleascing the TRIMs, it turns out that
creating the sorted list uses more resources than are saved by having
slightly fewer trims sent to the device.
Sponsored by: Netflix
There's no compelling reason to return a cam_status type for this
function and doing so only creates confusion with normal C
coding practices. It's technically an API change, but the periph API
isn't widely used. No efffective change to operation.
Reviewed by: imp, mav, ken
Sponsored by: Netflix
Differential Revision: D14063
here. Return ENOMEM when we can't malloc a buffer for the DSM
TRIM. This should fix the WITNESS warnings similar to the following:
uma_zalloc_arg: zone "16" with the following non-sleepable locks held:
exclusive sleep mutex CAM device lock (CAM device lock) r = 0 (0xfffff800080c34d0) locked @ /usr/src/sys/cam/nvme/nvme_da.c:351
Reviewed by: scottl@
Sponsored by: Netflix
virtual address sizes
Summary:
Some architectures use physical addresses larger than virtual. This is the
minimal changeset needed to get CAM/CTL to build on these targets. No
functional changes. More changes would likely be needed for this to be fully
functional on said platforms, but they can be made when needed.
Reviewed By: mav, chuck
Differential Revision: https://reviews.freebsd.org/D14041
leaks. We assume each source can be taken / dropped only once and
don't recurse. These are only enabled via DA_TRACK_REFS or
INVARIANTS. There appreas to be a reference leak under extreme load,
and these should help us colaberatively work it out. It also documents
better the reference / holding protocol better.
Reviewed by: ken@, scottl@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14040
device still wind up in xpt_done after the path has been
invalidated. Since we don't always need sim or devq, add some guard
rails to only fail if we have to use them.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14040
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.
MFC after: 3 weeks
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Make it possible to retrieve mmc parameters via the XPT_GET_ADVINFO
call instead. Convert camcontrol to the new scheme.
Reviewed by: imp. kibab
Sponsored by: Netflix
Differential Revision: D13868
Move the the NetBSD tag near to the related licence. Update it to reflect
better the point where we started diverging.
Use grouping parenthesis for the SPDX tag.
No functional change.
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
Ths change consists of two parts.
geom_disk: deny opening a disk for writing if it's marked as
write-protected. A new disk(9) flag is added to mark write protected
disks. A possible alternative could be to add another parameter to d_open,
so that the open mode could be passed to it and the disk drivers could
make the decision internally, but the flag required less churn.
scsi_da: add a new phase of disk probing to query the all pages mode
sense page. We can determine if the disk is write protected using bit 7
of the device specific field in the mode parameter header returned by
MODE SENSE.
PR: 224037
Reviewed by: mav
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D13360
shutdown_post_sync event. For adashutdown, this causes problems
because we need to poll for completion of the commands, but we're not
yet officially dumping yet, so the code from r326964 assumed we could
use the interrupt-driven commands rather than the polled ones. This
lead to a hang. Prevent this by also checking to see if the scheduler
is stopped to do the polling.
Reported by: markj@
Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D13845
inline ccbque.h into scsi_low.h. The file isn't MD, so shouldn't live
in i386/isa. It's only used by scsi_low, so move it there so no new
clients accidentally grow. scsi_low may not even still work, and the
locking here is still SPL based. CAM should do the right thing, but
I've received no reports of these cards still working. At least it
compiles still and there's one fewer files in sys/i386/isa. While I'm
here, ansify and de-splize. CCB_MWANTED appears to be a clear-only
flag, but I've not changed that.
Differential Review: https://reviews.freebsd.org/D13672
from the ada and da dump routines. This avoids difficult locking
problems from needing to be handled. While it might seem like this
would leave the periphs unprotected during dump, they were aleady
at risk of unexpected removal due to the dump functions not
keeping refcount state across the many calls that come in during
a dump. This is an exercise for future work.
Obtained from: Netflix
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
changed worked to capture dumps for me. However, the test for
SCHEDULER_STOPPED() isn't right. We can also call the dump routine
from ddb, in which case the scheduler is still running. This leads to
an assertion panic that we're sleeping when we shouldn't. Instead, use
the proper test for dumping or not. This brings us in line with other
places that do special things while we're doing polled I/O like this.
Noticed by: pho@
Differential Revision: https://reviews.freebsd.org/D13531
This provides a nice wrarpper around the XPT_PATH_INQ ccb creation and
calling.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13387
kernel scheduler is stopped, replace the by hand calling of
xpt_polled_action() with it.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13388
adagetparams contains a sign-extension error that will cause the sector
count to be incorrectly calculated for ATA disks of >=1TiB that still use
CHS addressing. Disks using LBA48 addressing are unaffected.
Reported by: Coverity
CID: 1007296
Reviewed by: ken
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13198
currently harmless for AC_UNIT_ATTENTION event (cam_periph_async does
nothing with them), it's still in error because if it were to start in
the future, it would be done twice.
Sponsored by: Netflix
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Like its predecessor ST8000AS0002, this is a drive-managed SMR drive, but
doesn't declare that in its ATA identify data.
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
sys/cam/scsi/scsi_da.c
Complete BIO_FLUSH commands immediately if the da(4) device hasn't
been written to since the last flush. If we haven't written to the
device, there is no reason to send a flush.
Submitted by: gibbs
Reviewed by: imp
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13106
* Wrongly matches strings that are shorter than the pattern
* Fails to match negative character sets
* Fails to match character sets that aren't at the end of the pattern
* Fails to match character ranges
Reviewed by: imp
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13173
In scsi_dev_advinfo(), if the physical path is being stored and there is a
malloc failure (malloc(9) is called with M_NOWAIT), we could wind up in a
situation where the device's physpath_len is set to the length the user
provided, but the physpath itself is NULL.
If another context then comes in to fetch the physical path value, we would
wind up trying to memcpy a NULL pointer into the caller's buffer.
So, set the physpath_len to 0 when we free the physpath on entry into the
store case for the physical path. Reset the length to a non-zero value only
after we've successfully malloced a buffer to hold it.
Submitted by: ken
Reviewed by: asomers
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
In xpt_bus_register(), remove superfluous call to free(). This was mostly
benign since free(9) checks for NULL before doing anything, and
xpt_create_path() is nice enough to NULL out the pointer on failure.
However, it could've segfaulted if malloc(9) failed during
xpt_create_path().
Submitted by: gibbs
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
information for that and XPT_PATH_INQ. Provide macros to encode/decode
major/minor versions. Read the link speed and lane count to compute
the base_transfer_speed for XPT_PATH_INQ.
Sponsored by: Netflix
We must send either an IDLE IMMEDIATE or a STANDBY IMMEDIATE to drives
on warm boot so their SMART and other volatile data is
persisted. However, for a warm boot we don't want to send STANDBY
IMMEDIATE to some spinning drives because they will spin down. If
there's a lot of these drives on the system, that can cause a
thundering herd problem at startup time (that in extreme cases causes
timeout in device discovery).
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D12811
To save SMART data and for a drive to understand that it's been nicely
shutdown, we need to send a STANDBY IMMEDIATE. Modify adaspindown to
use a local CCB on the stack. When we're panicing, used
xpt_polled_action rather than cam_periph_runccb so that we can SEND
IMMEDIATE after we've shutdown the scheduler.
Sponsored by: Netflix
Reviewed by: scottl@, gallatin@
Differential Revision: https://reviews.freebsd.org/D12799
When limiting I/O, a value of 0 makes no sense as a limit. No progress
can be made. Trade the possibility that someone might be doing
something clever to achieve ultra-low I/O limits vs the damage of not
ever making progress on an I/O in favor of making progress. Now the
machine won't be useless if this accidentally gets requested.
Sponsored by: Netflix
allocations (for req and ccb, which ultimately contain the
nvme_cmd). As such, we can micro-optimize these routines. Add a
comment to this effect, and bzero the ccb used to make the requests
for the nda dump rotuine so it more closely matches a ccb allocated
with xpt_get_ccb().
Sponsored by: Netflix
Prevent cam_iosched_iops_tick() from discarding 'unspent' ios unless
it's a new accounting interval.
Previously ios that weren't used between ticks were lost, as a result
the iops limiter could enforce a limit below the configured maximum.
Obtained from: ElectroBSD
Submitted by: Fabian Keil
PR: 221974
Previously the iops limiter would always allow at least
quanta ios per second as cam_iosched_iops_tick() never set
ios->l_value1 below 1.
Submitted by: Fabian Keil <fk@fabiankeil.de>
Obtained from: ElectroBSD
PR: 221974
Previously ios->current was set to 0 until the first
cam_iosched_cl_maybe_steer() call.
PR: 221954
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12349
Previously callout_reset() was called with a "ticks" value that was
off by one. As a result cam_iosched_ticker() was called a bit too
frequently: On systems with hz=1000 a quanta value of 200 resulted in
~250 calls and a value of 100 in ~111 calls.
For the "queue_depth" and "bandwidth" limiters the difference doesn't
matter but the "iops" limiter depends on the scheduling to enforce the
correct maximum.
PR: 221956
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12350
Invalid values can result in devision-by-zero panics or other
undefined behaviour so lets not allow them.
PR: 221957
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12351
Use the write queue for BIO_ZONE commands so they can't get executed
ahead of writes that were sent after them. More generally, since they
introduce strong ordering into the list, they need to go to the write
queue (which is the only queue that BIO_ORDERED is honored for at the
moment). In fact, fix mismatch between queueing and dequeueing code by
changing this to queue all non-reads (and non-trims) to the write
queue.
As a side effect this prevents the kernel message:
kernel: Found bio_cmd = 0x9
which cam_iosched_next_bio() emits when finding commands
other than BIO_READ in the read queue.
PR: 221973
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12353
kern.features.mmcam will be present and equal to 1 if the kernel has been
compiled with option MMCCAM.
This will help sdio-related userland tools to fail-fast if running on the kernel
without MMCCAM enabled.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12386
Don't call cam_iosched_trim_done or cam_iosched_submit_trim for nda
since its hardware can handle almost an arbitrary number of TRIMs and
we don't have to be careful to only ever do one.
Sponsored by: Netflix
It's intended only for those situations where the periph driver
ones to limit the number of trims active to one and only one.
Also update comments on associated functions.
Sponsored by: Netflix
* Demote the level of several debug messages to CAM_DEBUG_TRACE
* Add detection for SDHC cards that can do 1.8V. No voltage switch sequence
is issued yet;
* Don't create a separate LUN for each SDIO function. We need just one to make
pass(4) attach;
* Remove obsolete mmc_sdio* files. SDIO functionality will be moved into the
separate device that will manage a new sdio(4) bus;
* Terminate probing if got no reply to CMD0;
* Make bcm2835 SDHCI host controller driver compile with 'option MMCCAM'.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12109
The cam_iosched_ticker() can't be scheduled more than once per tick.
Some limiters depend on quanta matching the number of calls per second
to enforce the proper limits. Limit the quanta to no faster than 1 per
clock tick. This fixes some features when running in VMs where the
default HZ is 100.
PR: 221953
Obtained from: ElectroBSD
Differential Revision: https://reviews.freebsd.org/D12337
Submitted by: Fabian Keil
It's awkward to have spaces in CAM device serial numbers. That leads to
such things as device nodes named "/dev/diskid/MYSERIAL%20%20%201". Better
to replace the spaces with "0"s. This change only affects the default
serial numbers for users who don't provide their own.
Reviewed by: ken, mav
MFC after: Never
Relnotes: Yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D12263
When bcopy is treated as memcpy/memmove, Clang produces warnings that the
size argument doesn't match the type of the source. This is true, it
doesn't match; we're aliasing the source.
Explicitly cast the source pointer to the expected type to remove the
warning.
No functional change.
Sponsored by: Dell EMC Isilon
This patch changes the way XPT_GDEV_TYPE works for NVMe. The current
ccb_getdev structure includes pointers to the NVMe Identify Controller
and Namespace structures, but these are kernel virtual addresses which
are not accessible from user space.
As an alternative, the patch changes the pointers into padding in
ccb_getdev and adds two new types to ccb_dev_advinfo to retrieve the
Identify Controller (CDAI_TYPE_NVME_CNTRL) and Namespace
(CDAI_TYPE_NVME_NS) data structures.
Reviewed By: rpokala, imp
Differential Revision: https://reviews.freebsd.org/D10466
Submitted by: Chuck Tuffli
Tuffli had submitted a more thorough patch that I was unaware of when
I did my work and this brings in the bits I missed from that patch.
PR: 220267
Submitted by: Chuck Tuffli
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.
Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
extreme outliers from dodgy drives. Adjust comments to reflect this,
and make sure that the number of latency buckets match in the two
places where it matters.
o Allow I/O scheduler to gather times on 32-bit systems. We do this by shifting
the sbintime_t over by 8 bits and truncating to 32-bits. This gives us 8.24
time. This is sufficient both in range (256 seconds is about 128x what current
users need) and precision (60ns easily meets the 1ms smallest bucket size
measurements). 64-bit systems are unchanged. Centralize all the time math so
it's easy to tweak tha range / precision tradeoffs in the future.
o While I'm here, the I/O scheduler should be using periph_data rather than
sim_data since it is operating on behalf of the periph.
Differential Review: https://reviews.freebsd.org/D12119
All ndaX and ndaXpY nodes will appear as nvdX and nvdXpY as well
(through symlinks in devfs via the normal disk aliasing mechanism in
GEOM).
Differential Revision: https://reviews.freebsd.org/D11873
The attached patch lets adaasync() set ADA_STATE_WCACHE based on
ADA_FLAG_CAN_WCACHE instead of ADA_FLAG_CAN_RAHEAD.
This fixes a regression introduced in r300207 which changed
the flag names.
PR: 220948
Submitted by: Fabian Keil <fk@fabiankeil.de>
Obtained from: ElectroBSD
MFC after: 1 week
the IO type (Admin or NVM) using XPT op-codes XPT_NVME_ADMIN or
XPT_NVME_IO.
Submitted by: Chuck Tuffli <chuck@tuffli.net>
Differential Revision: https://reviews.freebsd.org/D10247
Implement the MMC/SD/SDIO protocol within a CAM framework. CAM's
flexible queueing will make it easier to write non-storage drivers
than the legacy stack. SDIO drivers from both the kernel and as
userland daemons are possible, though much of that functionality will
come later.
Some of the CAM integration isn't complete (there are sleeps in the
device probe state machine, for example), but those minor issues can
be improved in-tree more easily than out of tree and shouldn't gate
progress on other fronts. Appologies to reviews if specific items
have been overlooked.
Submitted by: Ilya Bakulin
Reviewed by: emaste, imp, mav, adrian, ian
Differential Review: https://reviews.freebsd.org/D4761
merge with first commit, various compile hacks.
If a peripheral driver (e.g. da, sa, cd) is added or removed from the
peripheral driver list while an unrelated peripheral driver instance (e.g.
da0, sa5, cd2) is going away and is inside camperiphfree(), we could
dereference an invalid pointer.
When peripheral drivers are added or removed (see periphdriver_register()
and periphdriver_unregister()), the peripheral driver array is resized
and existing entries are moved.
Although we hold the topology lock while we traverse the peripheral driver
list, we retain a pointer to the location of the peripheral driver pointer
and then drop the topology lock. So we are still vulnerable to the list
getting moved around while the lock is dropped.
To solve the problem, cache a copy of the peripheral driver pointer. If
its storage location in the list changes while we have the lock dropped, it
won't have any effect.
This doesn't solve the issue that peripheral drivers ("da", "cd", as opposed
to individual instances like "da0", "cd0") are not generally part of a
reference counting scheme to guard against deregistering them while there
are instances active. The caller (generally the person unloading a module)
has to be aware of active drivers and not unload something that is in use.
sys/cam/cam_periph.c:
In camperiphfree(), cache a pointer to the peripheral driver
instance to avoid holding a pointer to an invalid memory location
in the event that the peripheral driver list changes while we have
the topology lock dropped.
PR: kern/219701
Submitted by: avg
MFC after: 3 days
Sponsored by: Spectra Logic
Without the allocation length set, the target will either reject
the command or complete it without transferring any data.
This fixes the REPORT ZONES command for SCSI ZBC protocol devices,
as well as ATA ZAC protocol devices that are behind a SCSI to ATA
translation layer. (LSI/Broadcom's 12Gb SAS adapters translate ZBC
commands to ZAC commands.) Those are Host Aware and Host Managed SMR
drives.
This will fix REPORT ZONE commands sent to the da(4) driver via the
GEOM bio interface and zonectl, and REPORT ZONE commands sent from
camcontrol(8).
Note that in the case of camcontrol(8), we currently only send
SCSI ZBC commands to native SCSI protocol devices, not ATA devices
behind a SAT layer.
sys/cam/scsi/scsi_da.c:
Fill in the length field in scsi_zbc_in().
MFC after: 3 days
Sponsored by: Spectra Logic
After r307132 the sbuf buffer is malloc()ed, but corresponding
sbuf_delete() call was missing.
Fix a nearby whitespace bug.
MFC after: 3 days
Sponsored by: Dell EMC Isilon
If the user issues a MTIOCEXTGET ioctl, and the tape drive in question has
a serial number that is longer than 80 characters, we malloc a buffer in
saextget() to hold the output of cam_strvis().
Since a mutex is held in that codepath, doing a M_WAITOK malloc could lead
to sleeping while holding a mutex. Change it to a M_NOWAIT malloc and bail
out if we fail to allocate the memory. Devices with serial numbers longer
than 80 bytes are very rare (I don't recall seeing one), so this
should be a very unusual case to hit. But it is a bug that should be fixed.
sys/cam/scsi/scsi_sa.c:
In saextget(), if we need to malloc a buffer to hold the output of
cam_strvis(), don't wait for the memory. Fail and return an error
if we can't allocate the memory immediately.
PR: kern/220094
Submitted by: Jia-Ju Bai <baijiaju1990@163.com>
MFC after: 3 days
Sponsored by: Spectra Logic
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
The Genesys chip is failing when issueing READ_CAP(16) command.
Force a quirk to disable it and use READ_CAP(10) instead.
Also, depending on used firmware, GL3224 can be recognized
either as 'storage device' or 'mass storage class' -
enable both variants in scsi_quirk_table.
Submitted by: Wojciech Macek <wma@semihalf.com>
Konrad Adamczyk <ka@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield
Reviewed by: mav
Differential revision: https://reviews.freebsd.org/D10902
The motivation for this is two-fold.
1. Some old WD SATA disks may appear as if they need to be spun up
when they are already spinning. Those disks would respond with
an error to the spin-up request.
2. Even if we really fail to spin up the disk, we still can try to
proceed to the subsequent phases. If we fail later on, then no
difference. Otherwise we get a chance to communicate with the
disk which is better than completely ignoring it, because a user
can try to recover the disk.
Reviewed by: mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D10896
This will help application developers simulate end of tape conditions.
To inject an error in sa0:
sysctl kern.cam.sa.0.inject_eom=1
This will return the next read or write request queued with 0 bytes
written. Any subsequent writes or reads will go along as usual.
This will also cause the early warning position flag to get set
for the next position query. So, 'mt status' will show the BPEW
(Beyond Programmable Early Warning) flag on the first query after
an error injection. After that, the position flags will be as they
are in the underlying tape drive.
Also, update the sa(4) man page to describe tape parameters,
which can be set via 'mt param'.
sys/cam/scsi/scsi_sa.c:
In saregister(), create the inject_eom sysctl variable.
In sastart(), check to see whether inject_eom is set. If
so, return the read or write with 0 bytes written to
indicate EOM. Set the set_pews_status flag so that we
fake PEWS status in the next position call for reads, and the
next 3 calls for writes. This allows the user to see the BPEW
flag one time via 'mt status'.
In sagetpos(), check the set_pews_status flag and fake
PEWS status and decrement the counter if it is set.
share/man/man4/sa.4:
Document the inject_eom sysctl variable.
Document all of the parameters currently supported via
'mt param'.
usr.bin/mt/mt.1:
Point the user to the sa(4) man page for more details on
supported parameters.
MFC after: 3 days
Sponsored by: Spectra Logic
sys/cam/scsi/scsi_all.h:
Add the SCSI Solid State Media log page (0x11) structure
definition. This gives the percentage used (in terms of
lifetime flash wear) of an SSD.
MFC after: 3 days
Sponsored by: Spectra Logic
After FreeBSD SVN revision 236814, the pass(4) driver changed from
only doing error recovery when the CAM_PASS_ERR_RECOVER flag was
set on a CCB to sometimes doing error recovery if the passed in
retry count was non-zero.
Error recovery would happen if two conditions were met:
1. The error recovery action was simply a retry. (Which is most
cases.)
2. The retry_count is non-zero. (Which happened a lot because of
cut-and-pasted code.)
This explains a bug I noticed in with camcontrol:
# camcontrol tur da34 -v
Unit is ready
# camcontrol reset da34
Reset of 1:172:0 was successful
At this point, there should be a Unit Attention:
# camcontrol tur da34 -v
Unit is ready
No Unit Attention.
Try it again:
# camcontrol reset da34
Reset of 1:172:0 was successful
Now set the retry_count to 0 for the TUR:
# camcontrol tur da34 -v -C 0
Unit is not ready
(pass42:mps1:0:172:0): TEST UNIT READY. CDB: 00 00 00 00 00 00
(pass42:mps1:0:172:0): CAM status: SCSI Status Error
(pass42:mps1:0:172:0): SCSI status: Check Condition
(pass42:mps1:0:172:0): SCSI sense: UNIT ATTENTION asc:29,2 (SCSI bus reset occurred)
(pass42:mps1:0:172:0): Field Replaceable Unit: 2
There is the unit attention. camcontrol(8) has a default
retry_count of 1, in case someone sets the -E flag without
setting -C.
The CAM_PASS_ERR_RECOVER behavior was only broken with the
CAMIOCOMMAND ioctl, which is the synchronous pass(4) API. It has
worked as intended (error recovery is only done when the flag
is set) in the asynchronous API (CAMIOQUEUE ioctl).
sys/cam/scsi/scsi_pass.c:
In passsendccb(), when calling cam_periph_runccb(), only
specify the error routine when CAM_PASS_ERR_RECOVER is set.
share/man/man4/pass.4:
Document that CAM_PASS_ERR_RECOVER is needed to enable
error recovery.
Reported by: Terry Kennedy <TERRY@glaver.org>
PR: kern/218572
MFC after: 1 week
Sponsored by: Spectra Logic
sys/cam/scsi/scsi_all.c:
In the asc_table, if we get a 0x20,0x02 error ("Access denied -
no access rights"), don't bother retrying. Instead, immediately
fail the command.
This is the error returned by Self Encrypting Drives (SED) when
they are locked.
MFC after: 3 days
Sponsored by: Spectra Logic
using a driver-supplied sbuf for printing device discovery
announcements. This helps ensure that messages to the console
will be properly serialized (through sbuf_putbuf) and not be
truncated and interleaved with other messages. The
infrastructure mirrors the existing xpt_announce_periph()
entry point and is opt-in for now. No content or formatting
changes are visible to the operator other than the new coherency.
While here, eliminate the stack usage of the temporary
announcement buffer in some of the drivers. It's moved to the
softc for now, but future work will eliminate it entirely by
making the code flow more linear. Future work will also address
locking so that the sbufs can be dynamically sized.
The scsi_da, scs_cd, scsi_ses, and ata_da drivers are converted
at this point, other drivers can be converted at a later date.
A tunable+sysctl, kern.cam.announce_nosbuf, exists for testing
purposes but will be removed later.
TODO:
Eliminate all of the code duplication and temporary buffers. The
old printf-based methods will be retired, and xpt_announce_periph()
will just be a wrapper that uses a dynamically sized sbuf. This
requires that the register and deregister paths be made malloc-safe,
which they aren't currently.
Sponsored by: Netflix
According to Warner, multiple TRIM BIOs are collapsed into a single CCB with
NULL bp. It is invalid to biotrack() NULL, and results in a fault. So,
don't do that.
Reported by: asomers@
Sponsored by: Dell EMC Isilon
The goal of this work is to remove the explicit dependency for ctl(4)
on iscsi(4), so end-users without iscsi(4) support in the kernel can
use ctl(4) for its other functions.
This allows those without iscsi(4) support built into the kernel to use
ctl(4) as a test mechanism. As a sidenote, this was possible around the
10.0-RELEASE period, but made impossible for end-users without iscsi(4)
between 10.0-RELEASE and 11.0-RELEASE.
Automatically load cfiscsi(4) from ctladm(8) and ctld(8) for backwards
compatibility with previously releases. The automatic loading feature is
compiled into the beforementioned tools if MK_ISCSI == yes when building
world.
Add a manpage for cfiscsi(4) and refer to it in ctl(4).
Differential Revision: D10099
MFC after: 2 months
Relnotes: yes
Reviewed by: mav, trasz
Sponsored by: Dell EMC Isilon
I think this message is not very useful for end user. Also its formatting
does not match other messages printed at that time. Those who really need
this information can always find it in `camcontrol negotiate daX -v`.
MFC after: 2 weeks
For three years now CAM does not use SIM lock, but still enforces SIM to
use it. Remove this requirement, allowing SIMs to have any locking they
prefer, if they pass no mutex to cam_sim_alloc().
MFC after: 2 weeks
Some SIMs report much less untagged device openings then tagged ones.
Target mode devices are not handled by regular probing routines, and so
there is nothing to increase queue size for them to the SIM's maximum.
To fix that resize the queue explicitly on ctl periph registration.
This radically improves performance of mpt(4) in target mode.
Also fetch and report device queue statistics in `ctladm dumpstructs`,
since regular way of `camcontrol tags` is not usable in target mode.
MFC after: 2 weeks
Queue statistics has nothing to do with presence or absence of INQUIRY
data, etc. Target mode devices are never configured, but have queues.
MFC after: 2 weeks
Some SIMs may not abort them implicitly, that either fail the LUN disable
request or just make us wait for those CCBs forever. With this change
I can successfully disable LUNs on mpt(4). For isp(4), which aborts them
implicitly, this change should be irrelevant.
MFC after: 2 weeks
Report UNMAP granularity as stripesize/-offset if we have no other values
to report there.
Add new quirk DA_Q_STRICT_UNMAP for cases when target is too critical to
misaligned UNMAP request, reporting errors instead of being suboptimal.
Setting this quirk makes da periph to forcefully align all UNMAP requests
to avoid those errors by the cost of some odd ranges not being UNMAP'ed.
This makes UNMAP usable within VMware 6.x VMs, just now 100% efficient.
MFC after: 2 weeks
For now it allows to unload CTL kernel module if there are no target-capable
SIMs in CAM. As next step full teardown of CAM targets can be implemented.
CAM_UNLOCKED is internal flag and cannot correctly be set by userland.
Return EINVAL from CAMIOCOMMAND and CAMIOQUEUE if it is set.
Also fix leaks in some of the error paths for CAMIOQUEUE.
PR: 215356
Reviewed by: ken, mav
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D9869
The biggest change is that ctl_remove_initiator() now generates I_T NEXUS
LOSS event, cleaning part of LUs state related to the initiator.
MFC after: 2 weeks
If we asked to send sense data by setting CAM_SEND_SENSE, but SIM didn't
confirm transmission by setting CAM_SENT_SENSE, assume it was not sent.
Queue the I/O back to CTL for later REQUEST SENSE with ctl_queue_sense().
This is needed for error reporting on SPI HBAs like ahc(4)/ahd(4).
MFC after: 2 weeks
This code was disabled due to its high memory usage. But now we need this
functionality for cfumass(4) frontend, since USB MS BBB transport does not
support autosense.
MFC after: 2 weeks
When LUN is disabled, SIM starts returning queued ATIOs/INOTs. But at the
same time there can be some ATIOs/INOTs still carrying real new requests.
If we free those, SIM may leak some resources, forever expecting for any
response from us. So try to be careful, separating ATIOs/INOTs carrying
requests which still must be processed, from ATIOs/INOTs completed with
errors which can be freed.
MFC after: 2 weeks
Before this change XCOPY code could allocate memory in chunks up to 16-32MB
(VMware does XCOPY in 4MB chunks by default), that could be difficult for
VM subsystem to do due to KVA fragmentation, that sometimes created huge
allocation delays, blocking any I/O for respective LU for that time.
This change limits allocations down to TPC_MAX_IO_SIZE, which is 1MB now.
1MB is also not a cookie, but ZFS also can do that for large blocks, so
it should be less dramatic. As drawback this increases CPU overhead, but
it still look acceptable comparing to time consumed by ZFS read/write.
MFC after: 1 week
sys/cam/ctl/ctl.c:
In ctl_datamove(), inside CTL_IO_DELAY, add a lun variable and fill
it in before trying to dereference it.
MFC after: 3 days
Sponsored by: Spectra Logic
Before this change MaxCmdSN was reported as CmdSN + delta, that made it
limit number of requests in transmission from the initiator to target,
that was pretty useless. After this change MaxCmdSN limits number of
requests queued to CTL, i.e. maximal queue depth for the initiator.
The default limit is 256 outstanding requests per initiator at a time.
This code uses existing cs_outstanding_ctl_pdus counter to track queue
depth. It's semantics doen't perfectly match, but close enough to not
add another counter. Just don't set the maxtags below 2.
MFC after: 2 weeks
all of them in terms of an sbuf-based back-end, xpt_path_sbuf. This
unifies the implementation, but more importantly it stops the output
fropm being split between 4 or more invocations of printf. The
multiple invocations cause interleaving of the messages on the
console during boot, making the output of disk discovery often
unintelligible. This change helps a lot, but more work is needed.
Reviewed by: ken, mav
Sponsored by: Netflix
If "capacity" LU option is set, ramdisk backend now implements featured
thin provisioned disk, storing data in malloc(9) allocated memory blocks
of pblocksize bytes (default PAGE_SIZE or 4KB). Additionally ~0.2% of LU
size is used for indirection tree (bigger pblocksize reduce the overhead).
Backend supports all unmap and anchor operations. If configured capacity
is overflowed, proper error conditions are reported.
If "capacity" LU option is not set, the backend operates mostly the same
as before without allocating real storage: writes go to nowhere, reads
return zeroes, reporting that all LBAs are unmapped.
This backend is still mostly oriented on testing and benchmarking (it is
still a volatile RAM disk), but now it should allow to run real FS tests,
not only simple dumb dd.
MFC after: 2 weeks
It is only a first step and not perfect, but better then nothing.
The main blocker is CAM target frontend, that can not be unloaded,
since CAM does not have mechanism to unregister periph driver now.
MFC after: 2 weeks
The length of the scsi_set_timestamp_parameters struct was incorrect. LTO-5
drives don't care, but LTO-7 drives do.
Reviewed by: Sam Klopsch
MFC after: 2 weeks
Sponsored by: Spectra Logic Corp
This field has no practical use and never readed. Initiators already
receive respective residual size from frontends. Removed field had
different semantics, which looks useless, and was never passed through
by any frontend.
While there, fix kern_data_resid field support in case of HA, missed in
r312291.
MFC after: 13 days
It seems like kern_data_resid was never really implemented. This change
finally does it. Now frontends update this field while transferring data,
while CTL/backends getting it can more flexibly handle the result.
At this point behavior should not change significantly, still reporting
errors on write overrun, but that may be changed later, if we decide so.
CAM target frontend still does not properly handle overruns due to CAM API
limitations. We may need to add some fields to struct ccb_accept_tio to
pass information about initiator requested transfer size(s).
MFC after: 2 weeks
Replace archaic "busses" with modern form "buses."
Intentionally excluded:
* Old/random drivers I didn't recognize
* Old hardware in general
* Use of "busses" in code as identifiers
No functional change.
http://grammarist.com/spelling/buses-busses/
PR: 216099
Reported by: bltsrc at mail.ru
Sponsored by: Dell EMC Isilon
Instead of collecting statistics for each combination of ports and logical
units, that consumed ~45KB per LU with present number of ports, collect
separate statistics for every port and every logical unit separately, that
consume only 176 bytes per each single LU/port. This reduces struct
ctl_lun size down to just 6KB.
Also new IOCTL API/ABI does not hardcode number of LUs/ports, and should
allow handling of very large quantities.
MFC after: 2 weeks (probably keeping old API enabled for some time)
This array takes 64KB of RAM now, that was more then half of struct ctl_lun
size. If at some point we support more ports, this may need another tune.
MFC after: 2 weeks
The sim_vid, hba_vid, and dev_name fields of struct ccb_pathinq are
fixed-length strings. AFAICT the only place they're read is in
sbin/camcontrol/camcontrol.c, which assumes they'll be null-terminated.
However, the kernel doesn't null-terminate them. A bunch of copy-pasted code
uses strncpy to write them, and doesn't guarantee null-termination. For at
least 4 drivers (mpr, mps, ciss, and hyperv), the hba_vid field actually
overflows. You can see the result by doing "camcontrol negotiate da0 -v".
This change null-terminates those fields everywhere they're set in the
kernel. It also shortens a few strings to ensure they'll fit within the
16-character field.
PR: 215474
Reported by: Coverity
CID: 1009997 1010000 1010001 1010002 1010003 1010004 1010005
CID: 1331519 1010006 1215097 1010007 1288967 1010008 1306000
CID: 1211924 1010009 1010010 1010011 1010012 1010013 1010014
CID: 1147190 1010017 1010016 1010018 1216435 1010020 1010021
CID: 1010022 1009666 1018185 1010023 1010025 1010026 1010027
CID: 1010028 1010029 1010030 1010031 1010033 1018186 1018187
CID: 1010035 1010036 1010042 1010041 1010040 1010039
Reviewed by: imp, sephe, slm
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D9037
Differential Revision: https://reviews.freebsd.org/D9038
observed to fix any actual error, but it's the right thing to do
from the correctness point of view.
Tested by: Eugene M. Zheganin <emz at norma.perm.ru>
MFC after: 1 month
- Since I/Os are allocates from per-port pools, make allocations store
pointer to CTL softc there, and use it where needed instead of global.
- Created bunch of helper macros to access LUN, port and CTL softc.
MFC after: 2 weeks
We do not have non-volatile memory to really save those values, so we
neither report nor support this capability. Also saved mode pages are
not replicated between HA peers now.
MFC after: 2 weeks
Those two values are not directly related, so make them independent.
This does not change any limits immediately, but makes number of LUNs
per port controllable via tunable/sysctl kern.cam.ctl.lun_map_size.
After this change increasing CTL_MAX_LUNS should be pretty cheap,
and even making it tunable should be easy.
MFC after: 2 weeks
For EXTENDED COPY:
- improve parameters checking to report some errors before copy start;
- forward sense data from copy target as descriptor in case of error;
- report which CSCD reported error in sense key specific information.
For WRITE USING TOKEN:
- pass through real sense data from copy target instead of reporting
our copy error, since for initiator its a "simple" write, not a copy.
MFC after: 2 weeks