Commit Graph

2298 Commits

Author SHA1 Message Date
Alexander Motin
024932aae9 Use atomic for start_count in devstat_start_transaction().
Combined with earlier nstart/nend removal it allows to remove several locks
from request path of GEOM and few other places.  It would be cool if we had
more SMP-friendly statistics, but this helps too.

Sponsored by:	iXsystems, Inc.
2019-12-30 03:13:38 +00:00
Alexander Motin
c389a786dd Make pass(4) handle misaligned buffers of MAXPHYS size.
Since we are already using malloc()+copyin()/copyout() for smaller data
blocks, and since new asynchronous API does it always, I see no reason
to keep this ugly artificial size/alignment limitation in old API.

Tape applications suffer enough from the MAXPHYS limitations by itself,
and additional alignment requirement, often halving effectively usable
block size, does not help.

It would be good to use unmapped I/O here instead, but it require some
HBA drivers polishing first to support non-BIO unmapped buffers.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-12-23 20:41:55 +00:00
Warner Losh
ece56614c8 Revert r355833
While it works on nda, it fails on ada and/or da for at least zfs with a modify
after free issue on a trim BIO. Revert while I rework it to fix those devices.
2019-12-17 21:53:22 +00:00
Warner Losh
359e4dba07 Revert r355831
It wasn't supposed to change the defaults, but actually does. Back this out
until that can be sorted out.
2019-12-17 04:21:35 +00:00
Warner Losh
0d83f8dc1f Implement bio_speedup
React to the BIO_SPEED command in the cam io scheduler by completing
as successful BIO_DELETE commands that are pending, up to the length
passed down in the BIO_SPEEDUP cmomand. The length passed down is a
hint for how much space on the drive needs to be recovered. By
completing the BIO_DELETE comomands, this allows the upper layers to
allocate and write to the blocks that were about to be trimmed. Since
FreeBSD implements TRIMSs as advisory, we can eliminliminate them and
go directly to writing.

The biggest benefit from TRIMS coomes ffrom the drive being able t
ooptimize its free block pool inthe log run. There's little nto no
bene3efit in the shoort term. , sepeciall whn the trim is followed by
a write. Speedup lets  us make this tradeoff.

Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351
2019-12-17 00:13:45 +00:00
Warner Losh
7918ea40a5 Eliminate the TRIM_ACTIVE flag.
Rather than a trim active flag, have a counter that can be used to
have a absolute limit on the number of trims in flight independent of
any I/O limiting factors.

Sponsored by: Netflix
2019-12-17 00:13:30 +00:00
Warner Losh
3aba1d47c8 Tweak the ddb show cam iosched command a bit.
For each of the different queue types, list the name of the
queue. While it can be worked out from context, this makes it more
useful and clearer.

Sponsored by: Netflix
2019-12-17 00:13:26 +00:00
Warner Losh
c6171b4440 Add rate limiters to TRIM.
Add rate limiters to trims. Trims are a bit different than reads or
writes in that they can be combined, so some care needs to be taken
where we rate limit them. Additional work will be needed to push the
working rate limit below the I/O quanta rate for things like IOPS.

Sponsored by: Netflix
2019-12-17 00:13:21 +00:00
Warner Losh
211b0f2dca NVME trim stuff.
Add two sysctls to control pacing of nvme
trims. kern.cam.nda.X.goal_trim is the number of upper layer
BIO_DEELETE requests to try to collecet before sending TRIM down too
the nvme drive. trim_ticks is the number of ticks, at mosot, to wait
for at least goal_trim BIOS_DELEETE requests to come in.

Trim pacing is useful when a large number off disjoint trims are
comoing in from the upper layers. Since we have no way to chain
toogether trims from the upper layers that are sent down, this acts as
a hueristic to group trims into reasonable sized chunks. What's
reasonable varies from drive to drive.

Sponsored by: Netflix
2019-12-17 00:11:48 +00:00
Warner Losh
83b75bb3cc Revert r355813
It was extracted from a larger tree and is incomplete. Will resubmit after
reworking.
2019-12-16 19:16:26 +00:00
Warner Losh
68e1c49a96 Implement a system-wide limit or da and ada devices for delete.
Excesively large TRIMs can result in timeouts, which cause big
problems. Limit trims to 1GB to mititgate these issues.

Reviewed by: scottl
Differential Revision: https://reviews.freebsd.org/D22809
2019-12-16 18:16:44 +00:00
John Baldwin
5773ac113c Use callout_func_t instead of the deprecated timeout_t.
Reviewed by:	kib, imp
Differential Revision:	https://reviews.freebsd.org/D22752
2019-12-10 22:06:53 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Alan Somers
e083fb08b9 ses: sanitize illegal strings in SES element descriptors
The SES4r3 standard requires that element descriptors may only contain ASCII
characters in the range 0x20 to 0x7e.  Some SuperMicro expanders violate
that rule.  This patch adds a sanity check to ses(4).  Descriptors in
violation will be replaced by "<invalid>".

This patch fixes "sesutil --libxo xml" on such systems.  Previously it would
generate non-well-formed XML output.

PR:		241929
Reviewed by:	allanjude
MFC after:	2 weeks
Sponsored by:	Axcient
2019-12-06 00:06:05 +00:00
Alexander Motin
61322a0a8a Mark some more hot global variables with __read_mostly.
MFC after:	1 week
2019-12-04 21:26:03 +00:00
Warner Losh
f86e60008b Regularize my copyright notice
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctiation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
  All Rights Reserved on same line as other copyright holders (but not
  me). Other such holders are also listed last where it's clear.
2019-12-04 16:56:11 +00:00
Kenneth D. Merry
0c8f059c29 Fix a hang introduced in r351599.
My changes in 351599 (kindly committed by avg) made the cd(4) media check
asynchronous to avoid a sleep while holding a mutex.

There was a difficult to reproduce bug with those changes that caused a
hang on boot on some single processor machines/VMs.  Leandro Lupori
managed to reproduce the bug, diagnose it, and supplied a patch!  Here is
his analysis, from the PR:

======
I was able to reproduce the problem described in comment#14.

Actually, I wasn't trying to reproduce it, I just started seeing it a few
weeks ago, in CURRENT.

I can reproduce it consistently, by using QEMU to run a PowerPC64 VM with a
single core/thread (-smp 1).

It happens only when there is no media in the emulated CD-ROM, a device
that QEMU adds by default, unless -nodefaults is specified in command line.

I've debugged it and this is what I've found:

1- After the CD probe is successful, GEOM will try to open the device,
which will end up calling cdcheckmedia(), that sets CD state to
CD_STATE_MEDIA_PREVENT.
2- Next, scsi_prevent() is executed and succeeds, the CD_FLAG_DISC_LOCKED
flag is set and CD state moves to CD_STATE_MEDIA_SIZE.
3- Next, scsi_read_capacity() is executed and fails, state is set to
CD_STATE_MEDIA_ALLOW, cdmediaprobedone() is called and wakes up
cdcheckmedia().
4- Then, when cdstart() is invoked to process CD_STATE_MEDIA_ALLOW, it
first checks if CD_FLAG_DISC_LOCKED is set, and if so skips directly to
CD_STATE_MEDIA_SIZE state. This will repeat the steps of bullet 3, entering
an infinite MEDIA_SIZE command loop.

When there is a least another core/thread, the GEOM thread that performed
the initial cdopen() will get scheduled again, closing the CD device, that
will call cdprevent(PR_ALLOW) that clears the CD_FLAG_DISC_LOCKED flag and
breaks the loop.

So, apparently, the problem is CD_STATE_MEDIA_ALLOW being skipped when
CD_FLAG_DISC_LOCKED is set. If I understand correctly, in this case, the
state should be advanced to CD_STATE_MEDIA size only when the current state
is CD_STATE_MEDIA_PREVENT.
=====

PR:		kern/219857
Submitted by:	Leandro Lupori <leandro.lupori@gmail.com>
MFC after:	1 week
2019-12-02 19:57:39 +00:00
Alexander Motin
bae3729be4 Do not retry long ready waits if previous gave nothing.
I have some disks reporting "Logical unit is in process of becoming ready"
for about half an hour before finally reporting failure.  During that time
CAM waits for the readiness during ~2 minutes for each request, that makes
system boot take very long time.

This change reduces wait times for the following requests to ~1 second if
previously long wait for that device has timed out.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-11-22 21:31:59 +00:00
Kyle Evans
5b0a8ee218 MMCCAM: defer release of ccb until we're done with it
If we've found a device, we attempt to call xpt_action() on a ccb that's
already been released. Simply defer release until after we're done with it.

Reviewed by:	imp, scottl
MFC after:	1 week
2019-11-22 19:54:14 +00:00
Alexander Motin
7e8baf37e0 Remove xpt_lock mutex.
CAM does not require SIM locks for years, and obviously does not require
it for completely virtual XPT SIM.

MFC after:	2 weeks
2019-11-22 18:55:27 +00:00
Alexander Motin
a4876fbfc3 Make CAM use root_mount_hold_token() to delay boot.
Before this change CAM used config_intrhook_establish() for this purpose,
but that approach does not allow to delay it again after releasing once.

USB stack uses root_mount_hold() to delay boot until bus scan is complete.
But once it is, CAM had no time to scan SCSI bus, registered by umass(4),
if it already done other scans and called config_intrhook_disestablish().
The new approach makes it work smooth, assuming the USB device is found
during the initial bus scan.  Devices appearing on USB bus later may still
require setting kern.cam.boot_delay, but hopefully those are minority.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-11-22 18:39:51 +00:00
Scott Long
f0d6f5774a Remove NEEDGIANT from the scsi_sg /dev node. It likely has not been
needed for many years.

Reported by:	imp
2019-11-22 18:18:36 +00:00
Alexander Motin
cc453b2272 Set handling for some "Logical unit not ready" errors.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-11-20 20:00:03 +00:00
Warner Losh
02fa548cde Fix a race between daopen and damediapoll
When we do a daopen, we call dareprobe and wait for the results. The repoll runs
the da state machine up through the DA_STATE_RC* and then exits.

For removable media, we poll the device every 3 seconds with a TUR to see if it
has disappeared. This introduces a race. If the removable device has lots of
partitions, and if it's a little slow (like say a USB2 connected USB stick),
then we can have a fair amount of time that this reporbe is going on for. If,
during that time, damediapoll fires, it calls daschedule which changes the
scheduling priority from NONE to NORMAL. When that happens, the careful single
stepping in the da state machine is disrupted and we wind up sceduling multiple
read capacity calls. The first one succeeds and releases the reference. The
second one succeeds and releases the reference (and panics if the right code is
compiled into the da driver).

To avoid the race, only do the TUR calls while in state normal, otherwise just
reschedule damediapoll. This prevents the race from happening.
2019-11-13 01:58:43 +00:00
Warner Losh
45fceedf87 Add asserts for some state transitions
For the PROBEWP and PROBERC* states, add assertiosn that both the da device
state is in the right state, as well as the ccb state is the right one when we
enter dadone_probe{wp,rc}. This will ensure that we don't sneak through when
we're re-probing the size and write protection status of the device and thereby
leak a reference which can later lead to an invalidated peripheral going away
before all references are released (and resulting panic).

Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295
2019-11-11 17:36:57 +00:00
Warner Losh
dc1c17691e Update the softc state of the da driver before releasing the CCB.
There are contexts where releasing the ccb triggers dastart() to be run
inline. When da was written, there was always a deferral, so it didn't matter
much. Now, with direct dispatch, we can call dastart from the dadone*
routines. If the probe state isn't updated, then dastart will redo things with
stale information. This normally isn't a problem, because we run the probe state
machine once at boot... Except that we also run it for each open of the device,
which means we can have multiple threads racing each other to try to kick off
the probe. However, if we update the state before we release the CCB, we can
avoid the race. While it's needed only for the probewp and proberc* states, do
it everywhere because it won't hurt the other places.

The race here happens because we reprobe dozens of times on boot when drives
have lots of partitions.  We should consider caching this info for 1-2 seconds
to avoid this thundering hurd.

Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295
2019-11-11 17:36:52 +00:00
Warner Losh
fe95666bab Require and enforce that dareprobe() has to be called with the periph lock held.
Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295
2019-11-11 17:36:47 +00:00
Warner Losh
fb6ea34a3a Fix panic message to indicate right action that was improper.
Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295
2019-11-11 17:36:42 +00:00
Edward Tomasz Napierala
b5961be1ab Add GEOM attribute to report physical device name, and report it
via 'diskinfo -v'.  This avoids the need to track it down via CAM,
and should also work for disks that don't use CAM.  And since it's
inherited thru the GEOM hierarchy, in most cases one doesn't need
to walk the GEOM graph either, eg you can use it on a partition
instead of disk itself.

Reviewed by:	allanjude, imp
Sponsored by:	Klara Inc
Differential Revision:	https://reviews.freebsd.org/D22249
2019-11-09 17:30:19 +00:00
Alexander Motin
45577133ef Remove lock from CTL camsim frontend.
CAM does not need a SIM lock for quite a while, and CTL never needed it.

MFC after:	2 weeks
2019-11-03 00:13:23 +00:00
Brooks Davis
93489854f4 nda(4): Remove unnecessary union and avoid Clang -Wsizeof-array-divwarning
Clang trunk recently gained this new warning, and complains about the
sizeof(trim->data) / sizeof(struct nvme_dsm_range) expression, since
the left hand side's element type (char) does not match the right hand
side's type. The byte buffer is unnecessary so we can remove it to clean
up the code and fix the warning at the same time.

No functional change.

Submitted by:	James Clarke <jrtc27@jrtc27.com>
Reviewed by:	imp
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D21912
2019-10-24 22:23:53 +00:00
Alexander Motin
34a5c41c43 Add kern.cam.da.X.quirks tunable, similar existing for ada.
Submitted by:	Michael Lass
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20677
2019-09-26 14:48:39 +00:00
Alexander Motin
07f7e4c8b0 Fix assumptions of only one device per SES slot.
It is typical to have one, but no longer true for multi-actuator HDDs
with separate LUN for each actuator.

MFC after:	4 days
Sponsored by:	iXsystems, Inc.
2019-09-11 03:25:30 +00:00
Alexander Motin
16614d3518 Supply SAT layer with valid transfer sizes.
This is a rework of r344701, that noticed that number of bytes passes to
8 bit sector count field gets truncated.  First decision was to not pass
anything, since ATA specs define the field as N/A.  But it appeared to be a
problem for some SAT devices, that require information about data transfer
to operate properly.  Some additional investigation shown that it is quite
a common practice to set unused fields of ATA commands (fortunately ATA
specs formally allow it) to supply the information to SAT layer.  I have
found SAS-SATA interposer that does not allow pass-through without it.

As side effect, reduce code duplication by removing ata_do_28bit_cmd()
function, replacing it with more universal ata_do_cmd().

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-09-07 15:56:00 +00:00
Alexander Motin
6a216c0bb5 Take proper lock in ses_setphyspath_callback().
XPT_DEV_ADVINFO call should be protected by the lock of the specific
device it is addressed to, not the lock of SES device.  In some weird
case, probably with hardware violating standards, it sometimes caused
NULL dereference due to race.

To protect from it further, add lock assertion to *_dev_advinfo().

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-08-29 17:02:02 +00:00
Andriy Gapon
0093b755a7 scsi_cd: whitespace cleanup
Remove trailing whitespace and fix mixed indentation.

MFC after:	3 weeks
2019-08-29 08:26:40 +00:00
Andriy Gapon
c9f2918e69 scsi_cd: ifdef out cdsize()
It was used only by the old cdcheckmedia().

MFC after:	3 weeks
2019-08-29 08:19:11 +00:00
Andriy Gapon
dd78f43259 scsi_cd: make the media check asynchronous
This makes the media check process asynchronous, so we no longer block
in cdstrategy() to check for media.

PR:		219857
Obtained from:	ken
MFC after:	3 weeks
2019-08-29 07:51:11 +00:00
Alexander Motin
8d718012fe Always check cam_periph_error() status for ERESTART.
Even if we do not expect retries, we better be sure, since otherwise it
may result in use after free kernel panic.  I've noticed that it retries
SCSI_STATUS_BUSY even with SF_NO_RECOVERY | SF_NO_RETRY.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-08-27 16:41:06 +00:00
Alexander Motin
0912877616 Make camcontrol modepage support block descriptors.
It allows to read and write block descriptors alike to mode page parameters.
It allows to change block size or short-stroke HDDs or overprovision SSDs.
Depenting on -P parameter the change can be either persistent or till reset.
In case of block size change device may need reformat after the setting.
In case of SSD overprovisioning format or sanitize may be needed to really
free the flash.

During implementation appeared that csio_encode_visit() can not handle
integers of more then 4 bytes, that makes 8-byte LBA handling awkward.
I had to split it into two 4-byte halves now.

MFC after:	1 week
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
2019-08-07 14:45:10 +00:00
Alexander Motin
1173e5a721 Reenable UNMAP support on ramdisks by default.
For some reason, I guess just mechanical editing, it was disable in r333446.

MFC after:	2 weeks
2019-07-27 18:07:46 +00:00
Alexander Motin
4b9fba0cc5 Allow WRITE SAME handle more then 2^^32 blocks.
If not limited by write_same_max_lba option, split operation into several
2^^31 blocks chunks in a loop.  For large disks it may take a while, so
setting write_same_max_lba may be useful to avoid timeouts.

While there, fix build with CAM_CTL_DEBUG.

MFC after:	2 weeks
2019-07-27 17:27:26 +00:00
Alexander Motin
ed3bf01599 Add support for Long LBA mode parameter block descriptor.
It is formally required for SBC Base 2016 feature set.

MFC after:	2 weeks
2019-07-26 19:14:12 +00:00
Alexander Motin
ae8828bad1 Add device temperature reporting into CTL.
The values to report can be set via LUN options.  It can be useful for
testing, and also required for Drive Maintenance 2016 feature set.

MFC after:	2 weeks
2019-07-26 03:49:16 +00:00
Alexander Motin
0ea67e7019 Add reporting of SCSI Feature Sets VPD page from SPC-5.
CTL implements all defined feature sets except Drive Maintenance 2016,
which is not very applicable to such a virtual device, and implemented
only partially now.  But may be it could be fixed later at least for
completeness.

MFC after:	2 weeks
2019-07-26 01:49:28 +00:00
Alexander Motin
c15a591cbd Make camcontrol sanitize support also ATA devices.
ATA sanitize is functionally identical to SCSI, just uses different
initiation commands and status reporting mechanism.

While there, make kernel better handle sanitize commands and statuses.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-07-25 18:48:31 +00:00
Alexander Motin
76d843dab2 Make CAM ATA stack handle disk resizes.
While for ATA disks resize is even more rare situation than for SCSI, it
may happen in case of HPA or AMA being used.  Make ATA XPT report minor
IDENTIFY DATA change to upper layers with AC_GETDEV_CHANGED, and ada(4)
periph driver handle that event, recalculating all the disk properties and
signalling resize to GEOM.  Since ATA has no mechanism of UNIT ATTENTIONs,
like SCSI, it has no way to detect that something has changed.  That is why
this functionality depends on explicit reprobe via XPT_REPROBE_LUN call.

MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
2019-07-23 02:11:14 +00:00
Brooks Davis
c7bacdcc32 ata_xpt: Use the correct union member when accessing valid.
In principle this should not matter as it's a union and they point to
the same memory location but based on the code above we should be
accessing .sata and not .ata.

Submitted by:	arichardson
Reviewed by:	scottl, imp
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D21002
2019-07-22 21:07:58 +00:00
Alexander Motin
89b35a5274 Add Accessible Max Address Configuration support to camcontrol.
AMA replaced HPA in ACS-3 specification.  It allows to limit size of the
disk alike to HPA, but declares inaccessible data as indeterminate.  One
of its practical use cases is to under-provision SATA SSDs for better
reliability and performance.

While there, fix HPA Security detection/reporting.

MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
2019-07-19 19:15:08 +00:00
Mark Johnston
fc795c25d4 Remove the CDIOCREADSUBCHANNEL_SYSSPACE ioctl.
This was added for emulation of Linux's CDROMSUBCHNL, but allows
users with read access to a cd(4) device to overwrite kernel memory
provided that the driver detects some media present.

Reimplement CDROMSUBCHNL by bouncing the data from CDIOCREADSUBCHANNEL
through the linux_cdrom_subchnl structure passed from userspace.

admbugs:	768
Reported by:	Alex Fortune
Security:	CVE-2019-5602
Security:	FreeBSD-SA-19:11.cd_ioctl
2019-07-03 00:10:01 +00:00