r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
handler to accept a poitner to a u_int. To make the type of the softc flags
stable and defined, make it a u_int. Cast the enum types to u_int for arg2 so
when passing to dabitsysctl it's a u_int.
Noticed by: emax@
Differential Revision: https://reviews.freebsd.org/D23785
It's valid for a periph to be removed with outstanding transactions on the
device. In CAM, multiple periphs attach to a single device. There's no interlock
to prevent one of these going away while other periphs have outstanding CCBs and
it's not an error either. Remove this overly agressive KASSERT to prevent
false-positive panics when devices depart.
know that if there are any outstanding CCBs, then when they dereference the path
that's freed at the bottom of camperiphfree there will be some flavor of
panic. This moves that eventual panic to a traceback of when we free the last
reference on the device, which is earlier but may not be early enough.
Rotating and unmapped_io are really da flags. Convert them to a flag so it will
be reported with the other flags for the device. Deprecate the .rotating and
.unmapped_io sysctls in FreeBSD 14 and remove the softc ints.
Differential Revision: https://reviews.freebsd.org/D23417
Export the current flags. They can be useful to other programs wanting to do
special thigns for removable or similar devices.
Differential Revision: https://reviews.freebsd.org/D23417
BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in
drivers when the driver doesn't support a particular command. Do a
sweep and fix that.
Reported by: imp
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
Combined with earlier nstart/nend removal it allows to remove several locks
from request path of GEOM and few other places. It would be cool if we had
more SMP-friendly statistics, but this helps too.
Sponsored by: iXsystems, Inc.
Since we are already using malloc()+copyin()/copyout() for smaller data
blocks, and since new asynchronous API does it always, I see no reason
to keep this ugly artificial size/alignment limitation in old API.
Tape applications suffer enough from the MAXPHYS limitations by itself,
and additional alignment requirement, often halving effectively usable
block size, does not help.
It would be good to use unmapped I/O here instead, but it require some
HBA drivers polishing first to support non-BIO unmapped buffers.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While it works on nda, it fails on ada and/or da for at least zfs with a modify
after free issue on a trim BIO. Revert while I rework it to fix those devices.
React to the BIO_SPEED command in the cam io scheduler by completing
as successful BIO_DELETE commands that are pending, up to the length
passed down in the BIO_SPEEDUP cmomand. The length passed down is a
hint for how much space on the drive needs to be recovered. By
completing the BIO_DELETE comomands, this allows the upper layers to
allocate and write to the blocks that were about to be trimmed. Since
FreeBSD implements TRIMSs as advisory, we can eliminliminate them and
go directly to writing.
The biggest benefit from TRIMS coomes ffrom the drive being able t
ooptimize its free block pool inthe log run. There's little nto no
bene3efit in the shoort term. , sepeciall whn the trim is followed by
a write. Speedup lets us make this tradeoff.
Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351
Rather than a trim active flag, have a counter that can be used to
have a absolute limit on the number of trims in flight independent of
any I/O limiting factors.
Sponsored by: Netflix
For each of the different queue types, list the name of the
queue. While it can be worked out from context, this makes it more
useful and clearer.
Sponsored by: Netflix
Add rate limiters to trims. Trims are a bit different than reads or
writes in that they can be combined, so some care needs to be taken
where we rate limit them. Additional work will be needed to push the
working rate limit below the I/O quanta rate for things like IOPS.
Sponsored by: Netflix
Add two sysctls to control pacing of nvme
trims. kern.cam.nda.X.goal_trim is the number of upper layer
BIO_DEELETE requests to try to collecet before sending TRIM down too
the nvme drive. trim_ticks is the number of ticks, at mosot, to wait
for at least goal_trim BIOS_DELEETE requests to come in.
Trim pacing is useful when a large number off disjoint trims are
comoing in from the upper layers. Since we have no way to chain
toogether trims from the upper layers that are sent down, this acts as
a hueristic to group trims into reasonable sized chunks. What's
reasonable varies from drive to drive.
Sponsored by: Netflix
Excesively large TRIMs can result in timeouts, which cause big
problems. Limit trims to 1GB to mititgate these issues.
Reviewed by: scottl
Differential Revision: https://reviews.freebsd.org/D22809
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
The SES4r3 standard requires that element descriptors may only contain ASCII
characters in the range 0x20 to 0x7e. Some SuperMicro expanders violate
that rule. This patch adds a sanity check to ses(4). Descriptors in
violation will be replaced by "<invalid>".
This patch fixes "sesutil --libxo xml" on such systems. Previously it would
generate non-well-formed XML output.
PR: 241929
Reviewed by: allanjude
MFC after: 2 weeks
Sponsored by: Axcient
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctiation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
All Rights Reserved on same line as other copyright holders (but not
me). Other such holders are also listed last where it's clear.
My changes in 351599 (kindly committed by avg) made the cd(4) media check
asynchronous to avoid a sleep while holding a mutex.
There was a difficult to reproduce bug with those changes that caused a
hang on boot on some single processor machines/VMs. Leandro Lupori
managed to reproduce the bug, diagnose it, and supplied a patch! Here is
his analysis, from the PR:
======
I was able to reproduce the problem described in comment#14.
Actually, I wasn't trying to reproduce it, I just started seeing it a few
weeks ago, in CURRENT.
I can reproduce it consistently, by using QEMU to run a PowerPC64 VM with a
single core/thread (-smp 1).
It happens only when there is no media in the emulated CD-ROM, a device
that QEMU adds by default, unless -nodefaults is specified in command line.
I've debugged it and this is what I've found:
1- After the CD probe is successful, GEOM will try to open the device,
which will end up calling cdcheckmedia(), that sets CD state to
CD_STATE_MEDIA_PREVENT.
2- Next, scsi_prevent() is executed and succeeds, the CD_FLAG_DISC_LOCKED
flag is set and CD state moves to CD_STATE_MEDIA_SIZE.
3- Next, scsi_read_capacity() is executed and fails, state is set to
CD_STATE_MEDIA_ALLOW, cdmediaprobedone() is called and wakes up
cdcheckmedia().
4- Then, when cdstart() is invoked to process CD_STATE_MEDIA_ALLOW, it
first checks if CD_FLAG_DISC_LOCKED is set, and if so skips directly to
CD_STATE_MEDIA_SIZE state. This will repeat the steps of bullet 3, entering
an infinite MEDIA_SIZE command loop.
When there is a least another core/thread, the GEOM thread that performed
the initial cdopen() will get scheduled again, closing the CD device, that
will call cdprevent(PR_ALLOW) that clears the CD_FLAG_DISC_LOCKED flag and
breaks the loop.
So, apparently, the problem is CD_STATE_MEDIA_ALLOW being skipped when
CD_FLAG_DISC_LOCKED is set. If I understand correctly, in this case, the
state should be advanced to CD_STATE_MEDIA size only when the current state
is CD_STATE_MEDIA_PREVENT.
=====
PR: kern/219857
Submitted by: Leandro Lupori <leandro.lupori@gmail.com>
MFC after: 1 week
I have some disks reporting "Logical unit is in process of becoming ready"
for about half an hour before finally reporting failure. During that time
CAM waits for the readiness during ~2 minutes for each request, that makes
system boot take very long time.
This change reduces wait times for the following requests to ~1 second if
previously long wait for that device has timed out.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
If we've found a device, we attempt to call xpt_action() on a ccb that's
already been released. Simply defer release until after we're done with it.
Reviewed by: imp, scottl
MFC after: 1 week
Before this change CAM used config_intrhook_establish() for this purpose,
but that approach does not allow to delay it again after releasing once.
USB stack uses root_mount_hold() to delay boot until bus scan is complete.
But once it is, CAM had no time to scan SCSI bus, registered by umass(4),
if it already done other scans and called config_intrhook_disestablish().
The new approach makes it work smooth, assuming the USB device is found
during the initial bus scan. Devices appearing on USB bus later may still
require setting kern.cam.boot_delay, but hopefully those are minority.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
When we do a daopen, we call dareprobe and wait for the results. The repoll runs
the da state machine up through the DA_STATE_RC* and then exits.
For removable media, we poll the device every 3 seconds with a TUR to see if it
has disappeared. This introduces a race. If the removable device has lots of
partitions, and if it's a little slow (like say a USB2 connected USB stick),
then we can have a fair amount of time that this reporbe is going on for. If,
during that time, damediapoll fires, it calls daschedule which changes the
scheduling priority from NONE to NORMAL. When that happens, the careful single
stepping in the da state machine is disrupted and we wind up sceduling multiple
read capacity calls. The first one succeeds and releases the reference. The
second one succeeds and releases the reference (and panics if the right code is
compiled into the da driver).
To avoid the race, only do the TUR calls while in state normal, otherwise just
reschedule damediapoll. This prevents the race from happening.
For the PROBEWP and PROBERC* states, add assertiosn that both the da device
state is in the right state, as well as the ccb state is the right one when we
enter dadone_probe{wp,rc}. This will ensure that we don't sneak through when
we're re-probing the size and write protection status of the device and thereby
leak a reference which can later lead to an invalidated peripheral going away
before all references are released (and resulting panic).
Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295
There are contexts where releasing the ccb triggers dastart() to be run
inline. When da was written, there was always a deferral, so it didn't matter
much. Now, with direct dispatch, we can call dastart from the dadone*
routines. If the probe state isn't updated, then dastart will redo things with
stale information. This normally isn't a problem, because we run the probe state
machine once at boot... Except that we also run it for each open of the device,
which means we can have multiple threads racing each other to try to kick off
the probe. However, if we update the state before we release the CCB, we can
avoid the race. While it's needed only for the probewp and proberc* states, do
it everywhere because it won't hurt the other places.
The race here happens because we reprobe dozens of times on boot when drives
have lots of partitions. We should consider caching this info for 1-2 seconds
to avoid this thundering hurd.
Reviewed by: scottl, ken
Differential Revision: https://reviews.freebsd.org/D22295