Before this patch CAM periph drivers called both disk_alloc() and
disk_create() same time on periph creation. But then prevented disks
from opening until the periph probe completion with cam_periph_hold().
As result, especially if disk misbehaves during the probe, GEOM event
thread, triggered to taste the disk, got blocked on open attempt,
potentially for a long time, unable to process other events.
This patch moves disk_create() call from periph creation to the end of
the probe. To allow disk_create() calls from non-sleepable CAM contexts
some of its duties requiring memory allocations are moved either back
to disk_alloc() or forward to g_disk_create(), so now disk_alloc() and
disk_add_alias() are the only disk methods that require sleeping. If
disk fails during the probe disk_create() may just be skipped, going
directly to disk_destroy(). Other method calls during that time are
just ignored. Since GEOM may now see the disks after CAM bus scan is
already completed, introduce per-periph boot hold functions. Enclosure
driver already had such mechanism, so just generalize it.
Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D35784
- In g_disk_start(), verify that the data to be written is initialized
according to KMSAN shadow state.
- In g_disk_done(), verify that the block driver updated shadow state as
expected, so as to catch sources of false positives early.
Sponsored by: The FreeBSD Foundation
Preallocate a geom_event (using the new geom_alloc_event) when we create
a disk. When we create the disk, we're going to be in a sleepable
context, so we can always allocate this extra bit of memory. Then use
this preallocated memory to free the disk. CAM can try to free the disk
from an unsleepable context if there was I/O outstanding when the disk
was destroyted (say because the SIM said it had gone away). The I/O
context isn't sleepable. Rather than trying to invent a retry mechanism
and making sure all the other geom_disk consumers did it properly,
preallocating the event ensure that the geom_disk will be properly torn
down, even when there's memory pressure when the disk departs.
Reviewd by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30544
Nothing implements this in the tree. Remove the ioctl and the
conversion to the geom atttribute stuff.
This was introduced in r94287 in 2002 and was retired in r113390
2003. It appeared in FreeBSD 5.0, but no other releases. This is a
vestige that was missed at the time and overlooked until now. No
compat is provided for this reason. And there's no implementation of
it today. And it was never part of a release from a stable branch.
Reviewed by: phk@
Differential Revision: https://reviews.freebsd.org/D26967
The alias needs to be part of the provider instead of the geom to work
properly. To bind the DEV geom, we need to look at the provider's names and
aliases and create the dev entries from there. If this lives in the GEOM, then
it won't propigate down the tree properly. Remove it from geom, add it provider.
Update geli, gmountver, gnop, gpart, and guzip to use it, which handles the bulk
of the uses in FreeBSD. I think this is all the providers that create a new name
based on their parent's name.
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
While some geom layers pass unknown commands down, not all do. For the ones that
don't, pass BIO_SPEEDUP down to the providers that constittue the geom, as
applicable. No changes to vinum or virstor because I was unsure how to add this
support, and I'm also unsure how to test these. gvinum doesn't implement
BIO_FLUSH either, so it may just be poorly maintained. gvirstor is for testing
and not supportig BIO_SPEEDUP is fine.
Reviewed by: chs
Differential Revision: https://reviews.freebsd.org/D23183
Combined with earlier nstart/nend removal it allows to remove several locks
from request path of GEOM and few other places. It would be cool if we had
more SMP-friendly statistics, but this helps too.
Sponsored by: iXsystems, Inc.
pp->private just can not be NULL in those places.
In g_disk_start() and g_disk_ioctl() both dp != NULL and !dp->d_destroyed
should always be true if disk_gone() and disk_destroy() are used properly,
since GEOM does not send requests to errored providers. If the protocol is
not followed, then no amount of additional checks here give real safety.
In g_disk_access() though the checks are useful, since GEOM blocks only
new opens for errored providers, but allows closes. It should not happen
if disk_gone() and disk_destroy() are used properly, but may otherwise.
To improve cases when disk_gone() is not used, call it from disk_destroy().
It does not give full guaranties, but it errors the provider and makes
GEOM block unwanted requests at least after some race.
MFC after: 2 weeks
via 'diskinfo -v'. This avoids the need to track it down via CAM,
and should also work for disks that don't use CAM. And since it's
inherited thru the GEOM hierarchy, in most cases one doesn't need
to walk the GEOM graph either, eg you can use it on a partition
instead of disk itself.
Reviewed by: allanjude, imp
Sponsored by: Klara Inc
Differential Revision: https://reviews.freebsd.org/D22249
When it comes to megabytes of text, difference between sbuf_printf() and
sbuf_cat() becomes substantial.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Ths change consists of two parts.
geom_disk: deny opening a disk for writing if it's marked as
write-protected. A new disk(9) flag is added to mark write protected
disks. A possible alternative could be to add another parameter to d_open,
so that the open mode could be passed to it and the disk drivers could
make the decision internally, but the flag required less churn.
scsi_da: add a new phase of disk probing to query the all pages mode
sense page. We can determine if the disk is write protected using bit 7
of the device specific field in the mode parameter header returned by
MODE SENSE.
PR: 224037
Reviewed by: mav
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D13360
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Implement disk_add_alias to allow aliases to be added to disks. All
disk have a primary name (say "foo") can also have secondary names
(say "bar") such that all instances of "foo" also have a "bar"
alias. So if you have foo0, foo0p1, foo1, foo1s1 and foo1s1a nodes
created by the foo driver and gpart, device nodes bar0, bar0p1, bar1,
bar1s1 and bar1s1a will appear as symlinks back to the original nodes.
This generalizes to multiple aliases. However, since the unit number
follows the primary name, multiple device drivers can't create the
same aliases unless those drives coorinate the unit number space (eg
you couldn't add an alias 'disk' to both 'da' and 'ada' because it's
possible to have da0 and ada0, because 'disk0' is ambiguous).
Differential Revision: https://reviews.freebsd.org/D11873
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.
Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366
report the size only after first opening. And due to the events are
asynchronous, some consumers can receive this event too late and
this confuses them. This partially restores previous behaviour, and
at the same time this should fix the problem, when already opened
provider loses resize event.
PR: 211028
MFC after: 3 weeks
when it is being opened. This should fix the possible loss of a resize
event when disk capacity changed.
PR: 211028
Reported by: Dexuan Cui <decui at microsoft dot com>
MFC after: 3 weeks
The GEOM disk d_mtx is only acquired on disk creation and destruction.
It is a good candidate for replacement with a pool mutex. This eliminates
the mutex initialization and teardown and the mutex and name variables
themselves from struct disk.
sys/geom/geom_disk.h:
Take d_mtx and d_mtx_name out of struct disk.
sys/geom/geom_disk.c:
Use mtx_pool_lock() and mtx_pool_unlock() to guard the disk
initialization state instead of a dedicated mutex.
This allows removing the initialization and destruction of
d_mtx.
sys/sys/param.h:
Bump __FreeBSD_version to 1100119 for the change to struct disk.
Suggested by: jhb
Sponsored by: Spectra Logic
Approved by: re (gjb)
device is gone.
The problem was that when disk_gone() is called, if the GEOM disk
creation process has not yet happened, the withering process
couldn't start.
We didn't record any state in the GEOM disk code, and so the d_gone()
callback to the da(4) driver never happened.
The solution is to track the state of the creation process, and
initiate the withering process from g_disk_create() if the disk is
being created.
This change does add fields to struct disk, and so I have bumped
DISK_VERSION.
geom_disk.c: Track where we are in the disk creation process,
and check to see whether our underlying disk has
gone away or not.
In disk_gone(), set a new d_goneflag variable that
g_disk_create() can check to see if it needs to
clean up the disk instance.
geom_disk.h: Add a mutex to struct disk (for internal use) disk
init level, and a gone flag.
Bump DISK_VERSION because the size of struct disk has
changed and fields have been added at the beginning.
Sponsored by: Spectra Logic
Approved by: re (marius)
This change includes support for SCSI SMR drives (which conform to the
Zoned Block Commands or ZBC spec) and ATA SMR drives (which conform to
the Zoned ATA Command Set or ZAC spec) behind SAS expanders.
This includes full management support through the GEOM BIO interface, and
through a new userland utility, zonectl(8), and through camcontrol(8).
This is now ready for filesystems to use to detect and manage zoned drives.
(There is no work in progress that I know of to use this for ZFS or UFS, if
anyone is interested, let me know and I may have some suggestions.)
Also, improve ATA command passthrough and dispatch support, both via ATA
and ATA passthrough over SCSI.
Also, add support to camcontrol(8) for the ATA Extended Power Conditions
feature set. You can now manage ATA device power states, and set various
idle time thresholds for a drive to enter lower power states.
Note that this change cannot be MFCed in full, because it depends on
changes to the struct bio API that break compatilibity. In order to
avoid breaking the stable API, only changes that don't touch or depend on
the struct bio changes can be merged. For example, the camcontrol(8)
changes don't depend on the new bio API, but zonectl(8) and the probe
changes to the da(4) and ada(4) drivers do depend on it.
Also note that the SMR changes have not yet been tested with an actual
SCSI ZBC device, or a SCSI to ATA translation layer (SAT) that supports
ZBC to ZAC translation. I have not yet gotten a suitable drive or SAT
layer, so any testing help would be appreciated. These changes have been
tested with Seagate Host Aware SATA drives attached to both SAS and SATA
controllers. Also, I do not have any SATA Host Managed devices, and I
suspect that it may take additional (hopefully minor) changes to support
them.
Thanks to Seagate for supplying the test hardware and answering questions.
sbin/camcontrol/Makefile:
Add epc.c and zone.c.
sbin/camcontrol/camcontrol.8:
Document the zone and epc subcommands.
sbin/camcontrol/camcontrol.c:
Add the zone and epc subcommands.
Add auxiliary register support to build_ata_cmd(). Make sure to
set the CAM_ATAIO_NEEDRESULT, CAM_ATAIO_DMA, and CAM_ATAIO_FPDMA
flags as appropriate for ATA commands.
Add a new get_ata_status() function to parse ATA result from SCSI
sense descriptors (for ATA passthrough over SCSI) and ATA I/O
requests.
sbin/camcontrol/camcontrol.h:
Update the build_ata_cmd() prototype
Add get_ata_status(), zone(), and epc().
sbin/camcontrol/epc.c:
Support for ATA Extended Power Conditions features. This includes
support for all features documented in the ACS-4 Revision 12
specification from t13.org (dated February 18, 2016).
The EPC feature set allows putting a drive into a power power mode
immediately, or setting timeouts so that the drive will
automatically enter progressively lower power states after various
idle times.
sbin/camcontrol/fwdownload.c:
Update the firmware download code for the new build_ata_cmd()
arguments.
sbin/camcontrol/zone.c:
Implement support for Shingled Magnetic Recording (SMR) drives
via SCSI Zoned Block Commands (ZBC) and ATA Zoned Device ATA
Command Set (ZAC).
These specs were developed in concert, and are functionally
identical. The primary differences are due to SCSI and ATA
differences. (SCSI is big endian, ATA is little endian, for
example.)
This includes support for all commands defined in the ZBC and
ZAC specs.
sys/cam/ata/ata_all.c:
Decode a number of additional ATA command names in ata_op_string().
Add a new CCB building function, ata_read_log().
Add ata_zac_mgmt_in() and ata_zac_mgmt_out() CCB building
functions. These support both DMA and NCQ encapsulation.
sys/cam/ata/ata_all.h:
Add prototypes for ata_read_log(), ata_zac_mgmt_out(), and
ata_zac_mgmt_in().
sys/cam/ata/ata_da.c:
Revamp the ada(4) driver to support zoned devices.
Add four new probe states to gather information needed for zone
support.
Add a new adasetflags() function to avoid duplication of large
blocks of flag setting between the async handler and register
functions.
Add new sysctl variables that describe zone support and paramters.
Add support for the new BIO_ZONE bio, and all of its subcommands:
DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.
sys/cam/scsi/scsi_all.c:
Add command descriptions for the ZBC IN/OUT commands.
Add descriptions for ZBC Host Managed devices.
Add a new function, scsi_ata_pass() to do ATA passthrough over
SCSI. This will eventually replace scsi_ata_pass_16() -- it
can create the 12, 16, and 32-byte variants of the ATA
PASS-THROUGH command, and supports setting all of the
registers defined as of SAT-4, Revision 5 (March 11, 2016).
Change scsi_ata_identify() to use scsi_ata_pass() instead of
scsi_ata_pass_16().
Add a new scsi_ata_read_log() function to facilitate reading
ATA logs via SCSI.
sys/cam/scsi/scsi_all.h:
Add the new ATA PASS-THROUGH(32) command CDB. Add extended and
variable CDB opcodes.
Add Zoned Block Device Characteristics VPD page.
Add ATA Return SCSI sense descriptor.
Add prototypes for scsi_ata_read_log() and scsi_ata_pass().
sys/cam/scsi/scsi_da.c:
Revamp the da(4) driver to support zoned devices.
Add five new probe states, four of which are needed for ATA
devices.
Add five new sysctl variables that describe zone support and
parameters.
The da(4) driver supports SCSI ZBC devices, as well as ATA ZAC
devices when they are attached via a SCSI to ATA Translation (SAT)
layer. Since ZBC -> ZAC translation is a new feature in the T10
SAT-4 spec, most SATA drives will be supported via ATA commands
sent via the SCSI ATA PASS-THROUGH command. The da(4) driver will
prefer the ZBC interface, if it is available, for performance
reasons, but will use the ATA PASS-THROUGH interface to the ZAC
command set if the SAT layer doesn't support translation yet.
As I mentioned above, ZBC command support is untested.
Add support for the new BIO_ZONE bio, and all of its subcommands:
DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.
Add scsi_zbc_in() and scsi_zbc_out() CCB building functions.
Add scsi_ata_zac_mgmt_out() and scsi_ata_zac_mgmt_in() CCB/CDB
building functions. Note that these have return values, unlike
almost all other CCB building functions in CAM. The reason is
that they can fail, depending upon the particular combination
of input parameters. The primary failure case is if the user
wants NCQ, but fails to specify additional CDB storage. NCQ
requires using the 32-byte version of the SCSI ATA PASS-THROUGH
command, and the current CAM CDB size is 16 bytes.
sys/cam/scsi/scsi_da.h:
Add ZBC IN and ZBC OUT CDBs and opcodes.
Add SCSI Report Zones data structures.
Add scsi_zbc_in(), scsi_zbc_out(), scsi_ata_zac_mgmt_out(), and
scsi_ata_zac_mgmt_in() prototypes.
sys/dev/ahci/ahci.c:
Fix SEND / RECEIVE FPDMA QUEUED in the ahci(4) driver.
ahci_setup_fis() previously set the top bits of the sector count
register in the FIS to 0 for FPDMA commands. This is okay for
read and write, because the PRIO field is in the only thing in
those bits, and we don't implement that further up the stack.
But, for SEND and RECEIVE FPDMA QUEUED, the subcommand is in that
byte, so it needs to be transmitted to the drive.
In ahci_setup_fis(), always set the the top 8 bits of the
sector count register. We need it in both the standard
and NCQ / FPDMA cases.
sys/geom/eli/g_eli.c:
Pass BIO_ZONE commands through the GELI class.
sys/geom/geom.h:
Add g_io_zonecmd() prototype.
sys/geom/geom_dev.c:
Add new DIOCZONECMD ioctl, which allows sending zone commands to
disks.
sys/geom/geom_disk.c:
Add support for BIO_ZONE commands.
sys/geom/geom_disk.h:
Add a new flag, DISKFLAG_CANZONE, that indicates that a given
GEOM disk client can handle BIO_ZONE commands.
sys/geom/geom_io.c:
Add a new function, g_io_zonecmd(), that handles execution of
BIO_ZONE commands.
Add permissions check for BIO_ZONE commands.
Add command decoding for BIO_ZONE commands.
sys/geom/geom_subr.c:
Add DDB command decoding for BIO_ZONE commands.
sys/kern/subr_devstat.c:
Record statistics for REPORT ZONES commands. Note that the
number of bytes transferred for REPORT ZONES won't quite match
what is received from the harware. This is because we're
necessarily counting bytes coming from the da(4) / ada(4) drivers,
which are using the disk_zone.h interface to communicate up
the stack. The structure sizes it uses are slightly different
than the SCSI and ATA structure sizes.
sys/sys/ata.h:
Add many bit and structure definitions for ZAC, NCQ, and EPC
command support.
sys/sys/bio.h:
Convert the bio_cmd field to a straight enumeration. This will
yield more space for additional commands in the future. After
change r297955 and other related changes, this is now possible.
Converting to an enumeration will also prevent use as a bitmask
in the future.
sys/sys/disk.h:
Define the DIOCZONECMD ioctl.
sys/sys/disk_zone.h:
Add a new API for managing zoned disks. This is very close to
the SCSI ZBC and ATA ZAC standards, but uses integers in native
byte order instead of big endian (SCSI) or little endian (ATA)
byte arrays.
This is intended to offer to the complete feature set of the ZBC
and ZAC disk management without requiring the application developer
to include SCSI or ATA headers. We also use one set of headers
for ioctl consumers and kernel bio-level consumers.
sys/sys/param.h:
Bump __FreeBSD_version for sys/bio.h command changes, and inclusion
of SMR support.
usr.sbin/Makefile:
Add the zonectl utility.
usr.sbin/diskinfo/diskinfo.c
Add disk zoning capability to the 'diskinfo -v' output.
usr.sbin/zonectl/Makefile:
Add zonectl makefile.
usr.sbin/zonectl/zonectl.8
zonectl(8) man page.
usr.sbin/zonectl/zonectl.c
The zonectl(8) utility. This allows managing SCSI or ATA zoned
disks via the disk_zone.h API. You can report zones, reset write
pointers, get parameters, etc.
Sponsored by: Spectra Logic
Differential Revision: https://reviews.freebsd.org/D6147
Reviewed by: wblock (documentation)
sys/geom/geom_disk.c:
disk_attr_changed(): Generate a devctl event of type GEOM:<attr> for
every call.
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D5952
Add a new bp argument to g_disk_maxsegs(), and add a new function,
g_disk_maxsize() tha will properly determine the maximum I/O size for a
delete or non-delete bio.
Submitted by: will
MFC after: 1 week
Sponsored by: Spectra Logic
camdd(8) utility.
CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and
completed CCBs may be retrieved via the CAMIOGET ioctl. User
processes can use poll(2) or kevent(2) to get notification when
I/O has completed.
While the existing CAMIOCOMMAND blocking ioctl interface only
supports user virtual data pointers in a CCB (generally only
one per CCB), the new CAMIOQUEUE ioctl supports user virtual and
physical address pointers, as well as user virtual and physical
scatter/gather lists. This allows user applications to have more
flexibility in their data handling operations.
Kernel memory for data transferred via the queued interface is
allocated from the zone allocator in MAXPHYS sized chunks, and user
data is copied in and out. This is likely faster than the
vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in
configurations with many processors (there are more TLB shootdowns
caused by the mapping/unmapping operation) but may not be as fast
as running with unmapped I/O.
The new memory handling model for user requests also allows
applications to send CCBs with request sizes that are larger than
MAXPHYS. The pass(4) driver now limits queued requests to the I/O
size listed by the SIM driver in the maxio field in the Path
Inquiry (XPT_PATH_INQ) CCB.
There are some things things would be good to add:
1. Come up with a way to do unmapped I/O on multiple buffers.
Currently the unmapped I/O interface operates on a struct bio,
which includes only one address and length. It would be nice
to be able to send an unmapped scatter/gather list down to
busdma. This would allow eliminating the copy we currently do
for data.
2. Add an ioctl to list currently outstanding CCBs in the various
queues.
3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do
that.
4. Test physical address support. Virtual pointers and scatter
gather lists have been tested, but I have not yet tested
physical addresses or scatter/gather lists.
5. Investigate multiple queue support. At the moment there is one
queue of commands per pass(4) device. If multiple processes
open the device, they will submit I/O into the same queue and
get events for the same completions. This is probably the right
model for most applications, but it is something that could be
changed later on.
Also, add a new utility, camdd(8) that uses the asynchronous pass(4)
driver interface.
This utility is intended to be a basic data transfer/copy utility,
a simple benchmark utility, and an example of how to use the
asynchronous pass(4) interface.
It can copy data to and from pass(4) devices using any target queue
depth, starting offset and blocksize for the input and ouptut devices.
It currently only supports SCSI devices, but could be easily extended
to support ATA devices.
It can also copy data to and from regular files, block devices, tape
devices, pipes, stdin, and stdout. It does not support queueing
multiple commands to any of those targets, since it uses the standard
read(2)/write(2)/writev(2)/readv(2) system calls.
The I/O is done by two threads, one for the reader and one for the
writer. The reader thread sends completed read requests to the
writer thread in strictly sequential order, even if they complete
out of order. That could be modified later on for random I/O patterns
or slightly out of order I/O.
camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from
the pass(4) driver and also to send request notifications internally.
For pass(4) devcies, camdd(8) uses a single buffer (CAM_DATA_VADDR)
per CAM CCB on the reading side, and a scatter/gather list
(CAM_DATA_SG) on the writing side. In addition to testing both
interfaces, this makes any potential reblocking of I/O easier. No
data is copied between the reader and the writer, but rather the
reader's buffers are split into multiple I/O requests or combined
into a single I/O request depending on the input and output blocksize.
For the file I/O path, camdd(8) also uses a single buffer (read(2),
write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list
(readv(2), writev(2), preadv(2), pwritev(2)) on writes.
Things that would be nice to do for camdd(8) eventually:
1. Add support for I/O pattern generation. Patterns like all
zeros, all ones, LBA-based patterns, random patterns, etc. Right
Now you can always use /dev/zero, /dev/random, etc.
2. Add support for a "sink" mode, so we do only reads with no
writes. Right now, you can use /dev/null.
3. Add support for automatic queue depth probing, so that we can
figure out the right queue depth on the input and output side
for maximum throughput. At the moment it defaults to 6.
4. Add support for SATA device passthrough I/O.
5. Add support for random LBAs and/or lengths on the input and
output sides.
6. Track average per-I/O latency and busy time. The busy time
and latency could also feed in to the automatic queue depth
determination.
sys/cam/scsi/scsi_pass.h:
Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue
and fetch asynchronous CAM CCBs respectively.
Although these ioctls do not have a declared argument, they
both take a union ccb pointer. If we declare a size here,
the ioctl code in sys/kern/sys_generic.c will malloc and free
a buffer for either the CCB or the CCB pointer (depending on
how it is declared). Since we have to keep a copy of the
CCB (which is fairly large) anyway, having the ioctl malloc
and free a CCB for each call is wasteful.
sys/cam/scsi/scsi_pass.c:
Add asynchronous CCB support.
Add two new ioctls, CAMIOQUEUE and CAMIOGET.
CAMIOQUEUE adds a CCB to the incoming queue. The CCB is
executed immediately (and moved to the active queue) if it
is an immediate CCB, but otherwise it will be executed
in passstart() when a CCB is available from the transport layer.
When CCBs are completed (because they are immediate or
passdone() if they are queued), they are put on the done
queue.
If we get the final close on the device before all pending
I/O is complete, all active I/O is moved to the abandoned
queue and we increment the peripheral reference count so
that the peripheral driver instance doesn't go away before
all pending I/O is done.
The new passcreatezone() function is called on the first
call to the CAMIOQUEUE ioctl on a given device to allocate
the UMA zones for I/O requests and S/G list buffers. This
may be good to move off to a taskqueue at some point.
The new passmemsetup() function allocates memory and
scatter/gather lists to hold the user's data, and copies
in any data that needs to be written. For virtual pointers
(CAM_DATA_VADDR), the kernel buffer is malloced from the
new pass(4) driver malloc bucket. For virtual
scatter/gather lists (CAM_DATA_SG), buffers are allocated
from a new per-pass(9) UMA zone in MAXPHYS-sized chunks.
Physical pointers are passed in unchanged. We have support
for up to 16 scatter/gather segments (for the user and
kernel S/G lists) in the default struct pass_io_req, so
requests with longer S/G lists require an extra kernel malloc.
The new passcopysglist() function copies a user scatter/gather
list to a kernel scatter/gather list. The number of elements
in each list may be different, but (obviously) the amount of data
stored has to be identical.
The new passmemdone() function copies data out for the
CAM_DATA_VADDR and CAM_DATA_SG cases.
The new passiocleanup() function restores data pointers in
user CCBs and frees memory.
Add new functions to support kqueue(2)/kevent(2):
passreadfilt() tells kevent whether or not the done
queue is empty.
passkqfilter() adds a knote to our list.
passreadfiltdetach() removes a knote from our list.
Add a new function, passpoll(), for poll(2)/select(2)
to use.
Add devstat(9) support for the queued CCB path.
sys/cam/ata/ata_da.c:
Add support for the BIO_VLIST bio type.
sys/cam/cam_ccb.h:
Add a new enumeration for the xflags field in the CCB header.
(This doesn't change the CCB header, just adds an enumeration to
use.)
sys/cam/cam_xpt.c:
Add a new function, xpt_setup_ccb_flags(), that allows specifying
CCB flags.
sys/cam/cam_xpt.h:
Add a prototype for xpt_setup_ccb_flags().
sys/cam/scsi/scsi_da.c:
Add support for BIO_VLIST.
sys/dev/md/md.c:
Add BIO_VLIST support to md(4).
sys/geom/geom_disk.c:
Add BIO_VLIST support to the GEOM disk class. Re-factor the I/O size
limiting code in g_disk_start() a bit.
sys/kern/subr_bus_dma.c:
Change _bus_dmamap_load_vlist() to take a starting offset and
length.
Add a new function, _bus_dmamap_load_pages(), that will load a list
of physical pages starting at an offset.
Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios.
Allow unmapped I/O to start at an offset.
sys/kern/subr_uio.c:
Add two new functions, physcopyin_vlist() and physcopyout_vlist().
sys/pc98/include/bus.h:
Guard kernel-only parts of the pc98 machine/bus.h header with
#ifdef _KERNEL.
This allows userland programs to include <machine/bus.h> to get the
definition of bus_addr_t and bus_size_t.
sys/sys/bio.h:
Add a new bio flag, BIO_VLIST.
sys/sys/uio.h:
Add prototypes for physcopyin_vlist() and physcopyout_vlist().
share/man/man4/pass.4:
Document the CAMIOQUEUE and CAMIOGET ioctls.
usr.sbin/Makefile:
Add camdd.
usr.sbin/camdd/Makefile:
Add a makefile for camdd(8).
usr.sbin/camdd/camdd.8:
Man page for camdd(8).
usr.sbin/camdd/camdd.c:
The new camdd(8) utility.
Sponsored by: Spectra Logic
MFC after: 1 week
and the following r273143 commit, supposed to workaround introduced issue by
quite innocent-looking change.
While there is no clear understanding why, but r273143 is accused in data
corruption in some environments with high I/O load. I personally don't see
any problem in that commit, and possibly it is just a trigger to some other
bug somewhere, but better safe then sorry for now.
Requested by: scottl@
MFC after: 3 days
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.
The defined now safety requirements are:
- caller should not hold any locks and should be reenterable;
- callee should not depend on GEOM dual-threaded concurency semantics;
- on the way down, if request is unmapped while callee doesn't support it,
the context should be sleepable;
- kernel thread stack usage should be below 50%.
To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
- G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
- G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
- G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
- G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe. If any of requirements are not met, request is queued to
g_up or g_down thread same as before.
Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).
To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.
This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).
Sponsored by: iXsystems, Inc.
MFC after: 2 months