properly.
If there is garbage in the flags field, it can sometimes include a
set CDAI_FLAG_STORE flag, which may cause either an error or
perhaps result in overwriting the field that was intended to be
read.
sys/cam/cam_ccb.h:
Add a new flag to the XPT_DEV_ADVINFO CCB, CDAI_FLAG_NONE,
that callers can use to set the flags field when no store
is desired.
sys/cam/scsi/scsi_enc_ses.c:
In ses_setphyspath_callback(), explicitly set the
XPT_DEV_ADVINFO flags to CDAI_FLAG_NONE when fetching the
physical path information. Instead of ORing in the
CDAI_FLAG_STORE flag when storing the physical path, set
the flags field to CDAI_FLAG_STORE.
sys/cam/scsi/scsi_sa.c:
Set the XPT_DEV_ADVINFO flags field to CDAI_FLAG_NONE when
fetching extended inquiry information.
sys/cam/scsi/scsi_da.c:
When storing extended READ CAPACITY information, set the
XPT_DEV_ADVINFO flags field to CDAI_FLAG_STORE instead of
ORing it into a field that isn't initialized.
sys/dev/mpr/mpr_sas.c,
sys/dev/mps/mps_sas.c:
When fetching extended READ CAPACITY information, set the
XPT_DEV_ADVINFO flags field to CDAI_FLAG_NONE instead of
setting it to 0.
sbin/camcontrol/camcontrol.c:
When fetching a device ID, set the XPT_DEV_ADVINFO flags
field to CDAI_FLAG_NONE instead of 0.
sys/sys/param.h:
Bump __FreeBSD_version to 1100061 for the new XPT_DEV_ADVINFO
CCB flag, CDAI_FLAG_NONE.
Sponsored by: Spectra Logic
MFC after: 1 week
This change by 2-3 times improves performance of misaligned XCOPY and WUT
commands by avoiding unneeded read-modify-write cycles inside ZFS.
MFC after: 1 week
This change by 2-3 times improves performance of misaligned WRITE SAME
commands by avoiding unneeded read-modify-write cycles inside ZFS.
MFC after: 1 week
target iSCSI offload. Add mechanism to query maximum receive data segment
size supported by chosen hardware offload module, and use it in ctld(8)
to determine the value to advertise to the other side.
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
This VPD page is effectively an extension of the standard Inquiry
data page, and includes lots of additional bits.
This commit includes support for probing the page in the SCSI probe code,
and an additional request type for the XPT_DEV_ADVINFO CCB. CTL already
supports the Extended Inquiry page.
Support for querying this page in the sa(4) driver will come later.
sys/cam/scsi/scsi_xpt.c:
Probe the Extended Inquiry page, if the device supports it, and
return it in response to a XPT_DEV_ADVINFO CCB if it is requested.
sys/cam/scsi/cam_ccb.h:
Define a new advanced information CCB data type, CDAI_TYPE_EXT_INQ.
sys/cam/cam_xpt.c:
Free the extended inquiry data in a device when the device goes
away.
sys/cam/cam_xpt_internal.h:
Add an extended inquiry data pointer and length to struct cam_ed.
sys/sys/param.h
Bump __FreeBSD_version for the addition of the new
CDAI_TYPE_EXT_INQ advanced information type.
Sponsored by: Spectra Logic
MFC after: 1 week
While ctld(8) still does not allow multiple portal groups per target
to be configured, kernel should now be able to handle it.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
VMware returns BUSY status when storage has transient connectivity issues.
It is often better to wait and let VM admin fix the problem then crash.
Discussed with: ken
MFC after: 1 week
Replace iSCSI-specific LUN mapping mechanism with new one, working for any
ports. By default all ports are created without LUN mapping, exposing all
CTL LUNs as before. But, if needed, LUN mapping can be manually set on
per-port basis via ctladm. For its iSCSI ports ctld does it via ioctl(2).
The next step will be to teach ctld to work with FibreChannel ports also.
Respecting additional flexibility of the new mechanism, ctl.conf now allows
alternative syntax for LUN definition. LUNs can now be defined in global
context, and then referenced from targets by unique name, as needed. It
allows same LUN to be exposed several times via multiple targets.
While there, increase limit for LUNs per target in ctld from 256 to 1024.
Some initiators do not support LUNs above 255, but that is not our problem.
Discussed with: trasz
MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.
sys/cam/scsi/scsi_all.h:
In struct scsi_extended_inquiry_data:
- Increase the length field to 2 bytes, as it is 2 bytes in SPC-4.
- Add bit definitions for the various Activiate Microcode actions.
- Add the Sequential Access Logical Block Protection support bit,
since we need that in the sa(4) driver. (For modifications
that will come later.)
- Add definitions for the various Multi I_T Nexus Microcode
Download modes.
sys/cam/ctl/ctl.c:
As of SPC-4, a single report of "REPORTED LUNS DATA HAS CHANGED"
is to be given per I_T nexus. Once it is reported, the unit
attention condition should be cleared for all LUNS attached to
an I_T nexus.
Previously that only happened when a REPORT LUNS command was
processed.
This behavior may be different (according to SAM-5) when the
UA_INTLCK_CTRL bits are non-zero in the control mode page but
CTL does not currently support that.
So, in view of the spec, whenever we report a LUN inventory
change unit attention, clear it on all LUNs for that
particular I_T nexus.
Add a new function, ctl_clear_ua() that will clear a unit
attention on all LUNs for the given I_T nexus.
One field in the extended inquiry data that we could potentially
report at some point is the maximum supported sense data length.
To do that, we would the SIM to report (via path inquiry
perhaps) how much sense data it is able to send.
Add comments to explain some of the bits that are set in the
Extended Inquiry VPD page.
Add a few comments to make it more clear which functions handle
various VPD pages.
Sponsored by: Spectra Logic
MFC after: 1 week
This could cause data corruption due to accessing wrong LUN in case of
retries on write errors. Failed writes were retried to read LUN.
MFC after: 3 days
Define it as an atomic uint32_t. These increments happen infrequently
enough for the atomic overhead to be a problem, and since they're now
independent atomics, they won't contend with xpt_lock_buses().
This counter is useful as a means of cheaply identifying whether any changes
have been made to the CAM peripheral list. Userland programs have no guarantee
that the counter won't change on them while being returned or while processing
the information, so they must be written accordingly.
Discussed with: ken, mav (in general)
MFC after: 1 week
Sponsored by: Spectra Logic
If we aggregated status sending with data move and got error, allow status
to be updated and resent again separately. Without this command may stuck
without status sent at all.
MFC after: 2 weeks
This includes a new summary mode (-s) for camcontrol defects that
quickly tells the user the most important thing: how many defects
are in the requested list. The actual location of the defects is
less important.
Modern drives frequently have more than the 8191 defects that can
be reported by the READ DEFECT DATA (10) command. If they don't
have that many grown defects, they certainly have more than 8191
defects in the primary (i.e. factory) defect list.
The READ DEFECT DATA (12) command allows for longer parameter
lists, as well as indexing into the list of defects, and so allows
reporting many more defects.
This has been tested with HGST drives and Seagate drives, but
does not fully work with Seagate drives. Once I have a Seagate
spec I may be able to determine whether it is possible to make it
work with Seagate drives.
scsi_da.h: Add a definition for the new long block defect
format.
Add bit and mask definitions for the new extended
physical sector and bytes from index defect
formats.
Add a prototype for the new scsi_read_defects() CDB
building function.
scsi_da.c: Add a new scsi_read_defects() CDB building function.
camcontrol(8) was previously composing CDBs manually.
This is long overdue.
camcontrol.c: Revamp the camcontrol defects subcommand. We now
go through multiple stages in trying to get defect
data off the drive while avoiding various drive
firmware quirks.
We start off by requesting the defect header with
the 10 byte command. If we're in summary mode (-s)
and the drive reports fewer defects than can be
represented in the 10 byte header, we're done.
Otherwise, we know that we need to issue the
12 byte command if the drive reports the maximum
number of defects.
If we're in summary mode, we're done if we get a
good response back when asking for the 12 byte header.
If the user has asked for the full list, then we
use the address descriptor index field in the 12
byte CDB to step through the list in 64K chunks.
64K is small enough to work with most any ancient
or modern SCSI controller.
Add support for printing the new long block defect
format, as well as the extended physical sector and
bytes from index formats. I don't have any drives
that support the new formats.
Add a hexadecimal output format that can be turned
on with -X.
Add a quiet mode (-q) that can be turned on with
the summary mode (-s) to just print out a number.
Revamp the error detection and recovery code for
the defects command to work with HGST drives.
Call the new scsi_read_defects() CDB building
function instead of rolling the CDB ourselves.
Pay attention to the residual from the defect list
request when printing it out, so we don't run off
the end of the list.
Use the new scsi_nv library routines to convert
from strings to numbers and back.
camcontrol.8: Document the new defect formats (longblock, extbfi,
extphys) and command line options (-q, -s, -S and
-X) for the defects subcommand.
Explain a little more about what drives generally
do and don't support.
Sponsored by: Spectra Logic
MFC after: 1 week
data to go undetected.
The probe code does an MD5 checksum of the inquiry data (and page
0x80 serial number if available) before doing a reprobe of an
existing device, and then compares a checksum after the probe to
see whether the device has changed.
This check was broken in January, 2000 by change 56146 when the extended
inquiry probe code was added.
In the extended inquiry probe case, it was calculating the checksum
a second time. The second time it included the updated inquiry
data from the short inquiry probe (first 36 bytes). So it wouldn't
catch cases where the vendor, product, revision, etc. changed.
This change will have the effect that when a device's inquiry data is
updated and a rescan is issued, it will disappear and then reappear.
This is the appropriate action, because if the inquiry data or serial
number changes, it is either a different device or the device
configuration may have changed significantly. (e.g. with updated
firmware.)
scsi_xpt.c: Don't calculate the initial MD5 checksum on
standard inquiry data and the page 0x80 serial
number if we have already calculated it.
MFC after: 1 week
Sponsored by: Spectra Logic
While in most cases CTL should correctly fetch those values from backing
storages, there are some initiators (like MS SQL), that may not like large
physical block sizes, even if they are true. For such cases allow override
fetched values with supported ones (like 4K).
MFC after: 1 week
While we don't support MCS, hole in received sequence numbers may mean
only PDU loss. While we don't support lost PDU recovery, terminate the
connection to avoid stuck commands.
While there, improve handling of sequence numbers wrap after 2^32 PDUs.
MFC after: 2 weeks
Technically read requests can be executed in any order or simultaneously
since they are not changing any data. But ZFS prefetcher goes crasy when
it receives consecutive requests from different threads. Since prefetcher
works on level of separate blocks, instead of two consecutive 128K requests
it may receive 32 8K requests in mixed order.
This patch is more workaround then a real fix, and it does not fix all of
prefetcher problems, but it improves sequential read speed by 3-4x times
in some configurations. On the other side it may hurt performance if
some backing store has no prefetch, that is why it is disabled by default
for raw devices.
MFC after: 2 weeks
While without UNMAP support there is not much initiator can do about it,
the administrator still better be notified about the storage overflow.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Previously it was supported only for ZVOL-backed LUNs, but now should work
for file-backed LUNs too. Used value in this case is a space occupied by
the backing file, while available value is an available space on file
system. Pool thresholds are still not implemented in this case.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
It is implemented for LUNs backed by ZVOLs in "dev" mode and files.
GEOM has no such API, so for LUNs backed by raw devices all LBAs will
be reported as mapped/unknown.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
After recent optimizations this change is no longer blocked by CTL memory
consumption. Those limits are still not free, but much cheaper now.
MFC after: 1 week
Relnotes: yes
Sponsored by: iXsystems, Inc.
Abusing ability of major UAs cover minor ones we may not account UAs for
inactive ports. Allocate UAs storage for port and start accounting only
after some initiator from that port fetched its first POWER ON OCCURRED.
This reduces per-LUN CTL memory usage from >1MB to less then 100K.
MFC after: 1 month
In configurations with many ports, like iSCSI, each LUN is typically
accessed only by limited subset of ports. Allocating that memory on
demand allows to reduce CTL memory usage from 5.3MB/LUN to 1.3MB/LUN.
MFC after: 1 month
interpreating NULLs as EOLs, but converting them to spaces.
SPC-4 does not tell that T10-based IDs should be NULL-terminated/padded.
And while it tells that it should include only ASCII chars (0x20-0x7F),
there are some USB sticks (SanDisk Ultra Fit), that have NULLs inside
the value. Treating NULLs as EOLs there made those LUN IDs non-unique.
MFC after: 1 week
Make CTL core and block backend set success status before initiating last
data move for read commands. Make CAM target and iSCSI frontends detect
such condition and send command status together with data. New I/O flag
allows to skip duplicate status sending on later fe_done() call.
For Fibre Channel this change saves one of three interrupts per read command,
increasing performance from 126K to 160K IOPS. For iSCSI this change saves
one of three PDUs per read command, increasing performance from 1M to 1.2M
IOPS.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Old allocator created significant lock congestion protecting its lists
of preallocated I/Os, while UMA provides much better SMP scalability.
The downside of UMA is lack of reliable preallocation, that could guarantee
successful allocation in non-sleepable environments. But careful code
review shown, that only CAM target frontend really has that requirement.
Fix that making that frontend preallocate and statically bind CTL I/O for
every ATIO/INOT it preallocates any way. That allows to avoid allocations
in hot I/O path. Other frontends either may sleep in allocation context
or can properly handle allocation errors.
On 40-core server with 6 ZVOL-backed LUNs and 7 iSCSI client connections
this change increases peak performance from ~700K to >1M IOPS! Yay! :)
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Previously, any timeout value for which (timeout * hz) will overflow the
signed integer, will give weird results, since callout(9) routines will
convert negative values of ticks to '1'. For unsigned integer overflow we
will get sufficiently smaller timeout values than expected.
Switch from callout_reset, which requires conversion to int based ticks
to callout_reset_sbt to avoid this.
Also correct isci to correctly resolve ccb timeout.
This was based on the original work done by Eygene Ryabinkin
<rea@freebsd.org> back in 5 Aug 2011 which used a macro to help avoid
the overlow.
Differential Revision: https://reviews.freebsd.org/D1157
Reviewed by: mav, davide
MFC after: 1 month
Sponsored by: Multiplay
In this mode one head is in Active state, supporting all commands, while
another is in Standby state, supporting only minimal LUN discovery subset.
It is still incomplete since Standby state requires reservation support,
which is impossible to do right without having interlink between heads.
But it allows to run some basic experiments.
related cleanups:
- Require each driver to initalize a mutex in the scsi_low_softc that
is shared with the scsi_low code. This mutex is used for CAM SIMs,
timers, and interrupt handlers.
- Replace the osdep function switch with direct calls to the relevant
CAM functions and direct manipulation of timers via callout(9).
- Collapse the CAM-specific scsi_low_osdep_interface substructure
directly into scsi_low_softc.
- Use bus_*() instead of bus_space_*().
- Return BUS_PROBE_DEFAULT from probe routines instead of 0.
- No need to zero softcs.
- Pass 0ul and ~0ul instead of 0 and ~0 to bus_alloc_resource().
- Spell "dettach" as "detach".
- Remove unused 'dvname' variables.
- De-spl().
Tested by: no one
With command serialization used in CTL, there are no other commands to abort
when PREEMPT AND ABORT gets to run, so it is practically equal to PREEMPT.
MFC after: 1 week
For ZVOL-backed LUNs this allows to inform initiators if storage's used or
available spaces get above/below the configured thresholds.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This makes VMWare VAAI Thin Provisioning Stun primitive activate, pausing
the virtual machine, when backing storage (ZFS pool) is getting overflowed.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
This prevents BIO_DELETE requests getting stuck in the TRIM queue which
results in a panic on shutdown due to outstanding requests.
PR: 194606
Reported by: Guido Falsi
Reviewed by: mav
MFC after: 3 days
Sponsored by: Multiplay
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
This includes support for:
- Read-Write Error Recovery mode page;
- Informational Exceptions Control mode page;
- Logical Block Provisioning mode page;
- LOG SENSE command.
No real Informational Exceptions features yet. This is only a placeholder.
Sponsored by: iXsystems, Inc.
SPC-4 r2 allows to return empty defect list if the list is not supported.
We don't reallu support defect data lists, but this suppresses some errors.
MFC after: 1 week
Make this subcommand less FC-specific, reporting target and port addresses
in more generic way. Also make it report list of connected initiators in
unified way, working for both FC and iSCSI, and potentially others.
MFC after: 1 week
Queued async events handling in CAM opened race, that may lead to duplicate
AC_PATH_REGISTERED events delivery during boot. That was not happening
before r272935 because the driver was initialized later. After that change
it started create duplicate ports in CTL.
Target mode operation does not depend on the initiator mode scan process.
This change allows the target driver to attach earlier and receive some
async events (like AC_CONTRACT) that could be lost otherwise.
MFC after: 1 week
Such LUNs will be visible to initiators, but return "not ready" status
on media access commands. If backing storage become available later,
`ctladm modify ...` or `service ctld reload` can trigger its reopen.
It allows to push out some final data from the send queue to the socket
before its close. In particular, it increases chances for logout response
to be delivered to the initiator.
Before this change target could send R2T request for write transfer of any
size, that could violate iSCSI RFC, which allows initiator to limit maximum
R2T size by negotiating MaxBurstLength connection parameter.
Also report an error in case of write underflow, when initiator provides
less data than initiator expects. Previously in such case our target
sent R2T request for non-existing data, violating the RFC, and confusing
some initiators. SCSI specs don't explicitly define how write underflows
should be handled and there are different oppinions, but reporting error
is hopefully better then violating iSCSI RFC with unpredictable results.
MFC after: 2 weeks
SPC-2 tells REPORT LUNS shall be supported by devices supporting LUNs other
then LUN 0. If we see LUN 0 disconnected, guess there may be others, and
so REPORT LUNS shall be supported.
MFC after: 1 month
Previous logic was not differentiating disconnected LUNs and absent targets.
That made it to stop scan if LUN 0 was not found for any reason. That made
problematic, for example, using iSCSI targets declaring SPC-2 compliance and
having no LUN 0 configured.
The new logic continues sequential LUN scan if:
-- we have more configured LUNs that need recheck;
-- this LUN is connected and its SCSI version allows more LUNs;
-- this LUN is disconnected, its SCSI version allows more LUNs and we
guess they may be connected (we haven't scanned first 8 LUNs yet or
kern.cam.cam_srch_hi sysctl is set to scan more).
Reported by: trasz
MFC after: 1 month
or I_T NEXUS LOSS, clear all minor UAs for the LUN, redundant in this case.
All SAM specifications tell that target MAY do it, but libiscsi initiator
seems require it to be done, terminating connection with error if some more
UAs happen to be reported during iSCSI connection.
MFC after: 3 days
Using pointer from the cdev directly is dangerous since we have no reference
on it, and it may change any time. That caused panic if device has gone.
While there, report capacity change only if it really changed.
MFC after: 3 days
Without clustering support we any way have only one group of permanently
active ports, but that gives us one more supported VMWare feature. ;)
Solaris' Comstar also reports it even when only one port is present.
devq_openings counter lost its meaning after allocation queues has gone.
held counter is still meaningful, but problematic to update due to separate
locking of CCB allocation and queuing.
To fix that replace devq_openings counter with allocated counter. held is
now calculated on request as difference between number of allocated, queued
and active CCBs.
MFC after: 1 month
It allows to bypass range checks between UNMAP and READ/WRITE commands,
which may introduce additional delays while waiting for UNMAP parameters.
READ and WRITE commands are always processed in safe order since their
range checks are almost free.
Before this change UNMAP completely blocked other I/Os while running.
Now it blocks only colliding ones, slowing down others only due to ZFS
locks collisions.
Sponsored by: iXsystems, Inc.
None of existing STEC devices need UNMAP or even support it well, having
many limitations and even hanging sometimes executing those commands.
New devices that may use UNMAP going to be released under HGST name.
MFC after: 3 days
kern.cam.ctl.iscsi.ping_timeout to 0. This fixes interoperability with
some initiators that don't properly support NOP-Ins, namely iPXE/gPXE.
Submitted by: Chen Wen <pokkys@gmail.com>
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
At this moment it works only for files and ZVOLs in device mode since BIOs
have no respective respective cache control flags (DPO/FUA).
MFC after: 1 month
Sponsored by: iXsystems, Inc.
that's ATAPI specific. Instead, skip to PROBE_SET_MULTI instead for
non ATAPI protocols. The prior code incorrectly terminated the probe
with a break, rather than arranging for probedone to get called. This
caused panics or worse on some systems.
that's only mostly similar. Specifically word 78 bits are defined for
IDENTIFY DEVICE as
5 Supports Hardware Feature Control
while a IDENTIFY PACKET DEVICE defines them as
5 Asynchronous notification supported
Therefore, only pay attention to bit 5 when we're talking to ATAPI
devices (we don't use the hardware feature control at this time).
Ignore it for ATA devices. Remove kludge that papered over this issue
for Samsung SATA SSDs, since Micron drives also have the bit set and
the error was caused by this bad interpretation of the spec (which is
quite easy to do, since bits aren't normally overlapping like this).
sizeof(struct scsi_inquiry_data) of 256 bytes combined with off-by-one
error in the changed code gave total INQUIRY data length above 255 bytes,
that was maximal INQUIRY length in SPC-2. While SPC-3 increased the
maximal length to 64K, at least sg3_utils are still confused by that.
MFC after: 1 week
This allows to avoid extra network traffic when copying files on NTFS iSCSI
disks within one storage host by drag'n'dropping them in Windows Explorer
of Windows 8/2012. It should also accelerate Hyper-V VM operations, etc.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
left some of the decisions based on its counterpart, SA_CCB_BUFFER_IO
being random. As a result, propagation of the residual information
for the SPACE command was broken, so the number of filemarks
encountered during a SPACE operation was miscalculated. Consequently,
systems relying on properly tracked filemark counters (like Bacula)
fell apart.
The change also removes a switch/case in sadone() which r256843
degraded to a single remaining case label.
PR: 192285
Approved by: ken
MFC after: 2 weeks
Unlike disk devices ZVOLs process all requests synchronously. That makes
impossible sending multiple requests to them from single thread. From the
other side ZVOLs have real d_read/d_write methods, which unlike d_strategy
can handle uio scatter/gather and have no strict I/O size limitations.
So, if ZVOL in "dev" mode is detected, use of d_read/d_write methods instead
of d_strategy allows to avoid pointless splitting of large requests into
MAXPHYS (128K) sized chunks.
MFC after: 1 week
link_elf_obj: symbol icl_pdu_new_bhs undefined
PR: 192031
Submitted by: Nils Beyer (earlier version)
MFC after: 3 days
Sponsored by: FreeBSD Foundation
After I gave each iSCSI target its own port, the old limit appeared to be
not so big. This change almost proportionally increases per-LUN memory
use, but it is still three times better then it was before r268807.
MFC after: 2 weeks
CTL never had use for CA support code since SPI has gone, and there is no
even frontends supporting that. But it still was reserving 256 bytes of
memory per LUN per every possible initiator on every possible port.
Wrap unused code with ifdef's in case somebody even need it.
MFC after: 2 weeks
This allows to clone VMs and move them between LUNs inside one storage
host without generating extra network traffic to the initiator and back,
and without being limited by network bandwidth.
LUNs participating in copy operation should have UNIQUE NAA or EUI IDs set.
For LUNs without these IDs VMWare will use traditional copy operations.
Beware: the above LUN IDs explicitly set to values non-unique from the VM
cluster point of view may cause data corruption if wrong LUN is addressed!
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
That should make operation more kind to multi-initiator environment.
Without this, other initiators may find out that something bad happened
to their commands only via command timeout.
Testing shown that both original queued design with separate task queue,
and recent direct execution design had significant flaw: If abort request
arrives just after the victim, the last one may not be in the ooa_queue
yet, and so invisible for the task management function.
Unlike original queued implementation, use same queue for all SCSI and
TASK requests from the same initiator. That avoids races between them:
task functions are always executed in proper time, relatively to other
requests.
If port passed negative IID value, the function will try to allocate IID
from the pool of unused, based on passed wwpn or name arguments. It does
all its best to make IID unique and persistent across reconnects.
This makes persistent reservation properly work for iSCSI. Previously,
in case of reconnects, reservation could be unexpectedly lost, or even
migrate between intiators.
teardown, and new port creation during `service ctld restart`.
Close it by returning iSCSI port internal state, that allows to identify
dying ports, which should not be counted as existing, from really alive.
Instead make ports provide wanted port and target IDs, and LUNs provide
wanted LUN IDs. After that core Device ID VPD code only had to link all
of them together and add relative port and port group numbers.
LUN ID for iSCSI LUNs no longer created by CTL, but by ctld, and passed
to CTL as "scsiname" LUN option. This makes LUNs to report the same set
of IDs, independently from the port through which it is accessed, as
required by SCSI specifications.
Having single port for all iSCSI connections makes problematic implementing
some more advanced SCSI functionality in CTL, that require proper ports
enumeration and identification.
This change extends CTL iSCSI API, making ctld daemon to control list of
iSCSI ports in CTL. When new target is defined in config fine, ctld will
create respective port in CTL. When target is removed -- port will be
also removed after all active commands through that port properly aborted.
This change require ctld to be rebuilt to match the kernel.
As a minor side effect, this allows to have iSCSI targets without LUNs.
While that may look odd and not very useful, that is not incorrect.
Before iSCSI implementation CTL had no knowledge about frontend drivers,
it had only frontends, which really were ports (alike to LUNs, if comparing
to backends). But iSCSI added there ioctl() method, which does not belong
to frontend as a port, but belongs to a frontend driver.
camcontrol(8) now supports a new 'persist' subcommand that allows users to
issue SCSI PERSISTENT RESERVE IN / OUT commands.
sbin/camcontrol/Makefile:
Add persist.c.
sbin/camcontrol/persist.c:
New persistent reservation support for camcontrol(8).
We have support for all known operation modes for PERSISTENT RESERVE
IN and PERSISTENT RESERVE OUT.
exceptions noted above.
sbin/camcontrol/camcontrol.8:
Document the new 'persist' subcommand.
In the section on the Transport ID (-I) option, explain what
Transport IDs for each protocol should look like. At some point
some of this information could probably get moved off in a
separate man page, either on Transport IDs alone or a man page
documenting the Transport ID parsing code.
Add a number of examples of persistent reservation commands.
Persistent Reservations are complex enough that the average user
probably won't be able to get the commands exactly right by just
reading the man page. These examples show a few basic and
advanced examples of how to use persistent reservations.
sbin/camcontrol/camcontrol.h:
Move the definition for camcontrol_optret here, so we can use it
for the persistent reservation code.
Add a definition for the new scsipersist() function.
sbin/camcontrol/camcontrol.c:
Add 'persist' to the list of subcommands.
Document 'persist' in the help text.
sys/cam/scsi/scsi_all.c:
Add the scsi_persistent_reserve_in() and
scsi_persistent_reserve_out() CCB building functions.
Add a new function, scsi_transportid_sbuf(). This takes a
SCSI Transport ID (documented in SPC-4), and prints it to
an sbuf(9). There are some transports (like ATA, USB, and
SSA) for which there is no transport defined. We need to
come up with a reasonable thing to do if we're presented
with a Transport ID that claims to be for one of those
protocols.
Add new routines scsi_get_nv() and scsi_nv_to_str().
These functions do a table lookup to go between a string and an
integer. There are lots of table lookups needed in the
persistent reservation code in camcontrol(8).
Add a new function, scsi_parse_transportid(), along with leaf node
functions to parse:
FC, 1394 and SAS (scsi_parse_transportid_64bit())
iSCSI (scsi_parse_transportid_iscsi())
SPI (scsi_parse_transportid_spi())
RDMA (scsi_parse_transportid_rdma())
PCIe (scsi_parse_transportid_sop())
Transport IDs. Given a string with the general form proto,id these
functions create a SCSI Transport ID structure.
sys/cam/scsi/scsi_all.h:
Update the various persistent reservation data structures to
SPC4r36l, but also rename some fields that were previously
obsolete with the proper names from older SCSI specs. This
allows using older, obsolete persistent reservation types when
desired.
Add function prototypes for the new persistent reservation CCB
building functions.
Add a data strucure for the READ FULL STATUS service action
of the PERSISTENT RESERVE IN command.
Add Transport ID structures for all protocols described in SPC-4.
Add a new series of SCSI_PROTO_XXX definitions, and
redefine other defines in terms of these new definitions.
Add a prototype for scsi_transportid_sbuf().
Change a couple of "obsolete" persistent reservation data
structure fields into something more meaningful, based on
what the field was called when it was defined in the spec.
(e.g. SPC, SPC-2, etc.)
Create a new define, SPRI_MAX_LEN, for the maximum allocation
length allowed for the PERSISTENT RESERVE IN command.
Add data structures and enumerations for the new name/value
translation functions.
Add data structures for SCSI over PCIe Routing IDs.
Bring the PERSISTENT RESERVE OUT Register and Move parameter list
structure (struct scsi_per_res_out_parms) up to date with SPC-4.
Add a data structure for the transport IDs that can optionally be
appended to the basic PERSISTENT RESERVE OUT parameter list.
Move SCSI protocol macro definitions out of the VPD page 0x83
definition and combine them with the more up to date protocol
definitions higher in the file.
Add function prototypes for scsi_nv_to_str(), scsi_get_nv(),
scsi_parse_transportid_64bit(), scsi_parse_transportid_spi(),
scsi_parse_transportid_rdma(), scsi_parse_transportid_iscsi(),
scsi_parse_transportid_sop(), and scsi_parse_transportid().
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
requests on the trim_queue, even for the CFA ERASE. This allows us, in
the future, to collapse adjacent requests. Since CFA ERASE is only for
CF cards, and it is so restrictive in what it can do, the collapse
code is not presently here. This also brings the ada driver more in
line with the da driver's treatment of BIO_DELETEs.
Reviewed by: mav@
For every supported command define CDB length and mask of bits that are
allowed to be set. This allows to remove bunch of checks through the code
and still make the validation more strict. To properly do it for commands
supporting multiple service actions, formalize their parsing by adding
subtables for each of such commands.
As visible effect, this change allows to add support for REPORT SUPPORTED
OPERATION CODES command, reporting to client all the data about supported
SCSI commands, except timeouts.
MFC after: 2 weeks
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Instead of trying to guess size of disk I/O operations (it just won't work
that way for newly added commands, and is equal to data move size for old
ones), account data move traffic. If disk I/Os are that interesting, then
backends have to account and provide that information.
Block backend already exports the information about disk I/Os via devstat,
so having it here too is excessive.
MFC after: 2 weeks
This gives some use to 512KB per-LUN buffers, allocated for Copan-specific
processor code and not used. It allows, for example, to test transport
performance and/or correctness without accessing the media, as supported
by Linux version of sg3_utils.
MFC after: 2 weeks
Split global ctl_lock, historically protecting most of CTL context:
- remaining ctl_lock now protects lists of fronends and backends;
- per-LUN lun_lock(s) protect LUN-specific information;
- per-thread queue_lock(s) protect request queues.
This allows to radically reduce congestion on ctl_lock.
Create multiple worker threads, depending on number of CPUs, and assign
each LUN to one of them. This allows to spread load between multiple CPUs,
still avoiging congestion on queues and LUNs locks.
On 40-core server, exporting 5 LUNs, each backed by gstripe of SATA SSDs,
accessed via 6 iSCSI connections, this change improves peak request rate
from 250K to 680K IOPS.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While for FreeBSD client that is only a minor optimization, VMWare client
doesn't support additional data requests after all data being sent once as
immediate.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
From one side it allows to remove CTL_FLAG_TASK_PENDING flag, handling of
which significantly complicates fine-grained locking. From the other side
it reduces task management requests latency even below then that flag could.
As downside, it denies task management code to sleep, but that is not needed
any way now.
Discussed with: ken
This should allow to abort commands doing mostly disk I/O, such as VERIFY
or WRITE SAME. Before this change CTL_FLAG_ABORT was only checked around
data moves, which for these commands may not happen for a very long time.
MFC after: 2 weeks
SPC-4 recommends T10 vendor ID based LUN ID was created by concatenating
product name and serial number (and istgt follows that). But product name
is 16 bytes long by itself, so 16 bytes total length is clearly not enough
to fit both.
To keep compatibility with existing configurations, pad short device IDs
to old length of 16, same as before.
This change probably breaks CTL user-level ABI, so control tools should
be rebuilt after this change.
MFC after: 2 weeks
we're now back to the pre-r228483 level of default verbosity. This in
turn again typically allows for reading information that userland might
have printed on the screen before initiating a halt, but still permits
to debug potential device shutdown problems on system shutdown via
CAM_DEBUG etc.
Reviewed by: mav
MFC after: 3 days
Sponsored by: Bally Wulff Games & Entertainment GmbH
for any outstanding commands to be properly aborted by CTL.
Without it, in some cases (such as files backing the LUNs
stored on failing disk drives), terminating a busy session
would result in panic.
Reviewed by: mav@ (earlier version)
Sponsored by: The FreeBSD Foundation
Make data_submit backends method support not only read and write requests,
but also two new ones: verify and compare. Verify just checks readability
of the data in specified location without transferring them outside.
Compare reads the specified data and compares them to received data,
returning error if they are different.
VERIFY(10/12/16) commands request either verify or compare from backend,
depending on BYTCHK CDB field. COMPARE AND WRITE command executed in two
stages: first it requests compare, and then, if succeesed, requests write.
Atomicity of operation is guarantied by CTL request ordering code.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
trims to the device assumes the list is sorted. Don't apply the
optimization of not sorting the queue when we have SSDs to the
delete_queue, since it causes more discard traffic to the drive. While
one could argue that the higher levels should coalesce the trims,
that's not done today, so some optimization at this level is needed.
CR: https://phabric.freebsd.org/D142
Make it really work for native FreeBSD programs. Before this it was broken
for years due to different number of pointer dereferences in Linux and
FreeBSD IOCTL paths, permanently returning errors to FreeBSD programs.
This change breaks the driver FreeBSD IOCTL ABI, making it more strict,
but since it was not working any way -- who bother.
Add shims for 32-bit programs on 64-bit host, translating the argument
of the SG_IO IOCTL for both FreeBSD and Linux ABIs.
With this change I was able to run 32-bit Linux sg3_utils tools and simple
32 and 64-bit FreeBSD test tools on both 32 and 64-bit FreeBSD systems.
MFC after: 1 month
Nobody yet reported disk supporting I/Os less then our MAXPHYS value, but
since we any way have code to read Block Limits VPD page, that is easy.
MFC after: 2 weeks
Instead of rereading VPD pages on every device open, do it only on initial
device probe, and in cases when device reported via UNIT ATTENTIONs that
something has changed. Capacity is still rereaded on every open because
it is more critical for operation and more probable to change in run time.
On my tests with Intel 530 SSDs on mps(4) HBA this change reduces time
GEOM needs to retaste the device (that includes few open/close cycles)
from ~150ms to ~30ms.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Instead of allocating up to 16MB or RAM at once to handle whole I/O,
allocate up to 1MB at a time, but do multiple ctl_datamove() and storage
I/Os if needed.
Unfortunately we can't check range collisions for UNMAP commands alike
to writes, because they include multiple ranges, which are also passed
in data block, not in CDB. As result, UNMAP commands have to be treated
as colliding with any other command accessing the media.
From the other side all UNMAPs are equal (we don't support ANCHOR flag),
so we can execute several UNMAPs same time.
This code was heavily broken few months ago during CAM locking changes.
Fixing it would require almost complete rewrite. Since there are no
known devices on market using this interface younger then ~15 years, and
they are CD, not even DVD, I don't see much reason to rewrite it.
This change does not mean those devices won't work. They will just work
slower due to inefficient disks load/unload schedule if several LUNs
accessed same time.
Discussed with: ken@
Silence on: scsi@, hardware@
MFC after: 1 week
This patch adds support for three new SCSI commands: UNMAP, WRITE SAME(10)
and WRITE SAME(16). WRITE SAME commands support both normal write mode
and UNMAP flag. To properly report UNMAP capabilities this patch also adds
support for reporting two new VPD pages: Block limits and Logical Block
Provisioning.
UNMAP support can be enabled per-LUN by adding "-o unmap=on" to `ctladm
create` command line or "option unmap on" to lun sections of /etc/ctl.conf.
At this moment UNMAP supported for ramdisks and device-backed block LUNs.
It was tested to work great with ZFS ZVOLs. For file-backed LUNs UNMAP
support is unfortunately missing due to absence of respective VFS KPI.
Reviewed by: ken
MFC after: 1 month
Sponsored by: iXsystems, Inc
This avoids extra locking in icl_pdu_queue(); the upper layer needs to call
it while holding its own lock anyway, to avoid sending PDUs out of order.
Sponsored by: The FreeBSD Foundation
them for actual target errors. They can be enabled back by setting
kern.cam.ctl.verbose=1, or booting with bootverbose.
Sponsored by: The FreeBSD Foundation
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
- Logical sector size is measured in words, not bytes.
- If physical sector is not bigger then logical sector, it does not mean
it should be set equal to 512 bytes, but set to logical sector.
PR: misc/187269
Submitted by: Ravi Pokala <rpokala@panasas.com>
MFC after: 1 week